Pseudo relevance feedback (PRF) uses context defined by the top-ranked documents of an initial retrieval to improve a subsequent retrieval. Most prior research has been done in vector space, probabilistic, and language modeling [13, 16, 23] retrieval models.
Recently the field has moved to Transformer-based rerankers that are more accurate but also more computationally complex. Most Transformer-based rerankers learn contextualized representations from query-document pairs, but they have two limitations. First, the query-document pair provides limited context for query understanding. Second, most Transformers have computational complexity quadratic in the input sequence length, rendering longer context infeasible.
To overcome these limitations, we propose a PRF method using a graph-based Transformer (PGT). PGT constructs a graph of the query, the candidate document, and the feedback documents. It uses intra-node attention to contextualize the query according to each individual document, and it uses inter-node attention to aggregate information. With the graph approach, PGT can utilize richer relevance context using a configurable number of feedback documents. Its inter-node attention is sparsified, so it also saves computation.
This paper makes two contributions to the study of pseudo relevance feedback in Transformer architectures. First, it investigates several ways of using PRF documents as context for Transformer rerankers. It shows that PGT improves upon non-PRF Transformer rerankers, and that PGT is at least as accurate as Transformer PRF models that use full attention, while reducing computation. Second, it studies the impact of contextual interactions by adjusting the configuration of the graph. It shows that token-level interaction between the query and feedback documents is critical, while document-level interaction is sufficient to aggregate information from multiple documents.
2 Related Work
Pseudo relevance feedback is a well-studied method of generating more effective queries. Typically pseudo relevance feedback uses the top-ranked documents to add query terms and set query term weights. Well-known methods include Rocchio, BM25 expansion, relevance models, and KL expansion models [16, 23]. A large body of work studies which documents to use for expansion. Most methods were designed for discrete bag-of-words representations.
Recent research also studies PRF in neural networks. Li et al. present a neural PRF framework that uses a feed forward network to combine the relevance scores of feedback documents. Only marginal improvement was observed over simple score summation, indicating that the framework does not make the best use of the feedback documents’ information.
Recently, pre-trained Transformer language models, such as BERT, have improved the state of the art for ad hoc retrieval. Most Transformer-based rerankers are applied to individual query-document pairs. Some research explores jointly modeling multiple top retrieved documents in a Transformer architecture for question clarification, question answering [10, 14], or code generation. The effectiveness of using top retrieved documents in Transformer rerankers remains to be studied.
While Transformer-based architectures have achieved state-of-the-art results in multiple natural language tasks, the original self-attention mechanism incurs computational complexity quadratic in the length of the input sequence. Therefore, much recent work studies sparsifying Transformer attention [1, 2, 24]. Among these models, Transformer-XH features an underlying graph structure, where each node represents a text sequence, which makes it a good candidate for multi-sequence tasks such as PRF.
Transformer-XH employs full attention within each sequence, but it sparsifies inter-sequence attention. Specifically, for each document sequence, each encoder layer calculates the intra-sequence, token-level attention using standard self-attention. Inter-sequence, document-level attention is calculated using the hidden representations of each sequence's first token, [CLS]: the [CLS] token of a sequence attends to the [CLS] tokens of its neighboring document sequences in the graph. This allows the [CLS] token to carry context from other neighboring sequences. Such information is propagated to other tokens in the sequence through the intra-sequence attention in the next layer. Hence Transformer-XH outputs a condensed representation that contains both the global graph-level information and the local sequence-level information.
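The inter-sequence attention described above can be written, as a hedged sketch in the usual scaled dot-product form, as follows. The notation is assumed here for illustration and is not necessarily the original notation: $s$ is a sequence, $\mathcal{N}(s)$ its neighbors in the graph, index $0$ the [CLS] position, $l$ the layer, $d_k$ the key dimension, and $\hat{q}$, $\hat{k}$, $\hat{v}$ the query, key, and value projections of the [CLS] hidden states.

```latex
\hat{h}^{\,l}_{s,0}
  \;=\; \sum_{s' \in \mathcal{N}(s)}
  \operatorname{softmax}_{s'}\!\left(
    \frac{\hat{q}^{\,l\,\top}_{s,0}\,\hat{k}^{\,l}_{s',0}}{\sqrt{d_k}}
  \right)\hat{v}^{\,l}_{s',0}
```

Because only one position per sequence takes part, the cost of this step grows with the square of the number of sequences rather than the square of the total number of tokens.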
3 Proposed Method
We propose PGT, a PRF reranker with a graph-based Transformer. Given a query, a candidate document, and a set of feedback documents retrieved by a first-stage retrieval algorithm, the goal is to predict the relevance score of the candidate document by aggregating information from the feedback documents. To achieve this goal, PGT adopts the Transformer-XH architecture and builds a graph of the query, the candidate document, and the feedback documents. Figure 1 illustrates the graph.
PGT has two types of nodes. The feedback document nodes contextualize the query using feedback documents. As shown in Fig. 1 (right), the input to a feedback document node is the text of the feedback document, with the query and the candidate document prepended in order to extract information specific to predicting the relevance between the query and the candidate. The input text sequence is fed into a Transformer module with standard token-level self-attention. To distinguish different parts of the input, we associate segment id 0 with the query and the candidate, and 1 with the feedback document. In addition to the feedback document nodes, PGT also adopts a special node for the query-candidate pair. The input of this node is the concatenation of the query and candidate document, which constitutes a typical input sequence to existing Transformer-based rerankers. We hypothesize that the query-candidate node will help the model focus more on the query-candidate pair.
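The node inputs described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace splitting stands in for BERT's WordPiece tokenizer, the placement of [SEP] tokens and the segment ids of the query-candidate node are assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[List[str], List[int]]  # (tokens, segment ids)


def feedback_node(query: str, candidate: str, feedback: str,
                  max_len: int = 128) -> Node:
    """Feedback document node: query and candidate prepended (segment 0),
    followed by the feedback document (segment 1)."""
    head = ["[CLS]"] + query.split() + ["[SEP]"] + candidate.split() + ["[SEP]"]
    tail = feedback.split() + ["[SEP]"]
    tokens = head + tail
    segments = [0] * len(head) + [1] * len(tail)
    # Simple tail truncation to the maximum node length (an assumption).
    return tokens[:max_len], segments[:max_len]


def pair_node(query: str, candidate: str, max_len: int = 128) -> Node:
    """Query-candidate node: the usual reranker input.
    Segment 0 for the query and 1 for the candidate is assumed here."""
    head = ["[CLS]"] + query.split() + ["[SEP]"]
    tail = candidate.split() + ["[SEP]"]
    tokens = head + tail
    segments = [0] * len(head) + [1] * len(tail)
    return tokens[:max_len], segments[:max_len]


def build_graph_inputs(query: str, candidate: str,
                       feedback_docs: List[str],
                       max_len: int = 128) -> List[Node]:
    """One query-candidate node plus one node per feedback document."""
    nodes = [pair_node(query, candidate, max_len)]
    nodes += [feedback_node(query, candidate, f, max_len) for f in feedback_docs]
    return nodes
```

Each node is encoded independently by token-level self-attention, so adding a feedback document adds one node rather than lengthening a single input sequence.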
PGT aggregates sequence-level information through inter-sequence attention. Within the sequence, the Transformer encodes the [CLS] token to represent the whole sequence (Fig. 1 right). Between the sequences, all [CLS] tokens attend to each other to gather information from other sequences (Fig. 1 left). We follow Zhao et al. and incorporate inter-sequence attention in the last three Transformer encoder layers. The model is trained on a binary relevance classification task using cross-entropy loss, and it predicts the final relevance score using a weighted sum of all the [CLS] representations.
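A minimal sketch of the two aggregation steps just described is given below. It is an illustration under stated assumptions rather than the model's actual code: the learned query, key, and value projections are omitted (identity projections), every [CLS] vector attends to all nodes including itself, and `score_w` and `gate_w` are hypothetical parameter vectors for the per-node score and the per-node weight.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cls_hop_attention(cls_vectors: List[Vector]) -> List[Vector]:
    """Each node's [CLS] vector attends to the [CLS] vectors of all nodes.
    Assumption: identity projections and a fully connected graph."""
    d_k = len(cls_vectors[0])
    updated = []
    for q in cls_vectors:
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in cls_vectors])
        updated.append([
            sum(w * v[j] for w, v in zip(weights, cls_vectors))
            for j in range(d_k)
        ])
    return updated


def relevance_score(cls_vectors: List[Vector],
                    score_w: Vector, gate_w: Vector) -> float:
    """Weighted sum over nodes: each node yields a score and a weight."""
    node_scores = [dot(v, score_w) for v in cls_vectors]
    node_weights = softmax([dot(v, gate_w) for v in cls_vectors])
    return sum(w * s for w, s in zip(node_weights, node_scores))
```

With three nodes the hop attention compares 3 x 3 [CLS] pairs, independent of how many tokens each node contains.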
4 Experimental Setup
This section describes our datasets, baselines and other experimental settings.
Experiments were done with the MS MARCO Passage Ranking task dataset. It contains about 8.8 million passages and about 0.5 million queries with relevance judgments as training data. Each query has an equal number of relevant and non-relevant passages. We used the official evaluation query set from the TREC 2019 Deep Learning Track. It contains 43 test queries manually annotated by NIST on a four-point scale. On average, a query has 95 relevant documents. We report NDCG@10, MAP@10, and MAP@100.
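For reference, the reported metrics can be computed per query as sketched below. This is a hedged sketch, not the official evaluation tool: it assumes linear gain with a log2 rank discount for NDCG, treats grades of 2 or higher on the four-point scale as relevant for average precision, and normalizes average precision by the total number of relevant documents. All of these are assumptions that should be checked against the evaluation script actually used.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranking: List[str], grades: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain and log2 discount (assumed conventions)."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [grades.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision_at_k(ranking: List[str], grades: Dict[str, int],
                           k: int, min_grade: int = 2) -> float:
    """AP@k for one query; min_grade=2 is an assumed relevance threshold."""
    relevant = {doc for doc, g in grades.items() if g >= min_grade}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Dict[str, int]], k: int) -> float:
    """MAP@k: mean of AP@k over all judged queries."""
    aps = [average_precision_at_k(runs.get(q, []), qrels[q], k) for q in qrels]
    return sum(aps) / len(aps) if aps else 0.0
```

With about 95 relevant documents per query on average, MAP@10 is bounded well below 1 under this normalization, which is consistent with the small MAP@10 values in Table 1.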
We compare PGT to initial rankers, a non-PRF reranker, and PRF models.
CRM (initial ranker): This model combines BM25's lexical retrieval and BERT's dense embedding retrieval. It performs significantly better than BM25 on our dataset. We used the rankings provided by Gao et al.
BERT reranker (non-PRF reranker): This is a standard BERT reranker, whose input is the concatenated sequence of the query and the candidate document. We trained the model following Nogueira and Cho.
BERT PRF (PRF): This is the same as BERT reranker except that we concatenate the query, the candidate document, and the feedback documents to form a PRF input sequence, with documents separated by [SEP]. Limited by the input length constraint of BERT, we used 5 feedback documents. As for PGT, we used segment id 0 for the query and the candidate, and 1 for the feedback documents.
4.3 PGT Graph Variants
Modeling queries and documents in a graph gives control over how representations are contextualized. We examined 5 graph variants to study this effect.
PGT base is the graph described in Section 3. The query is first contextualized by the candidate and feedback document at the token level. Feedback information is then aggregated following the graph structure. The query-candidate node emphasizes the query and the candidate at the graph level. This variant has the richest context.
PGT w/o pre removes the prepended candidate from the feedback document nodes, so each query is only contextualized by the feedback document at the sequence level.
PGT w/o pre removes both the prepended query and the prepended candidate from the feedback nodes. Each feedback document hence only contextualizes the query at the graph level.
PGT w/o node removes the candidate from the query-candidate node, so only the query is emphasized again at the graph level.
PGT w/o node removes the query-candidate node from the graph, so the query and the candidate are not emphasized again at the graph level.
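The five variants differ only in what text each node receives and whether the query-candidate node exists, which can be sketched as a small configuration table. The variant keys, the `GraphConfig` fields, and `node_texts` are hypothetical names introduced for this illustration; they are not identifiers from the paper or from the Transformer-XH code.

```python
from typing import Dict, List, NamedTuple, Optional


class GraphConfig(NamedTuple):
    prepend_query: bool       # prepend the query to each feedback node
    prepend_candidate: bool   # prepend the candidate to each feedback node
    pair_node: Optional[str]  # "query+candidate", "query", or None


# Hypothetical keys, listed in the same order as the variants above.
VARIANTS: Dict[str, GraphConfig] = {
    "base": GraphConfig(True, True, "query+candidate"),
    "no_prepended_candidate": GraphConfig(True, False, "query+candidate"),
    "no_prepended_text": GraphConfig(False, False, "query+candidate"),
    "query_only_pair_node": GraphConfig(True, True, "query"),
    "no_pair_node": GraphConfig(True, True, None),
}


def node_texts(variant: str, query: str, candidate: str,
               feedback_docs: List[str]) -> List[List[str]]:
    """Return, for each node in the graph, the text parts it is built from."""
    cfg = VARIANTS[variant]
    nodes: List[List[str]] = []
    if cfg.pair_node == "query+candidate":
        nodes.append([query, candidate])
    elif cfg.pair_node == "query":
        nodes.append([query])
    for doc in feedback_docs:
        parts: List[str] = []
        if cfg.prepend_query:
            parts.append(query)
        if cfg.prepend_candidate:
            parts.append(candidate)
        parts.append(doc)
        nodes.append(parts)
    return nodes
```

Expressing the variants as flags makes explicit which interactions happen at the token level (text sharing a node) and which happen only at the graph level (text in different nodes).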
4.4 Training and Evaluation
We implement PGT based on the Transformer-XH PyTorch implementation. The parameters for the intra-sequence attention are initialized from a pre-trained BERT base model, and those for the inter-sequence attention are initialized with Xavier initialization (Glorot and Bengio). We train the model for 2 epochs, with a per-GPU batch size of 4 on 2 GPUs. The maximum node sequence length is 128, and the learning rate is 5e-6 with linear decay.
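The learning rate schedule can be illustrated with a short sketch. It assumes that "linear decay" means decaying linearly to zero over all training steps with no warmup, and that there is no gradient accumulation, so the effective batch size is the per-GPU batch size times the number of GPUs; these details are assumptions, and both function names are hypothetical.

```python
import math


def total_training_steps(num_examples: int, epochs: int = 2,
                         per_gpu_batch: int = 4, num_gpus: int = 2) -> int:
    """Number of optimizer steps, assuming no gradient accumulation."""
    effective_batch = per_gpu_batch * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return epochs * steps_per_epoch


def linear_decay_lr(step: int, total_steps: int, base_lr: float = 5e-6) -> float:
    """Learning rate at a given step: base_lr decayed linearly to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining
```

In PyTorch the same shape is typically obtained by passing a multiplier function to a lambda-based scheduler, but the pure-Python version above is enough to show the schedule.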
We train both BERT PRF and PGT using feedback documents from BM25. In order to test how Transformer-based PRF models generalize when different initial rankers are used, we evaluate them using both BM25 and CRM. We follow prior research [7, 18] and report the results at each model’s best reranking depth (Table 1).
5 Experimental Results
PRF vs. non-PRF Transformers. We study the effectiveness of PRF in Transformer-based models by comparing PGT and BERT PRF with BERT reranker. Table 1 shows that all PRF Transformers outperform BERT reranker on MAP@10 using either initial ranker. In particular, PGT achieves higher MAP@10 than BERT reranker on both BM25 and CRM, with comparable NDCG@10. The results suggest that the richer context provided by PRF helps Transformers rank relevant documents to the very top.
PRF enables Transformers to exploit high-quality initial rankings better. Comparing BM25 and CRM results in Table 1, we found that when the initial ranker is stronger, PGT achieves the best performance across all metrics, closely followed by BERT PRF. In comparison, BERT reranker cannot make the best of the initial retrieval of CRM, as reported by prior research.
PGT vs. BERT PRF. While PGT's rankings are at least as good as BERT PRF's, PGT is more computationally efficient. Using the same number of feedback documents for a fair comparison, we counted the multiplication and addition operations. PGT requires fewer operations per input example than BERT PRF. In addition, PGT requires a smaller reranking depth (Table 1). Using BM25 as the initial ranker, the overall computational cost is hence only a fraction of BERT PRF's.
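The source of the savings can be illustrated with a rough count of attention pairs per layer. This sketch is not the operation count used in the experiments: it counts only token pairs in the attention score computation, ignores the projection and feed-forward layers (which scale linearly with the number of tokens), assumes a 512-token input for the concatenated model, and assumes every [CLS] token attends to every node in the layers with inter-sequence attention.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Token pairs scored by full self-attention over one long sequence."""
    return seq_len * seq_len


def graph_attention_pairs(num_nodes: int, node_len: int,
                          with_hop: bool = True) -> int:
    """Token pairs scored by a graph of short sequences:
    full attention inside each node, plus [CLS]-to-[CLS] pairs across nodes."""
    intra = num_nodes * node_len * node_len
    inter = num_nodes * num_nodes if with_hop else 0
    return intra + inter


# Assumed setting: 5 feedback documents, so 6 nodes of at most 128 tokens,
# compared with one concatenated sequence of 512 tokens.
concatenated = full_attention_pairs(512)
graph = graph_attention_pairs(num_nodes=6, node_len=128)
ratio = graph / concatenated
```

Under these assumptions the graph scores fewer attention pairs than the concatenated input, and the gap widens as more feedback documents are added, because the intra-node term grows linearly with the number of nodes.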
Compared with BERT PRF, PGT allows flexible configuration of the graph structure (Table 1). As discussed in Section 4.3, the graph structure controls how relevance context flows across the graph. Contrary to our initial intuition, removing the query-candidate node partially or entirely (the two PGT w/o node variants) achieves the best results among all graph variants. The query is an impoverished description of the information need compared to the feedback documents, which may explain why the comparison of the query to the candidate is less useful than comparisons between the candidate and high-quality feedback documents.
The number of feedback documents is a parameter that is usually tuned. BERT's self-attention mechanism restricts the input sequence length, limiting BERT to 5 feedback documents on our dataset. PGT has no such restriction. Our experiments use a larger number of feedback documents for PGT because it is more effective (Table 2).
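| Model | NDCG@10 (BM25) | MAP@10 (BM25) | MAP@100 (BM25) | Depth (BM25) | NDCG@10 (CRM) | MAP@10 (CRM) | MAP@100 (CRM) | Depth (CRM) |
|---|---|---|---|---|---|---|---|---|
| RM3 | 0.5180 | 0.1192 | 0.3370 | 1K | – | – | – | – |
| PGT w/o pre | 0.6693 | 0.1523 | 0.3563 | 500 | 0.7146 | 0.1658 | 0.4194 | 20 |
| PGT w/o pre | 0.6676 | 0.1468 | 0.3450 | 500 | 0.7005 | 0.1572 | 0.4145 | 20 |
| PGT w/o node | 0.6840 | 0.1586 | 0.3868 | 500 | 0.7139 | 0.1689 | 0.4192 | 20 |
| PGT w/o node | 0.7078 | 0.1646 | 0.3819 | 500 | 0.7326 | 0.1654 | 0.4220 | 20 |

Note on RM3 with CRM: CRM jointly trains a hybrid of sparse and dense retrieval models. Running RM3 on CRM is an open question that is beyond the scope of this work.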
Significance markers indicate statistical significance over the initial ranker and BERT reranker using a t-test.
6 Conclusion
Most Transformer-based rerankers learn contextualized representations for query-document pairs, but queries are impoverished descriptions of information needs. This paper presents PGT, a pseudo relevance feedback method that uses a graph-based Transformer. PGT graphs treat feedback documents as additional context and leverage sparse attention to reduce computation, enabling them to use more feedback documents than is practical with BERT-based rerankers.
Experiments show that PGT improves upon non-PRF BERT rerankers. Experiments also show that PGT rankings are at least as good as BERT PRF rerankings, yet they are produced more efficiently because PGT needs fewer computations per document and reranks fewer documents per query. PGT is robust, delivering effective rankings under varied graph structures and with two rather different initial rankers.
References
- [1] (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
- [2] (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- [3] (2007) Estimation and use of uncertainty in pseudo-relevance feedback. In Proceedings of the Thirtieth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- [4] (2020) Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820.
- [5] (2020) Context-aware term weighting for first stage passage retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1533–1536.
- [6] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 4171–4186.
- [7] (2020) Complement lexical retrieval model with semantic residual embeddings. arXiv preprint arXiv:2004.13969.
- [8] (2020) Relevance transformer: generating concise code snippets with relevance feedback. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2005–2008.
- [9] (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, JMLR Proceedings, Vol. 9, pp. 249–256.
- [10] (2020) REALM: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
- [11] (2020) Guided transformer: leveraging multiple external sources for representation learning in conversational search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1131–1140.
- [12] (2004) UMass at TREC 2004: novelty and HARD. In Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, NIST Special Publication, Vol. 500-261.
- [13] (2001) Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120–127.
- [14] (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Vol. 1, pp. 6086–6096.
- [15] (2018) NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4482–4491.
- [16] (2014) Revisiting the divergence minimization feedback model. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pp. 1863–1866.
- [17] (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), CEUR Workshop Proceedings, Vol. 1773.
- [18] (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
- [19] (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3 (4), pp. 333–389.
- [20] (1971) Relevance feedback in information retrieval. In The SMART Retrieval System - Experiments in Automatic Document Processing, pp. 313–323.
- [21] (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pp. 5998–6008.
- [22] (2017) Anserini: enabling the use of Lucene for information retrieval research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1253–1256.
- [23] (2001) Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, pp. 403–410.
- [24] (2020) Transformer-XH: multi-evidence reasoning with extra hop attention. In 8th International Conference on Learning Representations, ICLR 2020.