
Is Non-IID Data a Threat in Federated Online Learning to Rank?

In this perspective paper we study the effect of non-independent and identically distributed (non-IID) data on federated online learning to rank (FOLTR) and chart directions for future work in this new and largely unexplored research area of Information Retrieval. In the FOLTR process, clients participate in a federation to jointly create an effective ranker from the implicit click signal originating in each client, without the need to share data (documents, queries, clicks). A well-known factor that affects the performance of federated learning systems, and that poses serious challenges to these approaches, is that there may be some type of bias in the way data is distributed across clients. While FOLTR systems are in their own right a type of federated learning system, the presence and effect of non-IID data in FOLTR has not been studied. To this aim, we first enumerate possible data distribution settings that may give rise to data bias across clients and thus to the non-IID problem. Then, we study the impact of each setting on the performance of the current state-of-the-art FOLTR approach, Federated Pairwise Differentiable Gradient Descent (FPDGD), and we highlight which data distributions may pose a problem for FOLTR methods. We also explore how common approaches proposed in the federated learning literature to address non-IID issues fare in FOLTR. This allows us to unveil new research gaps that, we argue, future research in FOLTR should consider. This is an important contribution to the current state of the FOLTR field because, for FOLTR systems to be deployed, the factors affecting their performance, including the impact of non-IID data, need to be thoroughly understood.





1. Introduction

Online learning to rank (OLTR) (Hofmann, 2013; Oosterhuis and de Rijke, 2018; Zhuang and Zuccon, 2020; Oosterhuis and de Rijke, 2021) aims to learn effective rankers from users' search interactions, i.e., queries and clicks on search engine result pages (SERPs), by iteratively training and updating a production ranker through online interventions. The use of clicks, rather than relevance labels, reduces the high cost and time required to collect labels from editorial teams; it also better aligns with the user's true preferences than labels provided by third-party judges. The execution of this training process online rather than offline (e.g., as in counterfactual LTR (Jagerman et al., 2019)) addresses issues associated with rapid changes in query intents (Zhuang and Zuccon, 2021).

Traditional OLTR solutions assume the ranker resides on a central server that controls the production of SERPs, including the online interventions made to explore the ranker's parameter space based on the index, and that logs every user interaction (queries, clicks). This architecture, however, is inadequate for search contexts where the data is private or confidential and cannot be shared with the central search service, or where users demand their interactions be private, i.e. not to share clicks on SERPs with the server. Federated OLTR (FOLTR) (Kharitonov, 2019; Wang et al., 2021b) has been canvassed as a solution to such situations. In FOLTR, private user data is kept on the user's device. The data is used locally within the user device to learn updates to a globally shared ranker. Local updates from all clients in the federated system are then shared with a central server (thus without sharing of actual user data), which is responsible for the aggregation of the local updates, the consequent update of the global model, and the sharing of the new global model with the clients (see Figure 2 for a concrete example of a FOLTR system). We note that while the use of a single central server is common among federated learning methods (and certainly is the only setup investigated so far for FOLTR), alternative setups are possible, including peer-to-peer federated systems with no central server (Lalitha et al., 2019; Roy et al., 2019; Wang et al., 2021a) and federated systems with multiple central servers. The objective of FOLTR is to federatively create a ranker that is more effective than each of the individual rankers users could create on their own private data – and ideally this federated ranker should perform as well as a ranker created using all user data in a centralised manner.

Research on the effectiveness of the FOLTR paradigm and the factors that affect its performance is still limited to date, with only a couple of proposed and empirically investigated methods (Kharitonov, 2019; Wang et al., 2021c, b). Importantly, research on FOLTR has fully ignored a key issue affecting the performance of federated learning (FL) systems: the presence of bias in how the training data is divided across the clients that join the federation. In other words, the fact that clients may hold non-independent and identically distributed (non-IID) data (Zhu et al., 2021).

Non-IID data can pose a severe threat to the effectiveness of a federated learning method. Models trained federatively in the presence of non-IID data across the clients that participate in the federation display significantly lower effectiveness, and at times experience difficulties converging (Zhu et al., 2021). Effectiveness degradation can mainly be attributed to the weight divergence between the local models resulting from the non-IID distribution of the data across the clients (Zhu et al., 2021). Local models with the same initial parameters will converge to different models because of the heterogeneity of the local data distributions. This divergence increases as more communication rounds of the federated learning algorithm are performed. This slows down or even impedes model convergence, worsening the performance of the global model. An illustration of the phenomenon of model divergence for both IID and non-IID data in federated learning is given in Figure 1. The ideal global model (obtained under centralised learning) and the actual global model (the average model created through FedAvg (McMahan et al., 2017)) coincide when data is IID, but diverge when data is non-IID, showing that this is a sizeable problem.
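The FedAvg aggregation step discussed above can be sketched as follows. This is a minimal illustration (function names are ours, not from the cited works): the server forms the global model as a data-size-weighted average of the clients' local models, and with non-IID data the average may sit far from every client's local optimum.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate local model weights by a data-size-weighted average (FedAvg)."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)   # shape: (num_clients, num_params)
    coeffs = sizes / sizes.sum()         # per-client aggregation weights
    return coeffs @ stacked              # weighted average of parameters

# Two clients whose local optima diverge, as can happen with non-IID data:
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])
w_global = fedavg([w1, w2], client_sizes=[100, 100])
# With equal data sizes the global model is the midpoint of the two local
# models, which may be far from either client's own optimum.
```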

Figure 1. Illustration of the model divergence problem in FL, adapted from Zhu et al. (2021): the ideal global model under centralised learning versus the average model created from the local models of Client 1 and Client 2 through FedAvg (McMahan et al., 2017).

This perspective paper provides a systematic understanding of when non-IID data may occur in the FOLTR setting and the impact of non-IID data in such cases. (In this paper, if not specified otherwise, we only consider horizontal FL (Yang et al., 2019), and we believe our framework can be applied to both cross-device and cross-silo federated learning (Kairouz et al., 2021).) This crucial research sheds light on the factors that need to be considered when devising and deploying FOLTR methods. It also details the experimental conditions for simulating non-IID data in FOLTR, paving the way for the development and adaptation to OLTR of existing and new methods for dealing with non-IID data. In this regard, we also show how some of the methods proposed in the federated learning literature to deal with non-IID data can be cast in the FOLTR framework, and the gaps that still exist in effectively addressing non-IID data in FOLTR.

2. Related Work

2.1. Federated Learning with non-IID data

Zhu et al. have compiled a comprehensive survey on the impact of non-IID data on federated learning (Zhu et al., 2021), also reviewing the current research on handling these challenges. Early work from Zhao et al. (Zhao et al., 2018) shows a deterioration of the accuracy of federated learning if non-IID or heterogeneous data is present; they also provide a solution to this problem by creating a small subset of globally shared data between all clients (local devices). Li et al. (Li et al., 2020b) analyse the convergence of the federated learning algorithm FedAvg (McMahan et al., 2017) (which is a component of the FOLTR method we rely upon for investigation (Wang et al., 2021b)) on non-IID data and empirically show that data heterogeneity slows down convergence. This work drew attention to the presence of non-IID data in federated learning.

Generally speaking, existing approaches for handling non-IID issues in federated learning can be classified into three categories: data-based approaches, algorithm-based approaches, and system-based approaches (Zhu et al., 2021). Data sharing (Zhao et al., 2018) and data augmentation (Duan et al., 2019) are two typical data-based approaches. While they achieve state-of-the-art performance, they fundamentally conflict with the objective of federated learning: that of not sharing data across clients. This is because, for example, methods such as data sharing require a subset of private data to be shared across all clients. While proposals have been made to use synthetic, rather than real, data for the data sharing mechanism (Wang et al., 2021a), it is unclear (1) what the effectiveness loss of sharing synthetic data in place of real data is, and (2) whether the sharing of synthetic data could still jeopardise privacy, as this synthetic data is typically generated from real data, and thus analysis of the synthetic data may reveal key aspects of, and information contained in, the real data. Algorithm-based approaches mainly focus on personalisation methods such as local fine-tuning of a neural model (Wang et al., 2019) and Personalized FedAvg (Per-FedAvg) (Fallah et al., 2020) – which are both limited mainly to neural models – or the casting of the federated learning process into a multi-task learning problem (Smith et al., 2017). System-based approaches adopt clustering (Sattler et al., 2020) and tree-based structures (Ghosh et al., 2020) to deal with non-IID data. Limitations exist among all proposed approaches, and this is still a largely unexplored line of research.

2.2. Federated Learning in IR

We provide an overview of the use of federated learning in OLTR in section 3. That section also introduces the FOLTR method used in the empirical experimentation in this paper: the Federated Pairwise Differentiable Gradient Descent (FPDGD) method (Wang et al., 2021b), which is the current state-of-the-art in FOLTR.

Aside from its usage in OLTR, recent works have applied federated learning in other IR contexts. Zong et al. (Zong et al., 2021) provide a solution for cross-modal retrieval in a distributed data storage scenario, which uses federated learning to reduce the potential privacy risks and the high maintenance costs encountered when dealing with a large amount of training data. Wang et al. (Wang et al., 2021d) study learning to rank (but not OLTR) in a cross-silo federated learning setting; this work is aimed at helping companies that have access to limited labelled data to collaboratively build a document retrieval system efficiently. Hartmann et al. (Hartmann et al., 2019) use federated learning to improve the ranking of suggestions in the Firefox URL bar, so that the training of the ranker on user interactions is performed in a privacy-preserving way; they show that this federated approach improves on the suggestions produced by the previously employed heuristics in Firefox. Yang et al. (Yang et al., 2018) describe the use of federated learning for search query suggestions in the Google Virtual Keyboard (GBoard) product. Here, a baseline model identifies relevant query suggestions given a user query; candidate suggestions are then filtered using a triggering model learnt using federated learning. Closest to FOLTR is the work of Li and Ouyang (2021), who devise an offline federated learning method for counterfactual learning to rank from historic click logs.

Aside from the previous examples, federated learning has also seen adoption in the area of personalised search (Ghorab et al., 2013), which aims to return search results that cater to the specific user's interests. While feature-based (Carman et al., 2010; Bennett et al., 2012; Harvey et al., 2013) and deep learning-based (Song et al., 2014; Ge et al., 2018; Yao et al., 2020) methods are widely used in this area, user data privacy has often been overlooked – this is particularly the case when considering the user's query logs, which are collected by the central server to create the personalised ranker. To tackle this issue, Yao et al. (Yao et al., 2021) recently proposed a privacy-protection-enhanced personalised search framework which adapts federated learning to the state-of-the-art personalised search model. While not directly related to the OLTR context we consider here, these related lines of research could benefit from the investigations and considerations reported in this paper, as the problem of non-IID data in these contexts has also been ignored.

Figure 2. Schematic representation of the FOLTR setting.

3. FOLTR Framework and FPDGD

We next briefly describe the FOLTR framework, including the Federated Pairwise Differentiable Gradient Descent (FPDGD) method (Wang et al., 2021b), which represents the current state-of-the-art in FOLTR and that we use as a representative method in our experiments to investigate the effect of non-IID data on FOLTR.

The federated online learning to rank setting is pictured in Figure 2. Searchable data is stored by each client and not shared with the centralised server or other clients. Different clients may hold all, a portion of, or none of the same searchable data. Queries and user clicks occur at the client side and are not communicated to the centralised server or other clients: search is indeed entirely performed on the user device. Each client exploits search interactions to perform local model updates to the ranker; for FPDGD, the routine executed by the client is shown in Algorithm 1, and the PDGD update is shown in Algorithm 2. Each client considers a number of interactions before updating the local ranker using the PDGD gradients. These local updates are then shared with the central server, which in turn combines the ranker updates from the clients to produce a revised ranker; for FPDGD, this is achieved according to the server routine in Algorithm 1. The new global model is then distributed to the users' devices.
Algorithm 1 FederatedAveraging PDGD.
Input: set of clients S participating in training, each client indexed by c; number of local interactions for client c: n_c; local interaction set: B_c; initial model weights: w_0; scoring function: f; learning rate: η.

Server executes:
  initialise w_0
  for each round t = 1, 2, … do
    for each client c ∈ S in parallel do
      w_{t+1}^c ← ClientUpdate(c, w_t)
    w_{t+1} ← Σ_{c∈S} (n_c / n) · w_{t+1}^c   // weighted average, with n = Σ_c n_c

ClientUpdate(c, w):   // Run on client c
  for each local update i from 1 to n_c do
    w ← PDGD(w, f, η)   // PDGD update shown in Algorithm 2
  return w to the server

Algorithm 2 Pairwise Differentiable Gradient Descent (PDGD) (Oosterhuis and de Rijke, 2018)
1:  Input: initial weights: w; scoring function: f; learning rate: η.
2:  q ← obtain a query from a user
3:  D ← preselect documents for query q
4:  R ← sample a result list from f_w over D
5:  show result list R to the user and observe clicks c
6:  ∇f_w ← 0   // initialize gradient
7:  for each inferred preference pair d_k ≻_c d_l do
8:     ρ ← ρ(d_k, d_l, R, D)   // initialize pair weight
9:     ∇f_w ← ∇f_w + ρ · ∇P(d_k ≻ d_l)   // add pair gradient to model gradient
10: w ← w + η · ∇f_w   // update the ranking model
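The client/server structure of Algorithm 1 can be sketched in a few lines of Python. This is a simplified, runnable illustration only: `local_pdgd_update` stands in for the actual PDGD update of Algorithm 2 (here a dummy gradient step is used so the sketch executes), and all names are ours.

```python
import numpy as np

def local_pdgd_update(w, interactions, lr=0.1):
    """Placeholder for one PDGD update from local click interactions.
    A real implementation would infer pairwise document preferences from
    clicks (Algorithm 2); here a dummy gradient keeps the sketch runnable."""
    grad = np.mean(interactions, axis=0) - w
    return w + lr * grad

def federated_pdgd(clients, w0, rounds=10, local_updates=4, lr=0.1):
    """Server loop of Algorithm 1: broadcast the global model, run n_c local
    updates on each client, then aggregate with a weighted average."""
    w_global = np.asarray(w0, dtype=float)
    sizes = np.array([len(c) for c in clients], dtype=float)
    for _ in range(rounds):
        local_models = []
        for data in clients:                        # in parallel in practice
            w = w_global.copy()
            for _ in range(local_updates):          # n_c local PDGD updates
                w = local_pdgd_update(w, data, lr)
            local_models.append(w)
        coeffs = sizes / sizes.sum()
        w_global = coeffs @ np.stack(local_models)  # FedAvg aggregation
    return w_global
```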

4. Types of non-IID Data in FOLTR

We consider training a ranker for the OLTR system as a supervised learning task in an FL setup, with each client holding a subset of the data. Each data sample is denoted as (x, y), where x is the feature representation of the data and y is the label. The local distribution of the dataset in client c is denoted as P_c(x, y). The presence of non-IID data can be represented as the difference between local data distributions: that is, for different clients c and c′, P_c(x, y) ≠ P_{c′}(x, y).
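The inequality between local distributions can be made concrete with a small check on empirical label distributions. This is an illustrative sketch with hypothetical data, not part of the paper's experimental setup.

```python
from collections import Counter

def label_distribution(labels):
    """Empirical distribution P_c(y) of relevance labels held by one client."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {y: n / total for y, n in counts.items()}

# Two clients with skewed relevance-label holdings (non-IID):
client_a = [0, 0, 0, 1]   # mostly non-relevant documents
client_b = [2, 2, 1, 2]   # mostly highly relevant documents
p_a = label_distribution(client_a)
p_b = label_distribution(client_b)
assert p_a != p_b          # P_c(y) differs across clients -> non-IID
```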

In federated learning, data across clients may not be IID for different reasons: Kairouz et al. (Kairouz et al., 2021) and Zhu et al. (Zhu et al., 2021) assert this can be due to how features and labels are distributed. However, the translation of these categories to FOLTR is not straightforward. In the following sections, we put forward several situations in which data specific to FOLTR could be distributed in a non-IID manner across clients. Specifically, we consider that data in the FOLTR process may not be IID because of biases across clients due to:

  • Type 1: document preferences (Section 5)

  • Type 2: document label distribution skewness (Section 6)

  • Type 3: click preferences (Section 7)

  • Type 4: data quantity (Section 7)

Data type | Key characteristic | When it happens in FOLTR
Type 1 | Document preferences | Different clients have different preferred candidate documents, although they are searching with the same query.
Type 2 | Document label distribution | Different clients hold candidate documents with different label distributions, while the conditional feature distribution is shared.
Type 3 | Click preferences | Different clients exhibit different click behaviours when searching with the same query.
Type 4 | Data quantity | Different clients issue queries and interact with the search system with different frequency.
Table 1. Summary of non-IID data types in FOLTR.

The last data type, Type 4, i.e., the situation in which different clients hold different quantities of data (and in particular interaction data such as queries and clicks), does not necessarily imply that the data is non-IID. However, we note this case is often studied in the FL literature alongside non-IID data (Zhu et al., 2021; Li et al., 2022), and thus we include this situation in our considerations of the non-IID problem. Each data type is defined and investigated in the next sections; in addition we provide a summary overview of the data types in Table 1.

We also note that, commonly in federated learning, non-IID data occurs because the data is distributed across clients according to its features. In other words, the marginal distribution of the features belonging to the data held by each client may vary, i.e. for different clients c and c′, P_c(x) ≠ P_{c′}(x). This situation may occur in horizontal federated learning settings (also called homogeneous FL) (Yang et al., 2019), where each client holds different and overlapping datasets. In this case, the non-IID divergence is usually caused by inconsistent data distributions, e.g., feature imbalance of the training data local to each client. However, this case does not seem applicable to FOLTR (and thus is not further studied in this paper). In FOLTR, each data item is represented by the feature vector of a query-document pair and its relevance label. The features often consist of variations of query-dependent features such as TF-IDF scores, BM25 scores, and query length, as well as query-independent features such as PageRank, URL length, and so on (Qin and Liu, 2013). In this case, bias in the feature distribution across clients would be rare, as most features depend on the query-document pair.

Next, we describe the non-IID data types we put forward in this paper and analyse their impact on FOLTR. We empirically find that only Type 1 and, partially, Type 2 data have a strong impact on FOLTR. We thus predominantly focus our attention on these two data types, while providing only a definition and a brief account of the remaining two data types due to space constraints: we do, however, report all experimental results, thorough analysis and considerations in an online appendix available at

5. Type 1: Document preferences

Document preference skewness (Type 1) considers the situation in which the conditional distribution P_c(y|x) varies across the clients, though P_c(x) remains the same. This happens when different clients have different preferred candidate documents, although they are searching with the same query. As OLTR uses the user's implicit feedback as its optimisation signal, which might be highly related to individual preferences, this setting appears very likely to occur.

Figure 3. Offline performance (nDCG@10) on Type 1 data; results averaged across dataset splits and experimental runs.

5.1. Simulating Type 1 non-IID Data

The mechanism we use to simulate non-IID data of Type 1, and IID data to baseline FOLTR effectiveness, relies on a recent work that empirically studied and demonstrated how OLTR methods adapt when users' search intents change over time (Zhuang and Zuccon, 2021). In particular, Zhuang and Zuccon (2021) created a collection for OLTR with several explicit intent types by adapting an existing TREC collection, as no dataset is available for studying this OLTR problem. Derived from ClueWeb09 and the TREC Web Track 2009 to 2012 (Clarke et al., 2012), this intent-change collection consists of 200 queries with 4 intents each and, on average, 512 candidate documents per query. Furthermore, the relevance judgements of query-document pairs are provided per intent. We believe this is an appropriate collection to adapt to study the effect of Type 1 non-IID data on FOLTR: we can regard each intent as a type of user preference. As the average number of relevant documents per intent varies largely across intent types, the learning difficulty of optimizing a ranker also varies across intents. To avoid this bias, we follow Zhuang and Zuccon (Zhuang and Zuccon, 2021) and re-label the original intent number for each query through random shuffling: this is possible because all intent types are independent across queries. In our experiments, we repeat this re-balancing process 5 times, thus giving rise to results averaged across 5 FOLTR experiments. We refer to Zhuang and Zuccon (Zhuang and Zuccon, 2021) for further details on the dataset creation, and we further highlight that we have made available an implementation of the dataset creation procedure along with the actual dataset at

To simulate non-IID data, after randomly shuffling all intents across the 4 types, we let each intent represent one client preference. The client preferences differ from each other for the same query-document pair, as do the corresponding relevance judgements. The federated setup involves 4 clients (one per intent type), each performing a fixed number of local updates between global communication rounds. These settings are similar to those used in previous work on FOLTR (Kharitonov, 2019; Wang et al., 2021c, b) – in particular we refer the interested reader to the work of Wang et al. (Wang et al., 2021c) to understand the relationships between the number of clients, the number of local updates, and FOLTR effectiveness. For the implicit feedback in FOLTR, we simulate user clicks based on the popular Simplified Dynamic Bayesian Network (SDBN) click model (Chapelle and Zhang, 2009), following settings in previous work on OLTR (Oosterhuis et al., 2016; Oosterhuis and de Rijke, 2018; Zhuang and Zuccon, 2020; Wang et al., 2021b). We limit the SERP to 10 documents and use nDCG@10 for offline evaluation and cumulative discounted nDCG (Oosterhuis and de Rijke, 2018) for online evaluation. We train a linear ranker and a neural ranker on the intent-change dataset. As in Zhuang and Zuccon (2021), given that no held-out test set is available, we evaluate both online and offline performance on the original training set across all 4 intent types and average all results. For the IID setting, we merge all intents and mark a document as relevant as long as it is judged relevant for at least one of the intent types. Each client randomly picks a query from the training set and clicks documents based on the same preferences during federated training with IID data. Other settings remain the same as in the non-IID experiments.
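The SDBN click simulation described above can be sketched as follows. The parameter values below are illustrative placeholders for 3-level relevance grades, not the exact values used in the experiments (which follow Chapelle and Zhang (2009) and prior OLTR work); the function name is ours.

```python
import random

# Illustrative SDBN parameters for 3-level relevance grades (0, 1, 2);
# the actual instantiations (perfect, navigational, informational)
# follow the values used in prior OLTR work.
P_CLICK = {"perfect": {0: 0.0, 1: 0.5, 2: 1.0},
           "navigational": {0: 0.05, 1: 0.5, 2: 0.95},
           "informational": {0: 0.4, 1: 0.7, 2: 0.9}}
P_STOP = {"perfect": {0: 0.0, 1: 0.0, 2: 0.0},
          "navigational": {0: 0.2, 1: 0.5, 2: 0.9},
          "informational": {0: 0.1, 1: 0.3, 2: 0.5}}

def sdbn_clicks(relevance_grades, model="navigational", rng=random):
    """Simulate clicks on a SERP with the Simplified DBN click model:
    the user scans results top-down, clicks with a relevance-dependent
    probability, and may abandon the session after each click."""
    clicks = []
    for rank, grade in enumerate(relevance_grades):
        if rng.random() < P_CLICK[model][grade]:
            clicks.append(rank)
            if rng.random() < P_STOP[model][grade]:
                break
    return clicks
```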

5.2. Impact of Type 1 non-IID Data

The offline performance related to Type 1 data is shown in Figure 3; the corresponding online performance is shown in Table 2. From the offline performance, it is clear that the presence of non-IID data negatively impacts the performance of the learnt ranker, compared to the performance obtained when data is IID. In terms of online performance, rankers obtained in the presence of non-IID data are also worse than when trained with IID data. This can be explained as follows: since each client has its own preference (intent), relevant documents are judged in different ways; this leads to divergence of each client's local ranker updates, as exemplified in Figure 1.

In summary, we find that if data is distributed in a non-IID manner across clients according to Type 1, the effectiveness of FOLTR (and specifically of FPDGD) is seriously affected.

ranker | data type | perfect | navigational | informational
linear | IID | 1002.36 | 872.12 | 894.95
linear | non-IID | 648.71 | 546.25 | 566.23
neural | IID | 1061.57 | 834.08 | 842.87
neural | non-IID | 668.38 | 505.64 | 490.29
Table 2. Online performance on Type 1 data, averaged across dataset splits and experimental runs. Significant differences between IID and non-IID are indicated (p < 0.05).

5.3. Dealing with Type 1 non-IID Data

The employed state-of-the-art FPDGD method is based on the FedAvg algorithm. The fact that FPDGD is affected by non-IID data may be due to the underlying federation algorithm, i.e. FedAvg itself. In the federated learning literature, variations of this federation algorithm have been proposed to tackle the non-IID data problem directly. We select two such methods, FedProx (Li et al., 2020a) and FedPer (Arivazhagan et al., 2019), and adapt them to the FPDGD method.

(a) intent-change (linear ranker) - FedProx
(b) intent-change (neural ranker) - FedProx
(c) intent-change (neural ranker) - FedPer
Figure 4. Offline performance on Type 1 data for FedProx and FedPer; results averaged across dataset splits and experimental runs.

FedProx (Li et al., 2020a) improves the local objective of FedAvg. Specifically, it introduces an additional regularisation term (weighted according to a hyper-parameter μ) in the local objective function to limit the distance between the local model and the global model. We provide details of our adaptation of FedProx to FPDGD in the online appendix; the use of FedProx adds little computational overhead. However, the main drawback is that the hyper-parameter μ needs to be carefully tuned: a large μ may slow convergence by forcing the local updates to stay close to the initial point, while a small μ may not make much difference compared to the use of FedAvg.
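The FedProx local step can be sketched as a plain gradient step on the regularised objective, i.e. the local loss plus the proximal term (μ/2)·‖w − w_global‖². This is a minimal illustration (the function name is ours, and the actual adaptation to FPDGD is detailed in the online appendix):

```python
import numpy as np

def fedprox_local_update(w_local, w_global, grad_fn, lr=0.1, mu=0.01):
    """One local gradient step on the FedProx objective:
    local loss gradient plus the gradient of (mu/2) * ||w - w_global||^2,
    which penalises drift of the local model away from the global model."""
    grad = grad_fn(w_local) + mu * (w_local - w_global)
    return w_local - lr * grad
```

With mu=0 this reduces to a plain local SGD step (i.e. FedAvg's local update), while a large mu keeps the local model close to the global one.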

FedPer (Arivazhagan et al., 2019) tackles the presence of non-IID data exclusively for deep neural networks by separating them into base layers and personalisation layers. The base layers are trained collaboratively through FedAvg, where all clients share the same base layers. Instead, the personalisation layers are trained locally using the clients' local data with stochastic gradient descent (SGD). This procedure works as follows: after initialisation, each client merges its base and personalisation layers and updates them locally using an SGD-style algorithm. Each client only sends its base layers to the global server. The server updates the globally shared base layers using FedAvg and sends the updated layers back to each client. Intuitively, the base layers are updated globally to learn common high-level representations. In contrast, the distinct personalisation layers never leave the local device and capture the personalisation aspects required by the clients. Except for the training and maintenance of the local personalisation layers, FedPer is quite similar to FedAvg. FedPer, however, reduces communication costs, as only part of the whole model is transferred, and has shown enhanced learning performance under highly skewed non-IID data (Arivazhagan et al., 2019).
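One FedPer communication round can be sketched as below. This is an illustrative skeleton only: the client state and the dummy local update are ours (a real implementation would run SGD on the joint base + personalisation network), and only the base layers ever reach the server.

```python
import numpy as np

def fedper_round(clients, base_global):
    """One FedPer round: each client updates its base and personalisation
    layers locally, but only the base layers are sent to the server and
    averaged; personalisation layers never leave the device."""
    new_bases = []
    for c in clients:
        c["base"] = base_global.copy()            # receive global base layers
        # Local SGD on (base, personal) would happen here; a dummy update
        # driven by a per-client signal keeps the sketch runnable.
        c["base"] += 0.1 * c["local_signal"]
        c["personal"] += 0.1 * c["local_signal"]  # stays on the device
        new_bases.append(c["base"])
    return np.mean(np.stack(new_bases), axis=0)   # FedAvg on base layers only
```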

Our experimental results for FedProx and FedPer are shown in Figure 4; for FedProx, we explored a range of values for the hyper-parameter μ. The results clearly show that these federated learning methods, which successfully deal with non-IID data in general machine learning tasks, are not effective in the FOLTR context. In fact, not only do these methods fail to close the gap in effectiveness between the IID and non-IID setups, they also provide only limited improvements, if any, compared to FPDGD with FedAvg. This is an important finding because: (1) it shows a realistic case in which non-IID data largely affects FOLTR effectiveness, and (2) it shows that current methods developed in general FL for non-IID data do not work in FOLTR. A strong need for new methods specialised to the FOLTR setting thus emerges from these findings.

6. Type 2: Document label distribution skewness

Document label distribution skewness (Type 2) is a widely recognised type of non-IID data in federated learning. In this setting, the label distributions in each client are different, while the conditional feature distribution is shared across the clients. In terms of FOLTR, this is equivalent to the following situation. Assume documents are evaluated on k-level graded relevance, from not relevant (0) to perfectly relevant (k−1); then the label distribution on each client is such that, for client c, the probability of holding documents with relevance label y is P_c(y), where P_c(y) ≠ P_{c′}(y) for different clients c and c′.

In practice, this may be represented by a situation like the following. Several hospitals are collaboratively creating a FOLTR ranker for clinical decision support (Roberts et al., 2015, 2016). Certain hospitals hold a significantly larger portion of highly relevant health records for a certain disease, while others hold only a small fraction. In this circumstance, the document label distribution is skewed. In the context of email search (Narang et al., 2017), different clients might have unique strategies for managing personal emails (Whittaker and Sidner, 1996). Some clients frequently clean up their inboxes and use folders to organise emails, while others hardly use folders or delete irrelevant messages, resulting in different label distributions when following a learning-to-rank approach.

(a) MSLR-WEB10k (linear ranker)
(b) MSLR-WEB10k (neural ranker)
Figure 5. Offline performance (nDCG@10) on MSLR-WEB10k for Type 2 (non-IID subtype 1), under three instantiations of the SDBN click model and three local update settings; results averaged across all dataset splits and experimental runs.
(a) MSLR-WEB10k (linear ranker)
(b) MSLR-WEB10k (neural ranker)
Figure 6. Offline performance (nDCG@10) on MSLR-WEB10k for Type 2 (non-IID subtype 2), under three instantiations of the SDBN click model with a fixed local update setting; results averaged across all dataset splits and experimental runs.

6.1. Simulating Type 2 non-IID Data

In this section, we discuss how we synthetically simulate Type 2 non-IID data, with IID data as a baseline for FOLTR effectiveness. For these experiments, we use the popular datasets MSLR-WEB10k (Qin and Liu, 2013) (10,000 queries), Yahoo (Chapelle and Chang, 2011) (29,900 queries) and Istella-S (Lucchese et al., 2016) (33,018 queries). We report the results for MSLR-WEB10k in the paper; results on the other datasets are similar and are provided in the online appendix. We simulate a set of clients, with each client performing a fixed number of interactions (queries) locally to contribute to each global model update, and we restrict the number of global communication rounds. For simulating querying behaviour, for each client participating in the federated OLTR, we sample queries randomly, in line with previous work on FOLTR (Kharitonov, 2019; Wang et al., 2021b). For each query, we use the local ranking model (i.e. that held by the client) to rank documents; we limit the SERP to 10 documents. For the click behaviour, we rely on the same SDBN click models as in Section 5. We train both a linear ranker and a neural ranker, following Wang et al. (Wang et al., 2021b).

We specifically consider two types of non-IID data for Type 2: non-IID subtype 1 and non-IID subtype 2. The main difference between the two subtypes is the number of different labels (i.e. the graded relevance assessments) in each client’s local dataset. Following partitioning strategies similar to those of Li et al. (2022), suppose each client only has data samples for different labels. We first generate all possible -combinations of the relevance set and randomly assign them to clients. Then, for the query-document pairs of each label, we randomly and equally divide them among the clients who own that label. In this way, the number of labels in each client is fixed, and there is no overlap between the samples of different clients.
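A minimal sketch of this partitioning strategy is given below; function and variable names are our own and are not taken from the paper's codebase.

```python
import itertools
import random

def partition_by_label(samples, labels, n_labels_per_client, seed=0):
    """Split (sample, label) pairs across clients so that each client holds
    samples from exactly `n_labels_per_client` distinct labels, with no
    overlap of samples between clients."""
    rng = random.Random(seed)
    label_set = sorted(set(labels))
    # One client per combination of relevance labels, assigned at random.
    combos = list(itertools.combinations(label_set, n_labels_per_client))
    rng.shuffle(combos)
    clients = [[] for _ in combos]
    # For each label, divide its samples equally among the owning clients.
    for label in label_set:
        owners = [i for i, c in enumerate(combos) if label in c]
        pool = [s for s, l in zip(samples, labels) if l == label]
        rng.shuffle(pool)
        for j, s in enumerate(pool):
            clients[owners[j % len(owners)]].append((s, label))
    return combos, clients
```

With the five MSLR-WEB10k relevance labels, `n_labels_per_client=1` yields one client per label (subtype 1), while `n_labels_per_client=2` yields one client per label pair (subtype 2).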

In non-IID subtype 1, each client only holds query-document pairs with one specific value of the relevance label. We use to denote this partitioning strategy. This federated setup involves clients. We also vary the local updating time to investigate the impact of local updating with a fixed global communication time . In non-IID subtype 2, each client holds data samples from two relevance labels – we denote this as . We simulate clients and, for a fair comparison between the two non-IID subtypes, we also simulate clients for (with each label distributed across two different clients).

The IID experimental setting is the same as the non-IID in terms of federation, ranker parameters and evaluation procedure, except that each client now randomly picks a query from the whole training set with all graded judgements during the training period.

6.2. Impact of Type 2 non-IID Data

The offline performance for on MSLR-WEB10k is shown in Figure 5 and the corresponding online performance is shown in Table 3. From the offline results, it is clear that the rankers learned with non-IID data under-fit, performing poorly on the generalised held-out test set under all three settings of local updating times (). For the perfect click model, a larger number of achieves better test performance. However, when it comes to noisier clicks (navigational or informational), the trend is reversed, although differences are minimal and the model performance fluctuates. For the online results, all non-IID settings reach the same maximum value (), as each client’s local data only contains documents from a single relevance label.

                 linear ranker                 neural ranker
click
IID      per.    742.10  778.56  798.05        716.55  781.18  815.80
         nav.    698.25  743.35  771.04        649.83  728.96  775.87
         inf.    672.23  722.23  757.10        612.76  693.35  748.64
non-IID  per.   1589.23 1589.23 1589.23       1589.23 1589.23 1589.23
         nav.   1589.23 1589.23 1589.23       1589.23 1589.23 1589.23
         inf.   1589.23 1589.23 1589.23       1589.23 1589.23 1589.23
Table 3. Online performance on MSLR-WEB10k for Type 2 (), averaged across dataset splits and experimental runs; the three columns per ranker correspond to the three local-update settings.
(a) MSLR-WEB10k (linear ranker)
(b) MSLR-WEB10k (neural ranker) - data sharing
(c) MSLR-WEB10k (neural ranker) - FedProx & FedPer
Figure 7. Offline performance on MSLR-WEB10k when using Data-sharing, FedProx and FedPer on Type 2 non-IID data (with ); results averaged across dataset splits and experimental runs.

Offline results for are shown in Figure 6. In this case, the effectiveness of the learnt rankers is much higher than for : diversity in the labels held by a client prevents major losses in FOLTR effectiveness. This result is also consistent with previous findings in general federated learning with non-IID data: Li et al. (2022) found that the most challenging setting is when each client only has data samples from a single class (label). We further note that another reason for the performance gap is the pairwise loss used in FPDGD (Wang et al., 2021b): when each client only has one relevance label, it is hard to infer preferences between document pairs (as both documents in a pair have the same label). However, given labels from two levels of relevance (), pairwise differences can be effectively inferred. This suggests that the results obtained here for Type 2 data may not generalise to FOLTR methods beyond FPDGD if they do not rely on the pairwise preference mechanism. We note, however, that FPDGD is the current state-of-the-art method and that the only available alternative (Kharitonov, 2019) displays highly variable and considerably worse performance compared to FPDGD (Wang et al., 2021c, d). Therefore, new FOLTR methods must also be validated in the presence of Type 2 data.
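The argument about the pairwise loss can be illustrated with a toy count of the document pairs that carry a strict preference signal (this illustrates the intuition only; it is not FPDGD's actual loss):

```python
from itertools import combinations

def inferable_pairs(labels):
    """Number of document pairs with a strict preference (different relevance
    labels) - the pairs a pairwise loss such as FPDGD's can learn from."""
    return sum(1 for a, b in combinations(labels, 2) if a != b)

# A client holding documents of a single relevance label yields no usable
# pairs, while a client holding two relevance levels yields strict
# preferences to learn from.
assert inferable_pairs([2, 2, 2, 2]) == 0
assert inferable_pairs([0, 0, 3, 3]) == 4
```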

In summary, we find that if data is distributed in a non-IID manner across clients according to Type 2, the effectiveness of FOLTR (and specifically of FPDGD) is seriously affected in the case of ; however, if then gaps in effectiveness compared to IID settings are minimal.

6.3. Dealing with Type 2 non-IID Data

To mitigate the effect of Type 2 non-IID data, we investigate three existing methods from the federated learning literature: Data-sharing, FedProx and FedPer. FedProx and FedPer have been described in Section 5.3. Data-sharing was first proposed by Zhao et al. (2018). They attribute the performance reduction observed on non-IID data to weight divergence, which is in turn driven by the divergence between each local data distribution and the overall distribution. They then introduce a straightforward idea to improve FedAvg: slightly reduce the divergence that causes the global model to underperform. This is achieved as follows. A globally shared dataset characterised by the overall data distribution is centralised on the server, and a warm-up global model is trained on it. Then, a random proportion of the shared dataset is sent to all clients, and each client updates its local model using both its local training data and the shared data. Lastly, the server aggregates the local models from the clients and updates the global model with FedAvg. Experimental results on machine learning tasks show that data sharing can significantly enhance global model performance in the presence of non-IID data. However, the shortcomings are also pronounced: it is challenging to collect a uniformly distributed global dataset in real-world scenarios, because either the global server needs prior knowledge about the local data distributions, or each client needs to share part of its local data (violating the privacy requirement underlying FL).
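The data-sharing scheme can be sketched as follows. This is a simplified illustration in which models are plain parameter vectors and `local_train` stands in for a client's local training routine; all names are our own, not Zhao et al.'s implementation.

```python
import random

def fedavg(models):
    """Unweighted FedAvg: element-wise average of parameter vectors."""
    return [sum(ws) / len(ws) for ws in zip(*models)]

def data_sharing_round(global_model, clients, shared_data, alpha,
                       local_train, seed=0):
    """One communication round of the data-sharing scheme (Zhao et al., 2018):
    each client trains on its local data plus a random alpha-fraction of the
    globally shared dataset, then the server aggregates with FedAvg.
    `local_train` is any routine mapping (model, data) -> updated model."""
    rng = random.Random(seed)
    k = int(alpha * len(shared_data))
    local_models = []
    for local_data in clients:
        subset = rng.sample(shared_data, k)  # shared portion sent to client
        local_models.append(local_train(list(global_model),
                                        local_data + subset))
    return fedavg(local_models)
```

In our experiments the shared set is 10% of the whole dataset; a warm-up global model would be trained on the shared set before the first round.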

Figure 7 reports the results for Data-sharing, FedProx and FedPer on the MSLR-WEB10k dataset under label distribution skewness . We randomly select 10% of the entire dataset as the globally shared data and simulate clients with local updates before each global update. Results show that global performance can be significantly enhanced with data-sharing for both linear and neural rankers. On the other hand, neither FedProx nor FedPer provides statistically significant gains over the basic FPDGD on Type 2 non-IID data (with ).

7. Other Data Types

7.1. Type 3: Click Preferences

Next, we consider as a source of non-IID data the noise and biases caused by the different click preferences arising from different clients that participate in the FOLTR training; we term this type of non-IID data as click preference skewness (Type 3).

The mechanism to emulate non-IID data of Type 3, and the corresponding IID data baseline, in our FOLTR experiments is as follows. We study two widely used click models: the Simplified Dynamic Bayesian Network (SDBN) click model (Chapelle and Zhang, 2009) and the Position-Based Model (PBM) (Craswell et al., 2008). For the non-IID settings with SDBN, each client chooses one of three widely-used instantiations of SDBN, namely perfect, navigational and informational. For the non-IID settings with PBM, we generate 5 instantiations by varying the parameter; each client is represented by one click type. Thus the federated setup involves 3 clients for SDBN clicks and 5 for PBM. We set the local updating time with fixed global communication times . In the IID setting, at each interaction, each client is simulated with a click model randomly picked from all the click model instantiations detailed above and used in the non-IID setting; this provides a fair comparison between the IID and non-IID settings. We experiment on MSLR-WEB10k, Yahoo and Istella-S.
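As an illustration of the click simulation, a minimal SDBN simulator is sketched below, instantiated with the perfect parameters from Table 4 in the Appendix; the code is our own sketch, not the paper's implementation.

```python
import random

def simulate_sdbn_clicks(relevances, p_click, p_stop, seed=0):
    """Simulate clicks on a SERP with the Simplified Dynamic Bayesian
    Network (SDBN) model: the user scans top-down, clicks a document with
    probability p_click[rel], and after a click stops scanning with
    probability p_stop[rel]."""
    rng = random.Random(seed)
    clicks = []
    for rel in relevances:
        clicked = rng.random() < p_click[rel]
        clicks.append(int(clicked))
        if clicked and rng.random() < p_stop[rel]:
            break
    return clicks

# 'perfect' instantiation (Table 4): clicks follow relevance exactly and
# the user never stops scanning early.
perfect_click = {0: 0.0, 1: 0.2, 2: 0.4, 3: 0.8, 4: 1.0}
perfect_stop = {r: 0.0 for r in range(5)}
```

The navigational and informational instantiations differ only in their click and stop probabilities (also listed in Table 4), which makes each client's click behaviour a single parameter set in the Type 3 simulation.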

For both online and offline performance, and on all datasets, our experimental results show that the difference between non-IID and IID data for Type 3 is not significant; for further details we refer the reader to the Appendix in this paper and to the online appendix.

7.2. Type 4: Data Quantity

Finally, we consider the case of data quantity skewness (Type 4); this occurs when the amount of training data varies across clients. This is a common scenario in real-world applications. For example, in FOLTR, some clients tend to issue more queries and interact more with the search system than others; thus, they have more data for training. The situation represented by Type 4 may occur in combination with the other data types. In our empirical experiments, we have studied Type 4 data both on its own and combined with the document preferences skew (Type 1) and the document label distribution skew with (Type 2).

Type 4 data is simulated by assigning different numbers of queries () to each client during the same local updating period, thus leading to different local updating times for each client. The number of queries varies in {1, 3, 5, 7, 9} and we simulate clients in total with fixed global communication times . Experiments are carried out on MSLR-WEB10k, Yahoo and Istella-S.
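The quantity-skew assignment can be sketched as below; the exact mapping of query counts to clients is not specified in the text, so a simple deterministic cycling over the candidate counts is assumed here, with names of our own.

```python
def assign_query_counts(n_clients, counts=(1, 3, 5, 7, 9)):
    """Give each client a fixed per-round query budget by cycling through
    the candidate counts, producing quantity skew across the federation:
    clients with larger budgets contribute more local updates per round."""
    return [counts[i % len(counts)] for i in range(n_clients)]
```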

When mixing other non-IID types with Type 4, we follow the same experimental settings as for the previous non-IID types, and we also assign different numbers of queries to each client (from {1, 3, 5, 7, 9}) during the same local updating period. In the IID simulation, by contrast, each client performs 5 iterations of searching for different queries. For both IID and non-IID, we use SDBN click models for click simulation and train a linear ranker using FPDGD, on the intent-change dataset from Zhuang and Zuccon (2021) for Type 1, and on MSLR-WEB10k (with ) for Type 2.

Empirical results, reported in the Appendix in this paper and in the online appendix, show that if data is distributed in a non-IID manner across clients according to Type 4, the effectiveness of FPDGD is not impacted. We stress that this result may be specific to FPDGD, because it uses the FedAvg paradigm, and may not generalise to other FOLTR methods.

8. Outlook and Discussion

In this paper, we provide a new perspective on the problem of data distribution across clients for federated online learning to rank. Next, we summarise our key findings and draw directions for future research.

Impact of non-IID data. We found that non-IID distributions of document preferences (Type 1) and specific cases of document labels (Type 2) have severe effects on the effectiveness of FPDGD. Conversely, if data is distributed across clients in a non-IID manner with respect to click preferences (Type 3) or data quantity (Type 4), no significant effects on the quality of FPDGD are observed. These findings contribute an understanding of the data distributions under which it is safe to use FOLTR and those under which it is not. We believe this paper will encourage researchers to include non-IID data settings when evaluating new FOLTR methods.

Calling for FOLTR methods to address non-IID issues. Our paper charts directions for future work on non-IID data in FOLTR concerning the creation of techniques that provide remedies for Types 1 and 2, while deeming solutions for Types 3 and 4 less critical. Importantly, we show that existing solutions employed in general federated learning to mitigate the non-IID data problem do not transfer to the FOLTR setting, despite some of these non-IID cases (and especially Type 1) being likely to occur across many FOLTR systems. Thus, researching how to address non-IID data in FOLTR is a worthwhile area of investigation.

Privacy should be a high priority when dealing with non-IID data. Our analysis found that only the data-sharing technique could address Type 2 non-IID data to a significant extent. However, this and similar methods, although performing well, require prior knowledge about the users’ local data distributions – and thus require users to share private data, largely defeating the purpose of federated learning. We note that recent work has considered sharing synthetic, rather than real, data (Wang et al., 2021a). In such a setting, each client would use its real data to generate synthetic data, and only the synthetic data would be shared in the federation. However, we could not find evidence of the loss in effectiveness associated with the use of synthetic rather than real data in the data-sharing scheme. Furthermore, it is unclear what privacy guarantees such a synthetic data-sharing scheme provides. Specifically, we wonder whether the use of synthetic data could jeopardise privacy: because the synthetic data is generated from the real data, analysis of the synthetic data may reveal key aspects of, and information contained in, the real data. Thus, how to guarantee users’ privacy when designing effective FOLTR algorithms for non-IID data is still an open question.

Real-world datasets and benchmarks for FOLTR with non-IID data are needed. The experiments put forward in this perspective paper to substantiate our views on the non-IID data problem in FOLTR are based on simulations. While simulations are prevalent in information retrieval, and especially in its evaluation (Cooper, 1973; Azzopardi, 2016; Maxwell and Azzopardi, 2016; Zhang et al., 2017; Balog et al., 2021), a key aspect we had to simulate was the nature of the non-IID data, including their distributions. On the one hand, this allows us to carefully control the experiments; on the other, it limits the generalisability of the findings to real non-IID data that may occur in FOLTR settings. We therefore want to conclude with a call to action for information retrieval practitioners in this area: there is a pressing need for FOLTR benchmark datasets that provide standard simulations of real-world non-IID scenarios, as well as standard hyper-parameter settings, so that future FOLTR algorithms can be fairly compared.

9. Conclusion

The goal of FOLTR is to learn an effective ranker in a federated (without the need for searchable and interaction data to reside on a central server) and online (by exploiting users’ clicks on SERPs as they occur) manner. In such a FOLTR setup, user data and interactions reside on the user’s client, not on a central server, and clients do not need to share this data. Instead, they only share ranker updates with a central server, whose responsibility is to collect such updates from the clients and aggregate them into a global model. The global model is then pushed back to the clients in an iterative manner as search interactions occur.
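The iterative loop described above can be sketched as follows; this is a schematic illustration with names of our own, where `local_update` stands for a client's online learning from its click interactions and `aggregate` for FedAvg-style averaging.

```python
def average(models):
    """FedAvg-style unweighted parameter averaging across client models."""
    return [sum(ws) / len(ws) for ws in zip(*models)]

def federated_round(global_weights, clients, local_update, aggregate):
    """One FOLTR communication round: each client refines the current
    global ranker on its own interactions, only the resulting weights
    leave the client, and the server aggregates and redistributes them."""
    local_models = [local_update(list(global_weights), c) for c in clients]
    return aggregate(local_models)
```

Iterating `federated_round` as search interactions occur yields the global ranker that is pushed back to the clients after each round.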

Despite federated learning receiving substantial attention, research in FOLTR is still in its early stages, with only two methods available at the time of writing (Kharitonov, 2019; Wang et al., 2021b). Importantly, studies that have proposed FOLTR methods have ignored an important issue that has been shown to affect the performance of federated learning systems (Zhu et al., 2021): that of the data not being distributed across the federated clients in an independent and identically distributed manner (non-IID data). This paper provides the first analysis of the impact of non-IID data on FOLTR, and it charts directions for future research. Our findings and observations may also be valid in other contexts that consider creating a ranker from interaction data in a federated manner, e.g., federated counterfactual learning to rank (Li and Ouyang, 2021).

We make our code, experimental details and results available online.

We would like to thank the anonymous reviewers for their insightful feedback in further shaping the paper. We would also like to thank Dr Bevan Koopman and Dr Harrisen Scells for their thoughtful comments on earlier drafts of this paper.


  • M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choudhary (2019) Federated learning with personalization layers. arXiv preprint arXiv:1912.00818. Cited by: §5.3, §5.3.
  • L. Azzopardi (2016) Simulation of interaction: a tutorial on modelling and simulating user interaction and search behaviour. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 1227–1230. Cited by: §8.
  • K. Balog, D. Maxwell, P. Thomas, S. Zhang, and B. London (2021) Report on the 1st simulation for information retrieval workshop (sim4ir 2021) at sigir 2021. Cited by: §8.
  • P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui (2012) Modeling the impact of short-and long-term behavior on search personalization. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 185–194. Cited by: §2.2.
  • M. J. Carman, F. Crestani, M. Harvey, and M. Baillie (2010) Towards query log based personalization using topic models. In International Conference on Information and Knowledge Management, pp. 1849–1852. Cited by: §2.2.
  • O. Chapelle and Y. Chang (2011) Yahoo! learning to rank challenge overview. In Proceedings of the Yahoo! Learning to Rank Challenge, pp. 1–24. Cited by: §6.1.
  • O. Chapelle and Y. Zhang (2009) A dynamic bayesian network click model for web search ranking. In International Conference on World Wide Web, pp. 1–10. Cited by: §5.1, §7.1.
  • C. L. Clarke, N. Craswell, and E. M. Voorhees (2012) Overview of the trec 2012 web track. Technical report National Institute of Standards and Technology (NIST). Cited by: §5.1.
  • M. D. Cooper (1973) A simulation model of an information retrieval system. Information Storage and Retrieval 9 (1), pp. 13–32. Cited by: §8.
  • N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey (2008) An experimental comparison of click position-bias models. In International Conference on Web Search and Data Mining, pp. 87–94. Cited by: §7.1.
  • M. Duan, D. Liu, X. Chen, Y. Tan, J. Ren, L. Qiao, and L. Liang (2019) Astraea: self-balancing federated learning for improving classification accuracy of mobile deep learning applications. In IEEE International Conference on Computer Design, pp. 246–254. Cited by: §2.1.
  • A. Fallah, A. Mokhtari, and A. Ozdaglar (2020) Personalized federated learning with theoretical guarantees: a model-agnostic meta-learning approach. Advances in Neural Information Processing Systems 33. Cited by: §2.1.
  • S. Ge, Z. Dou, Z. Jiang, J. Nie, and J. Wen (2018) Personalizing search results using hierarchical rnn with query-aware attention. In International Conference on Information and Knowledge Management, pp. 347–356. Cited by: §2.2.
  • M. R. Ghorab, D. Zhou, A. O’connor, and V. Wade (2013) Personalised information retrieval: survey and classification. User Modeling and User-Adapted Interaction 23 (4), pp. 381–443. Cited by: §2.2.
  • A. Ghosh, J. Chung, D. Yin, and K. Ramchandran (2020) An efficient framework for clustered federated learning. arXiv preprint arXiv:2006.04088. Cited by: §2.1.
  • F. Hartmann, S. Suh, A. Komarzewski, T. D. Smith, and I. Segall (2019) Federated learning for ranking browser history suggestions. arXiv preprint arXiv:1911.11807. Cited by: §2.2.
  • M. Harvey, F. Crestani, and M. J. Carman (2013) Building user profiles from topic models for personalised search. In International Conference on Information and Knowledge Management, pp. 2309–2314. Cited by: §2.2.
  • K. Hofmann (2013) Fast and reliable online learning to rank for information retrieval. In ACM SIGIR Forum, Vol. 47, pp. 140–140. Cited by: §1.
  • R. Jagerman, H. Oosterhuis, and M. de Rijke (2019) To model or to intervene: a comparison of counterfactual and online learning to rank from user interactions. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 15–24. Cited by: §1.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. Cited by: §4, footnote 2.
  • E. Kharitonov (2019) Federated online learning to rank with evolution strategies. In International Conference on Web Search and Data Mining, pp. 249–257. Cited by: §1, §1, §5.1, §6.1, §6.2, §9.
  • A. Lalitha, O. C. Kilinc, T. Javidi, and F. Koushanfar (2019) Peer-to-peer federated learning on graphs. arXiv preprint arXiv:1901.11173. Cited by: footnote 1.
  • C. Li and H. Ouyang (2021) Federated unbiased learning to rank. arXiv preprint arXiv:2105.04761. Cited by: §2.2, §9.
  • Q. Li, Y. Diao, Q. Chen, and B. He (2022) Federated learning on non-iid data silos: an experimental study. In IEEE International Conference on Data Engineering, Cited by: §4, §6.1, §6.2.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020a) Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, pp. 429–450. Cited by: §5.3, §5.3.
  • X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2020b) On the convergence of fedavg on non-iid data. In 8th International Conference on Learning Representations, Cited by: §2.1.
  • C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, F. Silvestri, and S. Trani (2016) Post-learning optimization of tree ensembles for efficient ranking. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–952. Cited by: §6.1.
  • D. Maxwell and L. Azzopardi (2016) Simulating interactive information retrieval: simiir: a framework for the simulation of interaction. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 1141–1144. Cited by: §8.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: Figure 1, §1, §2.1.
  • K. Narang, S. T. Dumais, N. Craswell, D. Liebling, and Q. Ai (2017) Large-scale analysis of email search and organizational strategies. In ACM SIGIR Conference on Human Information Interaction and Retrieval, pp. 215–223. Cited by: §6.
  • H. Oosterhuis and M. de Rijke (2018) Differentiable unbiased online learning to rank. In International Conference on Information and Knowledge Management, pp. 1293–1302. Cited by: §1, §5.1, Algorithm 2.
  • H. Oosterhuis and M. de Rijke (2021) Unifying online and counterfactual learning to rank: a novel counterfactual estimator that effectively utilizes online interventions. In International Conference on Web Search and Data Mining, pp. 463–471. Cited by: §1.
  • H. Oosterhuis, A. Schuth, and M. de Rijke (2016) Probabilistic multileave gradient descent. In European Conference on Information Retrieval, pp. 661–668. Cited by: §5.1.
  • T. Qin and T. Liu (2013) Introducing letor 4.0 datasets. arXiv preprint arXiv:1306.2597. Cited by: §4, §6.1.
  • K. Roberts, M. Simpson, D. Demner-Fushman, E. Voorhees, and W. Hersh (2016) State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the trec 2014 cds track. Information Retrieval Journal 19 (1), pp. 113–148. Cited by: §6.
  • K. Roberts, M. S. Simpson, E. M. Voorhees, and W. R. Hersh (2015) Overview of the trec 2015 clinical decision support track. In Text REtrieval Conference, Cited by: §6.
  • A. G. Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger (2019) Braintorrent: a peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731. Cited by: footnote 1.
  • F. Sattler, K. Müller, and W. Samek (2020) Clustered federated learning: model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems. Cited by: §2.1.
  • V. Smith, C. Chiang, M. Sanjabi, and A. Talwalkar (2017) Federated multi-task learning. arXiv preprint arXiv:1705.10467. Cited by: §2.1.
  • Y. Song, H. Wang, and X. He (2014) Adapting deep ranknet for personalized search. In International Conference on Web Search and Data Mining, pp. 83–92. Cited by: §2.2.
  • H. Wang, L. Muñoz-González, D. Eklund, and S. Raza (2021a) Non-iid data re-balancing at iot edge with peer-to-peer federated learning for anomaly detection. In Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks, pp. 153–163. Cited by: §2.1, §8, footnote 1.
  • K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage (2019) Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252. Cited by: §2.1.
  • S. Wang, B. Liu, S. Zhuang, and G. Zuccon (2021b) Effective and privacy-preserving federated online learning to rank. In ICTIR ’21: The 2021 ACM SIGIR International Conference on the Theory of Information Retrieval, pp. 3–12. Cited by: §1, §1, §2.1, §2.2, §3, §5.1, §6.1, §6.2, §9.
  • S. Wang, S. Zhuang, and G. Zuccon (2021c) Federated online learning to rank with evolution strategies: a reproducibility study. In European Conference on Information Retrieval, pp. 134–149. Cited by: §1, §5.1, §6.2.
  • Y. Wang, Y. Tong, D. Shi, and K. Xu (2021d) An efficient approach for cross-silo federated learning to rank. In International Conference on Data Engineering, pp. 1128–1139. Cited by: §2.2, §6.2.
  • S. Whittaker and C. Sidner (1996) Email overload: exploring personal information management of email. In Conference on Human Factors in Computing Systems: Common Ground, pp. 276–283. Cited by: §6.
  • Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–19. Cited by: §4, footnote 2.
  • T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays (2018) Applied federated learning: improving google keyboard query suggestions. arXiv preprint arXiv:1812.02903. Cited by: §2.2.
  • J. Yao, Z. Dou, and J. Wen (2021) FedPS: a privacy protection enhanced personalized search framework. In The Web Conference 2021, pp. 3757–3766. Cited by: §2.2.
  • J. Yao, Z. Dou, J. Xu, and J. Wen (2020) RLPer: a reinforcement learning model for personalized search. In The Web Conference 2020, pp. 2298–2308. Cited by: §2.2.
  • Y. Zhang, X. Liu, and C. Zhai (2017) Information retrieval evaluation as search simulation: a general formal framework for ir evaluation. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, pp. 193–200. Cited by: §8.
  • Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §2.1, §2.1, §6.3.
  • H. Zhu, J. Xu, S. Liu, and Y. Jin (2021) Federated learning on non-iid data: a survey. Neurocomputing 465, pp. 371–390. Cited by: Figure 1, §1, §1, §2.1, §2.1, §4, §4, §9.
  • S. Zhuang and G. Zuccon (2020) Counterfactual online learning to rank. In European Conference on Information Retrieval, pp. 415–430. Cited by: §1, §5.1.
  • S. Zhuang and G. Zuccon (2021) How do online learning to rank methods adapt to changes of intent?. In International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §1, §5.1, §5.1, §7.2.
  • L. Zong, Q. Xie, J. Zhou, P. Wu, X. Zhang, and B. Xu (2021) FedCMR: federated cross-modal retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1672–1676. Cited by: §2.2.


Table 4 reports the values of the parameters of the SDBN click models we used in the experiments.

Figure 8 reports the results of our experiments for non-IID data type 3: click preferences.

Figure 9 reports the results of our experiments for non-IID data type 4: data quantity.

For both type 3 and type 4 data experiments, as well as for other data types, the interested reader can find additional analysis and figures in the online appendix.

Click probabilities
rel(d)          0            1           2        3        4
perfect         0.0 (0.0)    0.2 (1.0)   0.4 (-)  0.8 (-)  1.0 (-)
navigational    0.05 (0.05)  0.3 (0.95)  0.5 (-)  0.7 (-)  0.95 (-)
informational   0.4 (0.3)    0.6 (0.7)   0.7 (-)  0.8 (-)  0.9 (-)

Stop probabilities
rel(d)          0            1           2        3        4
perfect         0.0 (0.0)    0.0 (0.0)   0.0 (-)  0.0 (-)  0.0 (-)
navigational    0.2 (0.2)    0.3 (0.9)   0.5 (-)  0.7 (-)  0.9 (-)
informational   0.1 (0.1)    0.2 (0.5)   0.3 (-)  0.4 (-)  0.5 (-)

Table 4. Instantiations of the SDBN click model used to simulate user behaviour in our experiments; rel(d) denotes the relevance label of document d. In the intent-change dataset, only two levels of relevance are used; values for intent-change are shown in brackets.
(a) MSLR-WEB10k (linear ranker) under SDBN clicks
(b) MSLR-WEB10k (linear ranker) under PBM clicks
Figure 8. Offline performance (nDCG@10) on MSLR-WEB10k for Type 3, separately under the SDBN and PBM click models; results averaged across all dataset splits and experimental runs.
(a) MSLR-WEB10k (linear ranker)
(b) intent-change (linear ranker) mixed with type 1
(c) MSLR-WEB10k (linear ranker) mixed with type 2 ()
Figure 9. Offline performance (nDCG@10) on MSLR-WEB10k and intent-change for Type 4, under three instantiations of the SDBN click model; results averaged across all dataset splits and experimental runs.