FEL: High Capacity Learning for Recommendation and Ranking via Federated Ensemble Learning

by   Meisam Hejazinia, et al.

Federated learning (FL) has emerged as an effective approach to address consumer privacy needs. FL has been successfully applied to certain machine learning tasks, such as training smart keyboard models and keyword spotting. Despite FL's initial success, many important deep learning (DL) use cases, such as ranking and recommendation tasks, have seen limited use of on-device learning. One of the key challenges faced by practical FL adoption for DL-based ranking and recommendation is the prohibitive resource requirements that cannot be satisfied by modern mobile systems. We propose Federated Ensemble Learning (FEL) as a solution to tackle the large memory requirement of deep learning ranking and recommendation tasks. FEL enables large-scale ranking and recommendation model training on-device by simultaneously training multiple model versions on disjoint clusters of client devices. FEL integrates the trained sub-models via an over-arch layer into an ensemble model that is hosted on the server. Our experiments demonstrate that FEL leads to 0.43–2.31% quality improvement over traditional on-device federated learning, a significant improvement for ranking and recommendation system use cases.



1 Introduction

Federated learning (FL) provides a means to enhance data privacy for AI technologies. FL enables a large number of decentralized computing platforms at the edge to collaboratively train a shared machine learning (ML) model, while having raw data samples remain on-device. In the FL setting, the ML model is pushed to edge devices where data resides. In particular, local models are trained on edge devices, and updates to their parameters are shared with the central server using a secure aggregation protocol Bonawitz et al. (2017); Huba et al. (2022). The server updates parameters of the global model and iterates by broadcasting them to the participating devices. FL has been deployed for a variety of machine learning tasks, such as smart keyboard AbdulRahman et al. (2020), personalized assistant services Hao (2020), computer vision Liu et al. (2020a), healthcare Rieke et al. (2020), and ranking Hartmann et al. (2019); Paulik et al. (2021).

Despite its growing profile, FL has seen limited adoption for ranking and recommendation tasks. The reasons are two-fold: (1) stringent accuracy requirements and (2) the limited compute and memory capacity of client devices. Recommendation systems are subject to strict accuracy requirements. A recent study from Baidu Zhao et al. (2020) indicates that even a 0.1% accuracy drop is considered unacceptable for its ranking and recommendation tasks. However, due to the resource limitation of client devices, only small models with a constrained capacity (e.g., 20MB Huba et al. (2022)) can be used. The small model size requirement necessarily limits the achievable accuracy, as the learning capacity is much smaller than that of models traditionally used in the server setting, whose sizes are on the order of several GBs or TBs Zhao et al. (2020); Lui et al. (2021).

While accuracy drop due to capacity constraints of the FL-trained model appears to be inevitable, it is not the case in the important setting of Label-only Privacy Ghazi et al. (2021); Malek et al. (2021). Indeed, the standard FL setup assumes that the entirety of users’ data is private, and therefore both training and inference must be done on-device, severely limiting the model’s size. On the other hand, if only the labels are deemed private, the inference can be performed server-side, potentially removing the model size pressure. Even if Label-only Privacy is assumed, however, it is unclear how a large server-side inference model can be trained using FL.

One possible approach to train a large server-side model using FL with Label-only Privacy is to use split learning Vepakomma et al. (2018); Poirot et al. (2019). In the Label-only Privacy setting, split learning places only the last few layers of the model on client devices, while the rest of the model remains on the server. During training, the server and the client device exchange forward activations and backward gradients to jointly train the large model, without revealing the labels to the server. However, split learning imposes several efficiency, scalability, and privacy challenges, as communication occurs at every forward/backward pass Vepakomma et al. (2018) and backward gradients can leak label information Pasquini et al. (2021).

To increase the effective learning capacity of FL in the Label-only Privacy case, we propose a new approach, Federated Ensemble Learning (FEL). In FEL, multiple federated learning models, or leaf models, are trained simultaneously on disjoint clusters of client devices. After being trained, the leaf models are combined together (ensembled) to form a larger model at the server. The insight behind FEL is that if client device clusters are appropriately formed, leaf models can learn distinct, potentially complementary, knowledge from each cluster, which are later combined to enhance the overall prediction capability. The idea stems from the logic that marketers have been using for hundreds of years: segmenting users into clusters allows fitting better products Beane and Ennis (1987) or models to each cluster.

In contrast with split learning, FEL combines communication efficiency with strong privacy guarantees. Leaf models can be trained concurrently with each other when there is sufficient number of client devices. As each leaf model is trained using traditional FL, the privacy implication is similar to that of FL (Section 3.2). FEL excels when the number of observations per user is small, but the number of users is high (e.g., in the order of billions), a setting which is common for recommendation and ranking tasks.

We have deployed FEL in the production environment of a large-scale ranking system. We show that the deployed recommendation task achieves a 0.43% precision gain compared to the vanilla FL baseline in the production environment, which is significant in the context of production ranking and recommender systems. [Footnote 1: In similar applications, Zhao et al. (2020) mentions 0.1% as significant and Wang et al. (2017) considers 0.001 logloss, which is around 0.23% in their context, as impactful.] We observe a similar gain of 1.55–2.31% using the open-source datasets: one for Ads click prediction Sabnagapati (2020) and the other on image classification Liu et al. (2015). In addition, we show that FEL outperforms standard FL in the presence of differential privacy (DP) noise by 0.66–1.93%.

2 Background: Federated Learning and Privacy Assumptions

Federated Learning (FL): FL trains a model collaboratively using clients’ devices without clients’ having to share their raw data with the server. To train a model with FL, the central server first selects clients to participate from its client pool and broadcasts the model to selected clients. Then, each client trains the model using their own data, using their local computation resources. When the training is finished, each client sends back their trained model (or equivalently, the gradient) to the server. Finally, the server collects the gradients and aggregates them (e.g., by taking a weighted average McMahan et al. (2017)) to update the server-side model. The process repeats until the model converges.
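The training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the production system: the one-parameter model `y = w * x`, the helpers `local_train`, `fedavg_round`, and `make_client`, and all hyperparameter values are hypothetical. The server samples clients, broadcasts the weights, lets each client train on its own data, and takes an example-weighted average in the spirit of FedAvg.

```python
import random


def local_train(weights, data, lr=0.1, epochs=1):
    """Hypothetical client step: plain SGD on a toy model y = w * x (squared loss)."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w[0] * x - y) * x
            w[0] -= lr * grad
    return w


def fedavg_round(global_weights, clients, clients_per_round, rng):
    """One synchronous round: select clients, broadcast, train locally, average."""
    selected = rng.sample(clients, clients_per_round)
    total_examples = sum(len(data) for data in selected)
    new_weights = [0.0] * len(global_weights)
    for data in selected:
        local = local_train(global_weights, data)
        # Weight each client's model by its share of the round's examples.
        for i, w in enumerate(local):
            new_weights[i] += w * len(data) / total_examples
    return new_weights


def make_client(rng):
    """Toy private dataset: a few (x, y) pairs drawn from y = 3x."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(rng.randint(2, 5))]
    return [(x, 3.0 * x) for x in xs]


rng = random.Random(0)
clients = [make_client(rng) for _ in range(20)]
weights = [0.0]
for _ in range(50):  # repeat rounds until (approximately) converged
    weights = fedavg_round(weights, clients, clients_per_round=5, rng=rng)
print(weights)
```

Raw `(x, y)` pairs never leave `local_train`; only the locally updated weights are returned to the server-side averaging step.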

Scaling Federated Learning: The key constraint in using vanilla FL training for recommendation and ranking tasks is the limited model size it can support. In many federated recommendation and ranking tasks, the input space is large, e.g., over 1,000 features, and the data distribution is multi-modal. To achieve high accuracy in this setting, we have to increase the overall model capacity. However, FL can be applied only to sufficiently small models that can be trained on-device.

Prior work studied increasing the model capacity by leveraging client heterogeneity: training larger models on devices that are more powerful, while sending smaller models to less capable devices. The smaller model can be a subsampled model that is later aggregated into the supernet Horvath et al. (2021); Caldas et al. (2018); Diao et al. (2021), or an entirely different model that later transfers its learned knowledge to the larger model through knowledge distillation Lin et al. (2020); Li and Wang (2019). These approaches still limit the capacity of the model, as the model size is capped by the most powerful client devices (e.g., high-end smartphones), which still cannot train GB-size models.

In the case of Label-only Privacy, it is possible to use an inference model that is too large to fit onto any single device. Still, it is unclear how such a large inference model can be trained when labels reside on the client devices. Split learning splits the model and trains only the latter layers on the device, while training the rest on the server Vepakomma et al. (2018); Poirot et al. (2019). Split learning, however, requires the device and the server to exchange intermediate activations and gradients, which can leak private label information Pasquini et al. (2021). Also, many split learning approaches target cross-silo FL, where only a handful of clients participate Vepakomma et al. (2018); Poirot et al. (2019), and it is unclear how these approaches can scale to cross-device FL with billions of devices, as each forward/backward pass of each client involves communicating with the server.

Other work used client models as the teacher to train a separate, larger server-side model He et al. (2020); Cho et al. (2022). These approaches, however, require both the client and the server to hold a representative public dataset to perform knowledge distillation He et al. (2020); Cho et al. (2022). In the real world, such a public dataset is unrealistic to assume in many scenarios.

Privacy Assumptions of FEL: FEL targets the Label-only Privacy setting, where the input is public (i.e., accessible to the service provider) while the label is private. Several advertising, recommendation, and survey/analytics applications fall into this category Ghazi et al. (2021); Malek et al. (2021); Nalpas and Dutton (2020); Pfeiffer III et al. (2021).

Many recommendation tasks use features that are public (vis-à-vis the recommender system). Public features include user attributes that are explicitly shared at sign-up time (e.g., age or gender) Sabnagapati (2020), externally observable user behavior (e.g., public movie reviews) Grouplens (2016), or item information that is provided by the item vendors Naumov et al. (2019). On the other hand, the labels of recommendation tasks may be considered private, as they often capture the users’ conversion behavior (e.g., whether the user engaged with the recommended item by watching, clicking, buying, or signing up) Niu et al. (2020); Ghazi et al. (2021). It is also possible to use a mixture of public and private features. FEL can be extended to incorporate private features (Section 3).

While user labels must be kept private, i.e., unknown to the service provider, there can be opt-in users who willingly agree to share their private label information to the service provider to improve the service quality. FEL does not require the presence of opt-in users; however, having some opt-in user population can simplify the training algorithm (Section 3) and lower the privacy cost (Section 3.2).

To avoid statistical inference attacks targeting FL, differentially private (DP) noise is added either on device or during the aggregation step Geyer et al. (2017); Truex et al. (2019); Wei et al. (2020). There are various notions of privacy, but we focus on user-level differential privacy, in which the trained model weights are similarly distributed with or without a particular user McMahan et al. (2017). We assume an honest-but-curious provider for training FL, which uses either hardware-based encryption in a trusted enclave, or software-based encryption via multiparty computation for FL aggregation Li et al. (2020b); Mondal et al. (2021).
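One common way to add DP noise during aggregation is to clip each client's update and add Gaussian noise to the clipped sum. The sketch below illustrates that generic recipe only; it is not claimed to be the mechanism used in this work, and the names `clip_update`, `dp_aggregate`, `clip_norm`, and `noise_multiplier` are hypothetical. Privacy accounting (converting the noise level into a privacy cost) is omitted.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def dp_aggregate(updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to their sum.

    Clipping bounds any single user's contribution by clip_norm, so noise with
    standard deviation noise_multiplier * clip_norm masks one user's presence.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * clip_norm
    return [(s + rng.gauss(0.0, sigma)) / len(clipped) for s in summed]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.3, 0.4], [-1.0, 0.5]]
noisy_mean = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noisy_mean)
```

In a deployment with secure aggregation, the summation and noise addition would run inside the trusted enclave or multiparty protocol rather than in plain server code.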

3 Proposed Design: Federated Ensemble Learning (FEL)

Rather than training one model across all users, FEL proposes to train a distinct leaf model per user cluster and later aggregate the leaf models on the server to obtain a larger-capacity model. Ensemble methods similar in spirit have shown promising potential with classical machine learning algorithms, e.g., in the form of AdaBoost, random forest, and XGBoost Le et al. (2021). We explored different variations in how the clients are clustered and how the leaf models are aggregated.

3.1 Federated Ensemble Learning

Figure 1 illustrates the proposed design of FEL. First, we cluster the clients into a desired number of clusters (Figure 1, Step 1). Then, we train a leaf model for each cluster using traditional FL (Figure 1, Step 2). After training, the server uses the ensemble of the leaf models for inference requests (Figure 1, Step 3). On each inference request, the server passes the public input features through all the leaf models. Then, the output of each leaf model and/or the output of the last hidden layer of each leaf model is passed to the ensemble aggregation layer, which outputs the final prediction. Below, we lay out how FEL forms clusters and implements ensemble aggregation, and how FEL can be extended to incorporate private features.

Forming Clusters

Client clusters can be formed based on distinctive characteristics of the users, e.g., a user’s age, location, or past preferences. Clusters can also be obtained by simple hashing, or through popular clustering approaches such as k-means or Gaussian mixture methods. Marketers have long been forming clusters of clients to target each cluster more effectively, and those well-studied clustering methods can be adopted Beane and Ennis (1987).
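As a minimal sketch of two of the options above, the snippet below assigns each client to exactly one cluster either by hashing a user identifier or by bucketing a public attribute. The function names, the identifier format, and the age boundaries are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib


def hash_cluster(user_id, num_clusters):
    """Hash-based clustering: a stable, roughly uniform assignment."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_clusters


def age_cluster(age, boundaries=(18, 25, 35, 50)):
    """Attribute-based clustering: bucket a public feature (illustrative ranges)."""
    return bisect.bisect_right(boundaries, age)


users = [("user-%d" % i, 15 + (7 * i) % 50) for i in range(100)]

by_hash = {}
by_age = {}
for user_id, age in users:
    by_hash.setdefault(hash_cluster(user_id, 5), []).append(user_id)
    by_age.setdefault(age_cluster(age), []).append(user_id)

print({k: len(v) for k, v in sorted(by_hash.items())})
print({k: len(v) for k, v in sorted(by_age.items())})
```

Either function yields disjoint clusters by construction, which is the property that leaf-model training relies on.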

Rudimentary Ensemble Aggregation: The simplest way to implement ensemble aggregation is to collect the prediction output of each leaf model and apply a typical aggregation method, such as mean, median, or max, to generate the final prediction. This approach is similar to the bagging technique leveraged in random forests Prasad et al. (2006).
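A minimal sketch of this rudimentary aggregation, assuming each leaf model can be treated as a callable that maps public input features to a score (the stand-in lambdas and the names `AGGREGATORS` and `ensemble_predict` are hypothetical):

```python
import statistics

# Parameter-free aggregation rules applied to the leaf-model outputs.
AGGREGATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "max": max,
}


def ensemble_predict(leaf_models, features, method="mean"):
    """Pass the features through every leaf model and aggregate the scores."""
    scores = [model(features) for model in leaf_models]
    return AGGREGATORS[method](scores)


# Stand-in leaf models; in FEL each would be an FL-trained network.
leaf_models = [
    lambda f: 0.2,
    lambda f: 0.5,
    lambda f: 0.9,
]
features = {"age": 30}
for method in ("mean", "median", "max"):
    print(method, ensemble_predict(leaf_models, features, method))
```

Because these rules have no trainable parameters, this form of aggregation needs no additional pass over user data.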

Neural Network (NN)-based Ensemble Aggregation: NN-based ensemble aggregation uses a separate neural network that takes in the prediction and the output of the last hidden layer of all leaf models as input to generate the final prediction. We call this additional neural network model the over-arch model. The over-arch model is trained after all the leaf models are trained. The over-arch model can be trained in several different ways. In the presence of opt-in users, the server can simply use them to train the over-arch model. Otherwise, the server can again use FL to train the over-arch model on each client, where the server sends the output of the leaf models to the client, and the client uses it as an input to train the over-arch model locally.
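The data flow into the over-arch model can be sketched as below. This is an illustration only: the leaf architecture (one ReLU hidden layer), the single logistic layer standing in for the over-arch network, and all weights are hypothetical, and training of the over-arch parameters is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_forward(leaf, features):
    """Hypothetical leaf model: one ReLU hidden layer and a sigmoid output.

    Returns (last hidden layer activations, prediction).
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)))
              for row in leaf["hidden"]]
    prediction = sigmoid(sum(w * h for w, h in zip(leaf["out"], hidden)))
    return hidden, prediction


def over_arch_input(leaves, features):
    """Concatenate every leaf's last hidden layer and its prediction."""
    vec = []
    for leaf in leaves:
        hidden, prediction = leaf_forward(leaf, features)
        vec.extend(hidden)
        vec.append(prediction)
    return vec


def over_arch_forward(params, vec):
    """Stand-in over-arch model: a single logistic layer over the concatenation."""
    logit = params["bias"] + sum(w * v for w, v in zip(params["w"], vec))
    return sigmoid(logit)


leaves = [
    {"hidden": [[1.0, 0.0], [0.0, 1.0]], "out": [1.0, -1.0]},
    {"hidden": [[0.5, 0.5], [-1.0, 1.0]], "out": [0.3, 0.2]},
]
features = [1.0, 2.0]
vec = over_arch_input(leaves, features)
params = {"w": [0.1] * len(vec), "bias": 0.0}  # untrained placeholder weights
score = over_arch_forward(params, vec)
print(len(vec), score)
```

The leaf models stay frozen at this stage; only `params` would be fitted, either on opt-in users at the server or with another round of FL on clients.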

Extending to Private Features: FEL can be extended to support private features only known to the clients. This can be done by (privately) training a separate leaf model (private leaf model) that only takes in private client-side features. The output of the private leaf model is used as an input to the ensemble aggregation layer along with other leaf models. When the private leaf model is used, part of the inference must happen on-device instead of entirely on the server. Specifically, the server sends outputs of the leaf models to each client, which ensembles them with its private leaf model output on-device for prediction. By performing the ensemble on-device, the input/output of the private leaf model is never exposed to the server. The private leaf model is trained with conventional FL as well.

Figure 1: Federated Ensemble Learning architectures

3.2 Privacy Analysis for Federated Ensemble Learning

Differential privacy for federated learning bounds how much the model parameters can change between two datasets that differ in only one user McMahan et al. (2017). Formally, we define:

Definition 1 (Rényi Differential Privacy (RDP)). A randomized mechanism $\mathcal{M}$ with domain $\mathcal{D}$ is $(\alpha, \epsilon)$-RDP with order $\alpha \in (1, \infty)$ iff for any two neighboring datasets $D, D' \in \mathcal{D}$:

$$D_{\alpha}\left(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\right) = \frac{1}{\alpha - 1} \log \mathbb{E}_{\delta \sim \mathcal{M}(D')}\left[\left(\frac{\mathcal{M}(D)(\delta)}{\mathcal{M}(D')(\delta)}\right)^{\alpha}\right] \le \epsilon$$

FEL is a multi-step process. To analyze the privacy bounds of a multi-step process, a common approach in differential privacy is to evaluate each step individually, then calculate the overall privacy bounds by composing all of the steps. In particular, we focus on two main composition theorems in differential privacy, sequential and parallel composition Dwork and Lei (2009). Formally, we define:

Theorem 1 (Sequential Composition). Let there be $n$ RDP mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_n$, where $\mathcal{M}_i$ is $(\alpha, \epsilon_i)$-RDP, each computed on a dataset $D$ of the input domain $\mathcal{D}$. Then, the composition of the $n$ mechanisms is $(\alpha, \sum_{i=1}^{n} \epsilon_i)$-RDP.

Theorem 2 (Parallel Composition). Let there be $n$ RDP mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_n$, where $\mathcal{M}_i$ is $(\alpha, \epsilon_i)$-RDP, each computed on a disjoint subset of the input domain $\mathcal{D}$. Then, the composition of the $n$ mechanisms is $(\alpha, \max_{i} \epsilon_i)$-RDP.

Sequential composition considers the case where a task uses the same users (even if different steps use different parts of the user’s data) in different steps of an algorithm. For example, if the algorithm has four steps, each with a privacy cost $\epsilon$, and uses the same users in all the steps, the total privacy cost of the algorithm would become $4\epsilon$. Parallel composition considers the case where each step is applied to different users. In the earlier example, if four disjoint sets of users were used for the four steps, the overall privacy cost would be $\max_{i} \epsilon_i$, where $\epsilon_i$ is the privacy cost of the $i$-th step. It should be noted that the compositions are analyzed using RDP. The additivity is not exactly linear in $(\epsilon, \delta)$-DP, and while possible, $(\epsilon, \delta)$-DP cannot yield as tight bounds as RDP.

When analyzing the privacy cost of training the leaf models, the parallel composition theorem applies, as each leaf model is trained with a disjoint set of users. If training a leaf model with cluster $i$ has a privacy cost $\epsilon_i$, the privacy cost of training all the leaf models is $\epsilon_{leaf} = \max_{i} \epsilon_i$.

The privacy cost of the ensemble aggregation layer, $\epsilon_{agg}$, depends on the aggregation method that is used (mean, max, median, or NN-based). When using the mean, max, or median ensemble aggregation, $\epsilon_{agg} = 0$, as both theorems show that the privacy cost increases only when the additional step (ensemble aggregation) uses user data. In this case, the total privacy cost is simply $\epsilon_{total} = \epsilon_{leaf}$.

When the NN-based approach is used, however, user data is used to train the over-arch NN layer and $\epsilon_{agg} > 0$. Depending on exactly how the over-arch NN layer is trained, $\epsilon_{total}$ can be calculated in the following way. First, if the over-arch NN layer is trained with the users that trained the leaf models, the sequential composition theorem applies ($\epsilon_{total} = \epsilon_{leaf} + \epsilon_{agg}$). Second, if the over-arch NN layer is trained with a completely different set of users, the parallel composition theorem applies ($\epsilon_{total} = \max(\epsilon_{leaf}, \epsilon_{agg})$). Finally, if the over-arch NN layer is trained with opt-in users, $\epsilon_{agg} = 0$ because no private data is used, and $\epsilon_{total} = \epsilon_{leaf}$. In all cases, the privacy cost of FEL does not significantly deteriorate over that of vanilla FL (which is similar to $\epsilon_{leaf}$).
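The case analysis above reduces to simple arithmetic on per-step RDP costs at a fixed order. The sketch below works through it with made-up numbers; the function names, the `overarch_users` labels, and the cost values are hypothetical and serve only to make the three cases concrete.

```python
def sequential(costs):
    """Sequential composition: the same users are reused, so costs add up."""
    return sum(costs)


def parallel(costs):
    """Parallel composition: disjoint users, so the largest cost dominates."""
    return max(costs)


def fel_privacy_cost(leaf_costs, overarch_cost, overarch_users):
    """Total cost of FEL for the three ways of training the over-arch layer."""
    leaf_total = parallel(leaf_costs)  # leaf models use disjoint clusters
    if overarch_users == "same":       # users who also trained leaf models
        return sequential([leaf_total, overarch_cost])
    if overarch_users == "disjoint":   # a completely different set of users
        return parallel([leaf_total, overarch_cost])
    if overarch_users == "opt_in":     # no private data is used
        return leaf_total
    raise ValueError("unknown setting: %r" % overarch_users)


leaf_costs = [1.0, 1.5, 1.2]  # illustrative per-cluster costs
for setting in ("same", "disjoint", "opt_in"):
    print(setting, fel_privacy_cost(leaf_costs, 0.5, setting))
```

With mean, median, or max aggregation there is no over-arch training step, so the total equals the leaf cost, as in the opt-in case.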

4 Experimental Methodology

4.1 Datasets

We use three datasets for the purpose of this study. To study recommendation and ranking tasks, we used a production dataset and an open-source dataset, Taobao’s Click-Through-Rate (CTR) prediction dataset Li et al. (2021). To study the effect of FEL on non-recommendation use-cases, we additionally studied the LEAF CelebA Smile Prediction dataset Liu et al. (2015).

Production Dataset: The production dataset is an internal dataset that captures whether a user installs a mobile application after being shown a relevant advertisement item. A few hundred features are used as an input (the exact number cannot be disclosed) to predict a binary label (install/not-install). All the input features are public. For training, we use advertisement data from a random sample of 35 million users over a period of one month. For testing, we use 15 million randomly selected users from the following week.

Taobao CTR Dataset: The Taobao dataset contains 26 million interactions (click/non-click when an Ad was shown) between 1.14 million users and 847 thousand items across an 8-day period. The dataset uses 9 user features (e.g., gender or occupation), 6 item features (e.g., price or brand), and 2 contextual features (e.g., the day of week), all of which we assume to be public to the service provider.

In the Taobao CTR dataset, 16 out of the 17 features are sparse, with a categorical value encoding instead of a continuous, floating point value. While server-based recommendation models use large embedding tables to convert these sparse features into a floating point embedding Zhou et al. (2018); Naumov et al. (2019); Cheng et al. (2016), training such embedding tables on device is complicated because of the large memory capacity requirement (e.g., in the order of GB to TB Zhao et al. (2020); Acun et al. (2021); Wilkening et al. (2021); Lui et al. (2021)) and can leak private information more easily through gradients Niu et al. (2020). Thus, we assume an architecture where embedding tables are pre-trained with opt-in users and are hosted on the server, while the rest of the model is trained with FEL using sparse features translated through the pre-trained tables. We randomly selected 10% of the users as opt-in.

Note that our setup cannot achieve the accuracy that can be reached when we fully train the embedding tables, as we pre-train the embedding table and fix their weight during FL. However, our setup represents a practical FL setup where training embedding tables on-device is prohibitive, due to client resource limitations Nguyen et al. (2021) and privacy concerns Niu et al. (2020).

CelebA Smile Prediction Dataset: While FEL is originally designed for recommendation and ranking tasks, we study its generality to non-recommendation models with CelebFaces Attributes Dataset (CelebA) Liu et al. (2015). CelebA consists of images belonging to unique celebrities. Each image has 40 binary facial attribute annotations (e.g., bald, long hair, attractive, etc) and covers large pose variations and backgrounds. We defined distinguishing between smiling/non-smiling images as our target task.

4.2 Model Architectures

Production/Taobao Dataset: For recommendation datasets (production/Taobao CTR), we use a model that consists of 3 fully-connected hidden layers. The number of units in each hidden layer decreases exponentially with a parameter K. For instance, if and the input layer has 512 features, our neural network would have neurons. For each dataset, we tune K to obtain a resulting model of approximately MB. Doing so allows us to train a neural network even on older, low-tier devices with more limited memory capacity. ReLU is used as an activation function after each layer apart from the last one, where Sigmoid and binary cross-entropy were used.

For both datasets, we use synchronous FL with FedAvg McMahan et al. (2017). We used the following hyperparameters for the Taobao dataset, obtained from an extensive hyperparameter search: client batch size of 32, 5 local epochs, 4096 clients per round, and a learning rate of 0.579 with SGD. Clients are selected at random and each only participates once (1 global epoch). The production dataset used similar hyperparameters.

For the Taobao dataset’s server-side pre-trained embedding table, we use an embedding dimension of 32, and train it with the 10% opt-in users for 1 epoch using the AdaGrad optimizer with a learning rate of 0.01.

CelebA Dataset: For CelebA, we follow the setup of prior work Nguyen et al. (2021) and use a four-layer CNN with a dropout rate of 0.1, stride of 1, and padding of 2. We preprocess all images in train/validation/test sets; each image is resized and cropped to 32×32 pixels, then normalized by 0.5 mean and 0.5 standard deviation. We use asynchronous FL with a client batch size of 32 samples, 1 local epoch, 30 global epochs, and a learning rate of 0.899 with SGD.

4.3 FL Baseline and FEL

Both the FL baseline and the FEL leaf models used the same set of hyperparameters. The FL baseline is trained using all the available client data. In FEL, the client data is clustered, and one leaf model is trained for each cluster. We vary the number of clusters from 3–10 and evaluate different clustering methods. When training the over-arch NN layer, a small subset of opt-in users is used.

5 Evaluation Results and Analysis

| Dataset | Config | Feature | # clusters |
|---|---|---|---|
| Production | Clustering 1 | Age | 5 |
| | Clustering 2 | App | 5 |
| | Clustering 3 | Location | 4 |
| | Clustering 4 | Click ratio | 10 |
| Taobao Sabnagapati (2020) | Clustering 1 | Age | 7 |
| | Clustering 2 | Consumption | 4 |
| | Clustering 3 | City level | 5 |
| CelebA Liu et al. (2015) | Clustering 1 | # Attributes | 3 |
| | Clustering 2 | K-means | 3 |
| | Clustering 3 | K-means | 5 |

Table 1: Explanation of different cluster methods in Figure 2 (right).

Our evaluation aims to answer the following questions:

  • Can FEL improve the model prediction quality over vanilla FL? [Section 5.1]

  • How do different ensemble aggregation methods affect the model accuracy? [Section 5.1]

  • How do different clustering methods affect the model accuracy? [Section 5.2]

  • How does FEL affect privacy compared to vanilla FL? [Section 5.3]

5.1 Prediction Quality Improvement of FEL

Overall, FEL achieves 0.43% and 2.31% prediction quality improvement over vanilla FL for the production and Taobao datasets, respectively, a significant improvement for ranking and recommendation system use cases. [Footnote 2: Zhao et al. (2020) mentioned 0.1% model quality improvement as significant and Wang et al. (2017) considered 0.23% as impactful in similar recommendation and ranking use-cases.] For non-recommendation tasks (CelebA), FEL shows a similar improvement of 1.55%, indicating that FEL can be generalized to non-recommendation use-cases as well. Table 2 summarizes the resulting prediction quality improvement of FEL compared to the baseline FL. Following the common practice for each dataset, we used accuracy for CelebA Nguyen et al. (2021) and ROC-AUC (AUC) for Taobao Sabnagapati (2020). We used normalized entropy for the production dataset; we cannot disclose its absolute value and show only the relative improvement. For each ensemble aggregation method, we vary the clustering methods and report the best-accuracy results.

| Method | Production | Taobao Sabnagapati (2020) AUC | CelebA Liu et al. (2015) accuracy | Geomean |
|---|---|---|---|---|
| Baseline FL | - | 0.5418 [Footnote 3] | 90.75 | - |
| FEL (Mean Best) | (+0.27%) | 0.5522 (+1.92%) | 91.68 (+1.02%) | (+1.07%) |
| FEL (Median Best) | (+0.29%) | 0.5459 (+0.74%) | 91.35 (+0.66%) | (+0.56%) |
| FEL (Max Best) | (-0.06%) | 0.5418 (-0.1%) | 91.46 (+0.78%) | (+0.21%) |
| FEL (NN-based Best) | (+0.43%) | 0.5544 (+2.31%) | 92.16 (+1.55%) | (+1.43%) |

[Footnote 3: Taobao’s baseline AUC is 0.26% less than the baseline FL result presented in Niu et al. (2020), potentially due to the simpler model architecture and frozen pre-trained embedding tables.]

Table 2: FEL’s prediction accuracy improvement over the baseline FL for different datasets. Following common practice of each dataset, Taobao uses AUC and CelebA uses accuracy as their metric. Production data’s baseline accuracy is not disclosed.
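The text does not spell out how the Geomean column is computed. One reading that is consistent with the reported numbers, offered here as an assumption, is the geometric mean of the three per-dataset improvement ratios, expressed again as a percentage. The helper `geomean_improvement` is hypothetical; the gains are copied from Table 2 in the order production, Taobao, CelebA.

```python
import math


def geomean_improvement(percent_gains):
    """Geometric mean of improvement ratios (1 + gain), returned in percent."""
    ratios = [1.0 + gain / 100.0 for gain in percent_gains]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


# Relative gains from Table 2: production, Taobao, CelebA.
table2_gains = {
    "Mean": [0.27, 1.92, 1.02],
    "Median": [0.29, 0.74, 0.66],
    "Max": [-0.06, -0.1, 0.78],
    "NN-based": [0.43, 2.31, 1.55],
}
for method, gains in table2_gains.items():
    print("%s: %+.2f%%" % (method, geomean_improvement(gains)))
```

For gains this small, the geometric mean of the ratios is nearly identical to the arithmetic mean of the percentages, so either convention would give the same column to two decimals.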

Among the different ensemble aggregation methods, adding an over-arch NN layer provided the best prediction quality improvement, followed by mean and median. Max only showed improvement in CelebA and did not show benefit in recommendation use-cases.

5.2 Prediction Quality Improvement of Different Clustering Methods

Figure 2: Accuracy improvement for different numbers of clusters (segments) for each ensemble aggregation method (left), and for different clustering methods when using the over-arch NN layer (right). The different clustering methods are explained in Table 1.

Effects of the Number of Clusters: To understand the effect of the number of client clusters in the final model quality improvement, we varied the number of clusters in the Taobao dataset while using random clustering. Figure 2 (left) summarizes the result. There is an optimal setting for the number of clusters used in FEL. Going beyond the optimal setting for the number of clusters results in worse model accuracy. When the number of clusters is too small, the final model capacity is limited as there are not enough leaf models to ensemble. If the number of clusters is too large, each leaf model cannot learn enough information as the clients in each cluster are too few. The optimal number of clusters depends on the number of available devices that participate within each cluster and, here, the number of partitions can be treated as a hyperparameter Kim and Wu (2021).

Effects of Features Used in Clustering: We also varied the clustering methods for each dataset and observed their effect on the final model accuracy. We explored different clustering methods for different datasets and presented the best performing methods. Table 1 summarizes the clustering methods that we explored. Here, we show the result for the best performing over-arch NN-based ensemble aggregation for brevity. For the production dataset, we used user age, the app category where the ad was displayed, location (larger geographic regions), and previous click ratio of the users to cluster the users. For Taobao dataset, we used user age, city level, and consumption level. For CelebA, we clustered the 40 binary attributes of each user using K-means clustering or simply used the number of present attributes.

Figure 2 (right) shows that clustering can affect the final model accuracy significantly. For the production dataset, clustering using the click ratio (Clustering 4) showed the best accuracy. For Taobao, clustering with city level showed the best accuracy (Clustering 3). For CelebA, using K-means clustering was the best (Clustering 3). The results show that clustering methods as well as the number of clusters are two important hyperparameters of FEL.

5.3 Evaluation Results with Differential Privacy

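| Config | $\epsilon$ | Mean | Median | Max | Over-Arch NN |
|---|---|---|---|---|---|
| Clustering 1 | $\infty$ (no DP) | +1.27% | -0.23% | -0.10% | +1.86% |
| | 3.78 | +0.63% | -0.14% | +0.11% | +0.66% |
| | 1.56 | +0.32% | -0.23% | +0.34% | +0.68% |
| Clustering 2 | $\infty$ (no DP) | +1.92% | -0.46% | -1.64% | +2.29% |
| | 3.78 | +0.75% | -0.21% | +0.03% | +1.38% |
| | 1.56 | +0.69% | -0.11% | +0.74% | +0.76% |
| Clustering 3 | $\infty$ (no DP) | +1.86% | +0.74% | -2.07% | +2.31% |
| | 3.78 | +1.49% | -0.09% | +1.03% | +1.93% |
| | 1.56 | +0.71% | -0.37% | +1.26% | +1.02% |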

Table 3: Taobao dataset with DP. Percentage of FEL’s accuracy improvement over the FL baseline with the same level of DP noise is shown. Table 1 explains the clustering configurations.

Table 3 shows the accuracy improvement of FEL compared to vanilla FL for two different levels of DP noise, along with the case of no DP noise (ε = ∞). We assume that the over-arch NN layer is trained with opt-in data and that no DP noise is added when training it. Table 3 shows that even when DP noise is added, FEL achieves a meaningful accuracy improvement over vanilla FL. Again, we observe that the over-arch NN layer and mean aggregations provide the most significant gains. However, a smaller ε leads to a reduced accuracy gain, possibly due to the larger injected noise. Another interesting observation is that the max ensemble aggregation improves the accuracy when DP noise is added, unlike the no-DP-noise case, where it did not show any improvement. One possible reason is that DP noise mitigates the effects of outliers in training.
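To make the difference between the non-learned aggregations concrete, the sketch below combines the scores of several leaf models for a single example using mean, median, and max. The function name `aggregate` and the toy scores are assumptions for illustration only; the example is not the paper's implementation and simply shows why max aggregation is the most sensitive to a single outlying leaf model.

```python
from statistics import mean, median


def aggregate(leaf_scores, method):
    """Combine per-leaf scores for one example into a single ensemble score."""
    reducers = {"mean": mean, "median": median, "max": max}
    if method not in reducers:
        raise ValueError(f"unknown aggregation method: {method}")
    return reducers[method](leaf_scores)


# Hypothetical click probabilities from four leaf models; the last is an outlier.
leaf_scores = [0.10, 0.12, 0.11, 0.90]
results = {m: aggregate(leaf_scores, m) for m in ("mean", "median", "max")}
```

Here the max aggregation returns the outlier's score unchanged, the mean is pulled toward it, and the median largely ignores it. This is consistent with the hypothesis above that anything damping outliers in the leaf models would help max aggregation the most.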

6 Related Work

This study resides in the intersection of four areas of study: ensemble distillation, boosted federated learning, local ensemble learning, and ensemble aggregation.

Ensemble Distillation. Lin et al. (2020) propose FedDF, which uses unlabeled data generated by a generative model to aggregate knowledge from all heterogeneous client models, rather than leveraging FedAVG. This model uses average logits and fusion to share learning between heterogeneous models. Gong et al. (2022) focus on communication efficiency and privacy guarantees with one-shot offline knowledge distillation. This proposal keeps the local training asynchronous and independent, and then aggregates the local predictions on unlabeled cross-domain public data. Sui et al. (2020) explore a knowledge distillation strategy that uses the uploaded predictions of an ensemble of local models to train the central model without requiring local parameters to be uploaded. This approach uses only predicted labels on a small dataset to learn a student model from an ensemble of multiple local teacher models.

Boosted Federated Learning. Boosting and bagging are two prominent approaches for model ensemble learning. Li et al. (2020a) distribute data samples with the same features among multiple parties, relaxing privacy concerns. In their approach, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. In the work of Hamer et al. (2020), an ensemble of pre-trained base predictors is trained via federated learning, thus saving on communication costs. Luo et al. (2021) suggest a gradient boosting decision tree (GBDT) method, which takes the average gradient of similar samples and its own gradient as a new gradient to improve the accuracy of the local model.

Local Ensemble Learning. Shi et al. (2021) propose FedEnsemble, which uses random permutations to update a group of models and then obtains predictions through model averaging, instead of aggregating local models to update a single global model. Similarly, Majeed and Hong (2020) suggest an ensemble learning FL regime in which five base FL models are trained using the same local datasets and ensembled using a simple majority voting rule. Attota et al. (2021) propose MV-FLID, a multi-view ensemble learning approach that helps maximize the learning efficiency for different classes of attacks in intrusion detection tasks.

Ensemble Aggregation. Chen and Chao (2020) suggest FedBE, which takes a Bayesian inference perspective by sampling higher-quality global models and combining them via a Bayesian ensemble for robust aggregation. Guha et al. (2019) propose one-shot learning, where a central server learns a global model over a network of federated devices in a single round of communication. Liu et al. (2020b) suggest FedGRU for the business-to-business (B2B) setting, which uses both secure parameter aggregation and cluster ensembles to scale. Bian et al. (2021) extend federated ensembles to the context of semi-supervised learning, leveraging self-ensembles to enable clients to label their own data. Orhobor et al. (2020) suggest assigning users to pre-specified bins and training a different regressor on each bin; the regressors are later ensembled for precision medicine.

Although these methods have their own merits, they do not address recommender and ranking system use cases, in which each user has only a small number of examples and user-level privacy guarantees are required. As a result, none of these studies leverages the variation across users and the diversity of user behavior. Our approach clusters the users and trains different models on different clusters, leveraging the large user base of recommender and ranking systems. Furthermore, our user partitioning, feature generation by the last hidden layer, and over-arch model training provide extra gains in terms of both precision and privacy budget.
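The combination of per-cluster leaf models, last-hidden-layer features, and an over-arch model can be sketched as follows. This is a toy forward pass under stated assumptions: each leaf model is reduced to a single ReLU hidden layer, the over-arch model is a single logistic layer standing in for the over-arch NN, and all names and weights (`leaf_hidden`, `over_arch_predict`, `leaf_weights`, `arch_weights`) are hypothetical.

```python
import math


def leaf_hidden(features, weights):
    """Last hidden layer of a toy leaf model: one ReLU unit per weight row."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]


def over_arch_predict(features, leaf_weights, arch_weights, arch_bias):
    """Concatenate every leaf's hidden features and apply a logistic layer."""
    concat = []
    for weights in leaf_weights:
        concat.extend(leaf_hidden(features, weights))
    if len(concat) != len(arch_weights):
        raise ValueError("over-arch weights must match the concatenated size")
    logit = arch_bias + sum(w * h for w, h in zip(arch_weights, concat))
    return 1.0 / (1.0 + math.exp(-logit))


# Two hypothetical leaf models (one per cluster), each with two hidden units.
leaf_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 1.0]],
]
arch_weights = [0.5, -0.5, 0.25, 0.0]
score = over_arch_predict([2.0, 1.0], leaf_weights, arch_weights, arch_bias=0.0)
```

In this sketch only `arch_weights` and `arch_bias` would be trained on the server, while each entry of `leaf_weights` would come from a leaf model trained on one cluster of devices.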

7 Conclusion

While Federated Learning is gaining traction for select applications, it cannot be directly applied to ranking and recommendation tasks that require large models. We introduce Federated Ensemble Learning (FEL), which increases the learning capacity of FL models. FEL clusters the client population and trains one leaf model per cluster; the leaf models are later ensembled to form a larger inference model. FEL can be trained efficiently without introducing significant privacy concerns and can meaningfully improve prediction accuracy compared to vanilla FL. FEL enables FL for demanding ranking and recommendation tasks.

As future work, we plan to evaluate FEL on different model architectures, such as transformers, LSTMs, RNNs, or CNNs, and on other tasks, such as speech recognition and reinforcement learning. FL-based design must also consider the inter-dependence of data and system heterogeneity, as observed in real-world, large-scale federated recommendation learning (Maeng et al., 2022). Furthermore, we plan to integrate unsupervised clustering approaches, so that the segmentation can happen automatically to optimise FEL’s performance. Finally, to minimise the cost of managing a number of leaf models, we plan to study automated ways to assess the quality of each leaf model in order to pinpoint under-represented clusters and seek possible mitigations, such as dynamic clustering and leaf retraining.


We would like to thank Milan Shen, Will Bullock, Hung Duong, and Kim Hazelwood for supporting the work.


  • S. AbdulRahman, H. Tout, H. Ould-Slimane, A. Mourad, C. Talhi, and M. Guizani (2020) A survey on federated learning: the journey from centralized to distributed on-site learning and beyond. IEEE Internet of Things Journal 8 (7), pp. 5476–5497. Cited by: §1.
  • B. Acun, M. Murphy, X. Wang, J. Nie, C. Wu, and K. Hazelwood (2021) Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High Performance Computer Architecture (HPCA). Cited by: §4.1.
  • D. C. Attota, V. Mothukuri, R. M. Parizi, and S. Pouriyeh (2021) An ensemble multi-view federated learning intrusion detection for IoT. IEEE Access 9, pp. 117734–117745. Cited by: §6.
  • T. Beane and D. Ennis (1987) Market segmentation: a review. European journal of marketing. Cited by: §1, §3.1.
  • J. Bian, Z. Fu, and J. Xu (2021) FedSEAL: semi-supervised federated learning with self-ensemble learning and negative learning. arXiv preprint arXiv:2110.07829. Cited by: §6.
  • K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. Cited by: §1.
  • S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar (2018) Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210. Cited by: §2.
  • H. Chen and W. Chao (2020) FedBE: making Bayesian model ensemble applicable to federated learning. arXiv preprint arXiv:2009.01974. Cited by: §6.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §4.1.
  • Y. J. Cho, A. Manoel, G. Joshi, R. Sim, and D. Dimitriadis (2022) Heterogeneous ensemble knowledge transfer for training large models in federated learning. arXiv preprint arXiv:2204.12703. Cited by: §2.
  • E. Diao, J. Ding, and V. Tarokh (2021) HeteroFL: computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Cited by: §2.
  • C. Dwork and J. Lei (2009) Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 371–380. Cited by: §3.2.
  • R. C. Geyer, T. Klein, and M. Nabi (2017) Differentially private federated learning: A client level perspective. CoRR abs/1712.07557. Cited by: §2.
  • B. Ghazi, N. Golowich, R. Kumar, P. Manurangsi, and C. Zhang (2021) Deep learning with label differential privacy. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §2, §2.
  • X. Gong, A. Sharma, S. Karanam, Z. Wu, T. Chen, D. Doermann, and A. Innanje (2022) Preserving privacy in federated learning with ensemble cross-domain knowledge distillation. Cited by: §6.
  • GroupLens (2016) MovieLens 20M dataset. Cited by: §2.
  • N. Guha, A. Talwalkar, and V. Smith (2019) One-shot federated learning. arXiv preprint arXiv:1902.11175. Cited by: §6.
  • J. Hamer, M. Mohri, and A. T. Suresh (2020) FedBoost: a communication-efficient algorithm for federated learning. In International Conference on Machine Learning, pp. 3973–3983. Cited by: §6.
  • K. Hao (2020) How Apple personalizes Siri without hoovering up your data. Technology Review. Cited by: §1.
  • F. Hartmann, S. Suh, A. Komarzewski, T. D. Smith, and I. Segall (2019) Federated learning for ranking browser history suggestions. arXiv preprint arXiv:1911.11807. Cited by: §1.
  • C. He, M. Annavaram, and S. Avestimehr (2020) Group knowledge transfer: federated learning of large CNNs at the edge. Advances in Neural Information Processing Systems 33, pp. 14068–14080. Cited by: §2.
  • S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. Venieris, and N. Lane (2021) FjORD: fair and accurate federated learning under heterogeneous targets with ordered dropout. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • D. Huba, J. Nguyen, K. Malik, R. Zhu, M. Rabbat, A. Yousefpour, C. Wu, H. Zhan, P. Ustinov, H. Srinivas, et al. (2022) Papaya: practical, private, and scalable federated learning. Proceedings of Machine Learning and Systems 4. Cited by: §1, §1.
  • Y. G. Kim and C. Wu (2021) AutoFL: enabling heterogeneity-aware energy efficient federated learning. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’21. Cited by: §5.2.
  • N. K. Le, Y. Liu, Q. M. Nguyen, Q. Liu, F. Liu, Q. Cai, and S. Hirche (2021) FedXGBoost: privacy-preserving XGBoost for federated learning. arXiv preprint arXiv:2106.10662. Note: Presented at International Workshop on Federated and Transfer Learning for Data Sparsity and Confidentiality (FTL-IJCAI’21). Cited by: §3.
  • D. Li and J. Wang (2019) FedMD: heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581. Cited by: §2.
  • L. Li, J. Hong, S. Min, and Y. Xue (2021) A novel CTR prediction model based on DeepFM for Taobao data. In 2021 IEEE International Conference on Artificial Intelligence and Industrial Design (AIID), pp. 184–187. Cited by: §4.1.
  • Q. Li, Z. Wen, and B. He (2020a) Practical federated gradient boosting decision trees. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 4642–4649. Cited by: §6.
  • Y. Li, Y. Zhou, A. Jolfaei, D. Yu, G. Xu, and X. Zheng (2020b) Privacy-preserving federated learning framework based on chained secure multiparty computing. IEEE Internet of Things Journal 8 (8), pp. 6178–6186. Cited by: §2.
  • T. Lin, L. Kong, S. U. Stich, and M. Jaggi (2020) Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems 33, pp. 2351–2363. Cited by: §2, §6.
  • Y. Liu, A. Huang, Y. Luo, H. Huang, Y. Liu, Y. Chen, L. Feng, T. Chen, H. Yu, and Q. Yang (2020a) FedVision: an online visual object detection platform powered by federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 13172–13179. Cited by: §1.
  • Y. Liu, J. James, J. Kang, D. Niyato, and S. Zhang (2020b) Privacy-preserving traffic flow prediction: a federated learning approach. IEEE Internet of Things Journal 7 (8), pp. 7751–7763. Cited by: §6.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: §1, §4.1, §4.1, Table 1, Table 2.
  • M. Lui, Y. Yetim, Ö. Özkan, Z. Zhao, S. Tsai, C. Wu, and M. Hempstead (2021) Understanding capacity-driven scale-out neural recommendation inference. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Cited by: §1, §4.1.
  • C. Luo, X. Chen, J. Xu, and S. Zhang (2021) Research on privacy protection of multi source data based on improved GBDT federated ensemble method with different metrics. Physical Communication 49, pp. 101347. Cited by: §6.
  • K. Maeng, H. Lu, L. Melis, J. Nguyen, M. Rabbat, and C. Wu (2022) Towards fair federated recommendation learning: characterizing the inter-dependence of system and data heterogeneity. arXiv preprint arXiv:2206.02633. Cited by: §7.
  • U. Majeed and C. S. Hong (2020) Blockchain-assisted ensemble federated learning for automatic modulation classification in wireless networks. Proc. of the KIISE Korea Computer Congress, pp. 756–758. Cited by: §6.
  • M. Malek, I. Mironov, K. Prasad, I. Shilov, and F. Tramer (2021) Antipodes of label differential privacy: PATE and ALIBI. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §2.
  • H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2017) Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963. Cited by: §2, §2, §3.2, §4.2.
  • A. Mondal, Y. More, R. H. Rooparaghunath, and D. Gupta (2021) Flatee: federated learning across trusted execution environments. arXiv preprint arXiv:2111.06867. Cited by: §2.
  • M. Nalpas and S. Dutton (2020) A more private way to measure ad conversions, the event conversion measurement API. Cited by: §2.
  • M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, et al. (2019) Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091. Cited by: §2, §4.1.
  • J. Nguyen, K. Malik, H. Zhan, A. Yousefpour, M. Rabbat, M. Malek, and D. Huba (2021) Federated learning with buffered asynchronous aggregation. arXiv preprint arXiv:2106.06639. Cited by: §4.1, §4.2, §5.1.
  • C. Niu, F. Wu, S. Tang, L. Hua, R. Jia, C. Lv, Z. Wu, and G. Chen (2020) Billion-scale federated learning on mobile clients: a submodel design with tunable privacy. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pp. 1–14. Cited by: §2, §4.1, §4.1, footnote 3.
  • O. I. Orhobor, L. N. Soldatova, and R. D. King (2020) Federated ensemble regression using classification. In International Conference on Discovery Science, pp. 325–339. Cited by: §6.
  • D. Pasquini, G. Ateniese, and M. Bernaschi (2021) Unleashing the tiger: inference attacks on split learning. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 2113–2129. Cited by: §1, §2.
  • M. Paulik, M. Seigel, H. Mason, D. Telaar, J. Kluivers, R. van Dalen, C. W. Lau, L. Carlson, F. Granqvist, C. Vandevelde, et al. (2021) Federated evaluation and tuning for on-device personalization: system design & applications. arXiv preprint arXiv:2102.08503. Cited by: §1.
  • J. J. Pfeiffer III, D. Charles, D. Gilton, Y. H. Jung, M. Parsana, and E. Anderson (2021) Masked LARk: masked learning, aggregation and reporting workflow. arXiv preprint arXiv:2110.14794. Cited by: §2.
  • M. G. Poirot, P. Vepakomma, K. Chang, J. Kalpathy-Cramer, R. Gupta, and R. Raskar (2019) Split learning for collaborative deep learning in healthcare. arXiv preprint arXiv:1912.12115. Cited by: §1, §2.
  • A. M. Prasad, L. R. Iverson, and A. Liaw (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9 (2), pp. 181–199. Cited by: §3.1.
  • N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, et al. (2020) The future of digital health with federated learning. NPJ digital medicine 3 (1), pp. 1–7. Cited by: §1.
  • P. Sabnagapati (2020) Ad display/click data on taobao.com. Cited by: §1, §2, §5.1, Table 1, Table 2.
  • N. Shi, F. Lai, R. A. Kontar, and M. Chowdhury (2021) Fed-ensemble: improving generalization through model ensembling in federated learning. arXiv preprint arXiv:2107.10663. Cited by: §6.
  • D. Sui, Y. Chen, J. Zhao, Y. Jia, Y. Xie, and W. Sun (2020) FedED: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp. 2118–2128. Cited by: §6.
  • S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, and Y. Zhou (2019) A hybrid approach to privacy-preserving federated learning - (extended abstract). Inform. Spektrum 42 (5), pp. 356–357. Cited by: §2.
  • P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar (2018) Split learning for health: distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564. Cited by: §1, §2.
  • R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 1–7. Cited by: footnote 1, footnote 2.
  • K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor (2020) Federated learning with differential privacy: algorithms and performance analysis. IEEE Transactions on Information Forensics and Security 15, pp. 3454–3469. Cited by: §2.
  • M. Wilkening, U. Gupta, S. Hsia, C. Trippel, C. Wu, D. Brooks, and G. Wei (2021) RecSSD: near data processing for solid state drive based recommendation inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Cited by: §4.1.
  • W. Zhao, D. Xie, R. Jia, Y. Qian, R. Ding, M. Sun, and P. Li (2020) Distributed hierarchical GPU parameter server for massive scale deep learning ads systems. Proceedings of Machine Learning and Systems 2, pp. 412–428. Cited by: §1, §4.1, footnote 1, footnote 2.
  • G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §4.1.