Fed-ensemble: Improving Generalization through Model Ensembling in Federated Learning

07/21/2021 ∙ by Naichen Shi, et al.

In this paper we propose Fed-ensemble: a simple approach that brings model ensembling to federated learning (FL). Instead of aggregating local models to update a single global model, Fed-ensemble uses random permutations to update a group of K models and then obtains predictions through model averaging. Fed-ensemble can be readily utilized within established FL methods and does not impose a computational overhead as it only requires one of the K models to be sent to a client in each communication round. Theoretically, we show that predictions on new data from all K models belong to the same predictive posterior distribution under a neural tangent kernel regime. This result in turn sheds light on the generalization advantages of model averaging. We also illustrate that Fed-ensemble has an elegant Bayesian interpretation. Empirical results show that our model has superior performance over several FL algorithms, on a wide range of data sets, and excels in heterogeneous settings often encountered in FL applications.







1 Introduction

The rapid increase in computational power on edge devices has set forth federated learning (FL) as an elegant alternative to traditional cloud/data-center based analytics. FL brings training to the edge, where devices collaboratively extract knowledge and learn complex models (most often deep learning models) with the orchestration of a central server, while keeping their personal data stored locally. This paradigm shift not only reduces privacy concerns but also sets forth many intrinsic advantages including cost efficiency, diversity, and reduced communication, amongst many others [yang2019federated, kairouz2019advances].

The earliest and perhaps most popular FL algorithm is FederatedAveraging (fedavg) [fedavg]. In fedavg, the central server broadcasts a global model (a set of weights) to selected edge devices, these devices run updates based on their local data, and the server then takes a weighted average of the resulting local models to update the global model. This process iterates over hundreds of training rounds to maximize performance for all devices. Fedavg has seen prevailing empirical successes in many real-world applications [servicefl, hard2018federated]. The caveat, however, is that aggregating local models is prone to overfitting and suffers from high variance in learning and prediction when local datasets are heterogeneous, be it in size or distribution, or when clients have limited bandwidth, memory, or unreliable connections that affect their participation in the training process [nishio2019client, moreno2012unifying]. Indeed, in the past few years, multiple papers have shown the high variance in the performance of FL algorithms and their vulnerability to overfitting, specifically in the presence of data heterogeneity or unreliable devices [jiang2019improving, wang2019federated, smith2017federated, nishio2019client, moreno2012unifying]. Since then, some notable algorithms have attempted to improve the generalization performance of fedavg and tackle the above challenges, as discussed in Sec. 2.

In this paper, we adopt the idea of ensemble training in FL and propose Fed-ensemble, which iteratively updates an ensemble of models to improve the generalization performance of FL methods. We show, both theoretically and empirically, that ensembling is efficient in reducing variance and achieving better generalization performance in FL. Our approach is orthogonal to current efforts in FL aimed at reducing communication cost, handling heterogeneity, or finding better fixed points, as such approaches can be directly integrated into our ensembling framework. Specifically, we propose an algorithm to train K different modes. Predictions are then obtained by ensembling all trained modes, i.e., model averaging. Our contributions are summarized below:

  • Model: We propose an ensemble treatment of FL which updates K models over local datasets. Predictions are obtained by model averaging. Our approach does not impose an additional burden on clients, as only one model is assigned to a client at each communication round. We then show that Fed-ensemble has an elegant Bayesian interpretation.

  • Convergence and Generalization: We motivate the generalization advantages of ensembling under a bias-variance decomposition. Using neural tangent kernels (NTK), we then show that predictions at new data points from all models converge to samples from the same limiting Gaussian process in sufficiently overparameterized regimes. This result in turn highlights the improved generalization and reduced variance obtained from averaging predictions, as all models converge to samples from that same limiting posterior. To the best of our knowledge, this is the first theoretical proof of the convergence of general multilayer neural networks in FL in the kernel regime, as well as the first justification for using model ensembling in FL. Our proof also offers standalone value as it extends NTK results to FL settings.

  • Numerical Findings: We demonstrate the superior performance of Fed-ensemble over several FL techniques on a wide range of datasets, including realistic FL datasets in FL benchmarks [fedscale]. Our results highlight the capability of Fed-ensemble to excel in heterogeneous data settings.

2 Related Work

Single-mode FL

Many approaches have been proposed to tackle the aforementioned FL challenges. Below we list a few, though this is by no means an exhaustive list. Fedavg [fedavg] allows inexact local updating to balance communication vs. computation on resource-limited edge devices, while reporting decent performance in mitigating performance bias on skewed client data distributions.

Fedprox [fedprox] attempts to solve the heterogeneity challenge by adding a proxy term to control the shift of model weights between the global and local client models. This proxy term can be viewed as a normal prior distribution on the model weights. There are several influential works aiming at expediting convergence, like FedAcc [fedacc], FedAdam and FedYogi [fedadam], reducing communication cost, like DIANA [diana] and DORE [dore], or finding better fixed points, like FedSplit [fedsplit], FedPD [fedpd], and FedDyn [feddyn]. As aforementioned, these efforts are complementary to our work, and they can be integrated into our ensembling framework.

Ensemble of deep neural nets Recently, ensembling methods in conventional, non-federated deep learning have seen great success. Amongst them, [fastensembling] analyzes the loss surface of deep neural networks and uses a cyclic learning rate to learn multiple models and form ensembles. [deepensemble] visually demonstrates the diversity of randomly initialized neural nets and empirically shows the stability of ensembled solutions. Also, [multiswag] connects ensembling to Bayesian deep neural networks and highlights the benefits of ensembling.

Bayesian methods in FL and beyond There are also some recent attempts to exploit Bayesian philosophies in FL. Very recently, Fedbe was proposed [fedbe] to couple Bayesian model averaging with knowledge distillation on the server. Fedbe formulates a Gaussian or Dirichlet distribution over local models and then uses knowledge distillation on the server to update the global model. This procedure, however, requires additional datasets on the server and a significant computational overhead, which is demanding in FL settings. Besides that, Bayesian nonparametrics have been investigated for advanced model aggregation through matching and re-permuting neurons using neuron matching algorithms [fedma, bnfed]. Such approaches intend to address the permutation invariance of parameters in neural networks, yet suffer from a large computational burden. We also note that Bayesian methods have been exploited in meta-learning, which can achieve personalized FL. For instance, [bayesianmaml] proposes a Bayesian version of MAML [finn2017model] using Stein variational gradient descent [svgd]. An amortized version [amortizedmaml] utilizes variational inference (VI) across tasks to learn a prior distribution over neural network weights.

3 Fed-ensemble

3.1 Parameter updates through Random Permutation

Let N denote the number of clients, where the local dataset of client i is given as D_i and n_i is the number of observations for client i. Also, let D = ∪_{i=1}^N D_i be the union of all local datasets, and f(x; w) denote the model to be learned, parametrized by the weight vector w.


Design principle

Our goal is to get multiple models (modes) engaged in the training process. Specifically, we use K models in the ensemble, where K is a predetermined number. The K modes are randomly initialized by standard initialization, e.g., Xavier [xavierinitialization] or He [heinitialization]. In each training round, Fed-ensemble assigns one of the K modes to each of the selected clients to train on their local data. The server then aggregates the updated models from the same initialization. All K models eventually learn from the entire dataset and allow an improvement to the overall inference accuracy by averaging over the predictions produced by each model. Hereon, we use mode or weight to refer to w_k and model to refer to the corresponding f(x; w_k).

Objective of ensemble training

Since we aim to learn K modes, the objective of FL training can be simply defined as:

min_{w_1, …, w_K} Σ_{k=1}^K Σ_{i=1}^N p_i ℓ_i(w_k),   (1)

where ℓ_i(w) = Σ_{(x_j, y_j) ∈ D_i} ℓ(f(x_j; w), y_j) is the local empirical loss on client i, p_i = n_i / Σ_{i'} n_{i'} is a weighting factor for each client, and ℓ is a loss function such as cross entropy.

1:  Input: Client datasets D_1, …, D_N; number of modes K; number of ages T; initializations w_k for k = 1, …, K
2:  Randomly divide clients into K strata S_1, …, S_K
3:  for age t = 1, 2, …, T do
4:     Index Permutation: P = shuffle_list(1, …, K)
5:     for round r = 1, 2, …, K do
6:        for stratum j = 1, 2, …, K do
7:           Randomly select clients from S_j
8:           Server broadcasts mode w_{k_{j,r}} to the selected clients, where k_{j,r} is determined by P
9:           for each client c in the selection do
10:             w_c ← local_training(w_{k_{j,r}}, D_c)
11:             Client sends w_c to server
12:          end for
13:       end for
14:       [w_1, …, w_K] ← server_update({w_c})
15:    end for
16: end for
Algorithm 1 Fed-ensemble: Fed-ensemble using random permutations

Model training with Fed-ensemble

We now introduce our algorithm Fed-ensemble (shown in Algorithm 1), which is inspired by random permutation block coordinate descent used to optimize (1). We first randomly divide all clients into K strata, S_1, …, S_K. Now let r denote an individual communication round, where clients are sampled in each round. Also, let t denote the training age, where each age consists of K communication rounds. Hence the total number of communication rounds for T ages is KT. At the beginning of each training age, every stratum decides the order of training in this age. To do so, define a permutation matrix P of size K×K such that at each age the rows of P are random permutations of (1, …, K). More specifically, in the r-th communication round of this age, the server samples some clients from each stratum. Clients from stratum S_j get assigned mode w_{k_{j,r}} as their initialization, where k_{j,r} is the (j, r)-th entry of P, and then run a training procedure on their local data, denoted as local_training. Note that the use of the random permutation matrix P not only ensures that modes are trained on diverse clients but also guarantees that every mode is downloaded and trained on all strata in one age. Upon receiving updated models from all clients, the server activates server_update to calculate the new modes [w_1, …, w_K].
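The per-age assignment can be sketched as follows. This is an illustrative reading of Algorithm 1: the Latin-square construction below (one random permutation, shifted cyclically each round) is one simple way to satisfy the requirement that every mode visits every stratum exactly once per age; the paper does not prescribe this exact construction.

```python
import random

def schedule_age(num_modes, rng=random):
    """One training age of Fed-ensemble: build a Latin-square-style
    schedule so that every mode visits every stratum exactly once.

    Returns a list of num_modes rounds; rounds[r][j] is the mode index
    assigned to stratum j in communication round r of this age.
    """
    perm = rng.sample(range(num_modes), num_modes)  # one random permutation per age
    return [{j: perm[(j + r) % num_modes] for j in range(num_modes)}
            for r in range(num_modes)]
```

Within a round the strata train distinct modes, and across the K rounds of an age each stratum cycles through all K modes, matching the guarantee stated above.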

Remark 3.1.

The simplest form of the server_update function is to average the modes from the same initialization: w_k ← Σ_i α_{ik} n_i w_i^{local} / Σ_i α_{ik} n_i, where α_{ik} = 1 if client i downloads mode k at communication round r, and 0 otherwise; the α_{ik} can also be obtained from P. This approach is an extension of fedavg. However, one can directly utilize any other scheme in server_update to aggregate modes from the same initialization. Similarly, different local training schemes can be used within local_training.
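A minimal sketch of this fedavg-style server_update, assuming flat weight vectors and dataset-size weighting (the function name and interface are ours, not the paper's):

```python
import numpy as np

def server_update(local_weights, mode_ids, num_modes, client_sizes):
    """Average returned client weights per mode, weighted by local dataset size.

    local_weights: list of flat weight vectors returned by clients this round
    mode_ids:      mode index each client trained (the alpha_{ik} indicator)
    client_sizes:  number of local datapoints per client (the n_i weighting)
    Returns one aggregated vector per mode, or None for a mode that no
    client trained this round (the server would keep its old weights).
    """
    new_modes = [None] * num_modes
    for k in range(num_modes):
        idx = [i for i, m in enumerate(mode_ids) if m == k]
        if not idx:
            continue  # mode k untrained this round
        total = sum(client_sizes[i] for i in idx)
        new_modes[k] = sum(client_sizes[i] / total * local_weights[i] for i in idx)
    return new_modes
```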

Remark 3.2.

The use of multiple modes does not increase the computation or communication overhead on clients compared with single-mode FL algorithms, since in every round only one mode is sent to and trained on each client. Moreover, through the careful assignment of modes, models trained by Fed-ensemble can efficiently learn statistical patterns from all clients, as proven in the convergence and generalization results in Sec. 4.

Model prediction with Fed-ensemble: After the training process is done, all K modes are sent to the clients, and model prediction at a new input point x is achieved via ensembling: f̄(x) = (1/K) Σ_{k=1}^K f(x; w_k).
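The averaging step itself is a one-liner; the sketch below assumes a generic `predict_fn(w, x)` interface, which is our illustration rather than the paper's API:

```python
import numpy as np

def ensemble_predict(modes, predict_fn, x):
    """Fed-ensemble inference: average the K modes' outputs at input x.

    For regression this is the mean prediction; for classification,
    predict_fn should return class probabilities so that the average
    is a uniformly weighted mixture of the K predictive distributions.
    """
    return np.mean([predict_fn(w, x) for w in modes], axis=0)
```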


3.2 Bayesian Interpretation of Fed-ensemble

Interestingly, Fed-ensemble has an elegant Bayesian interpretation as a variational inference (VI) approach to estimate a posterior Gaussian mixture distribution over model weights. To view this, let the model parameters w admit a posterior distribution p(w | D). Under VI, a variational distribution q(w) is used to approximate p(w | D) by minimizing the KL-divergence between the two:

min_q KL(q(w) ‖ p(w | D)).   (2)

Now take q(w) as a mixture of isotropic Gaussians, q(w) = (1/K) Σ_{k=1}^K N(w; w_k, σ²I), where the centers w_k are the variational parameters to be optimized. When the w_k's are well separated, the entropy term of the KL-divergence does not depend on the w_k's. In the limit σ → 0, q becomes a linear combination of Dirac-delta functions, (1/K) Σ_k δ(w − w_k), and minimizing the KL-divergence reduces to maximizing (1/K) Σ_k log p(w_k | D). Finally, notice that data on different clients are usually independent; therefore the log posterior density factorizes as log p(w | D) = Σ_{i=1}^N log p(D_i | w) + log p(w) + const. If we take the loss function in (1) to be the negative log-likelihood −log p(D_i | w), we then recover (1). This Bayesian view highlights the ensemble diversity, as the w_k can be viewed as the modes of a mixture of Gaussians. Further details on this Bayesian interpretation can be found in the Appendix.

4 Convergence and limiting behavior of sufficiently overparameterized neural networks

In this section we present theoretical results on the generalization advantages of ensembling. We show that one can improve generalization performance by reducing predictive variance. Then we analyze sufficiently overparametrized networks trained by Fed-ensemble. We prove the convergence of the training loss, and derive the limiting model after sufficient training, from which we show how generalization can be improved using Fed-ensemble.

4.1 Variance-bias decomposition

We begin by briefly reviewing the bias-variance decomposition of a general regression problem. We use θ to parametrize the hypothesis class f(x; θ), and f̄(x) to denote the average of f(x; θ) under some distribution q(θ), i.e., f̄(x) = E_{θ∼q}[f(x; θ)]. Similarly, ȳ(x) is defined as E[y | x]. Then:

E_{x,y} E_θ [(y − f(x; θ))²] = E_{x,y}[(y − ȳ(x))²] + E_x[(ȳ(x) − f̄(x))²] + E_x E_θ[(f(x; θ) − f̄(x))²],   (3)

where the expectation over x, y is taken under some input distribution p(x, y), and that over θ is taken under q(θ). In (3), the first term represents the intrinsic noise in the data, referred to as data uncertainty. The second term is the bias term, which represents the difference between the expected prediction f̄(x) and the expected predicted variable ȳ(x). The third term characterizes the variance arising from the discrepancies between different functions in the hypothesis class, also referred to as knowledge uncertainty in [gradientboosting]. In FL, this variance is often large due to heterogeneity, partial participation, etc. However, we will show that this variance decreases through model ensembling.
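The three terms in (3) can be estimated by Monte Carlo: repeatedly resample data and model randomness, then average. The sine target, random RBF features, and noise level below are our illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_predict(x_train, y_train, x_test, rng):
    """One draw from the hypothesis class: least squares on random RBF features."""
    centers = rng.uniform(-3, 3, size=20)
    phi = lambda x: np.exp(-(x[:, None] - centers[None, :]) ** 2)
    w = np.linalg.lstsq(phi(x_train), y_train, rcond=None)[0]
    return phi(x_test) @ w

x_test = np.linspace(-3, 3, 50)
f_true = np.sin(x_test)                    # noiseless target at test points
preds = []
for _ in range(200):                       # resample data noise and features
    x_tr = rng.uniform(-3, 3, 40)
    y_tr = np.sin(x_tr) + 0.3 * rng.standard_normal(40)
    preds.append(fit_predict(x_tr, y_tr, x_test, rng))
preds = np.stack(preds)

bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # squared-bias term in (3)
variance = np.mean(preds.var(axis=0))                 # knowledge-uncertainty term
```

Since the decomposition is exact (here against the noiseless target), squared bias plus variance equals the mean squared error of the individual predictors.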

4.2 Convergence and variance reduction using neural tangent kernels

Inspired by recent work on neural tangent kernels [ntk, ntkgaussian], we analyze sufficiently overparametrized neural networks. We focus on regression tasks and define the local empirical loss in (1) as:

ℓ_i(w) = Σ_{(x_j, y_j) ∈ D_i} (f(x_j; w) − y_j)².   (4)

For this task, we will prove that when overparameterized neural networks are trained by Fed-ensemble, the training loss converges exponentially to a small value determined by the stepsize, and f(x; w_k) for all k converge to samples from the same posterior Gaussian process (GP) defined via a neural tangent kernel. Note that for space limitations we only provide informal statements of the theorems, while details are relegated to the Appendix. Prior to stating our results we introduce some needed notation.

For notational simplicity we drop the subscript k in w_k and use w instead, unless stated otherwise. We let w(0) denote the initial value of w, and w(t) denote w after t local epochs of training. We also let W_init denote the initialization distribution for the weights; conditions on W_init are found in the Appendix. We define an initialization covariance matrix for any inputs x, x' as K_0(x, x') = E_{w(0)∼W_init}[f(x; w(0)) f(x'; w(0))]. Also, we denote the neural tangent kernel of a neural network by Θ(x, x') = lim_{l→∞} ∇_w f(x; w)ᵀ ∇_w f(x'; w), where l represents the minimum width of the neural network in each layer and P denotes the number of trainable parameters. Indeed, [ntk] shows that this limiting kernel remains fixed during training if the width of every layer of the neural network approaches infinity and the stepsize scales appropriately with the width. We adopt the notation in [ntk] and extend the analysis to FL settings.
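At finite width, the tangent kernel Θ(x, x') = ⟨∇_w f(x; w), ∇_w f(x'; w)⟩ can be evaluated empirically. The sketch below uses central finite differences on a tiny scalar-output MLP; the network, initialization scale, and differencing scheme are our illustrative choices (the theory concerns the infinite-width limit, where this kernel stops changing during training).

```python
import numpy as np

def empirical_ntk(f, w, xs, eps=1e-4):
    """Empirical (finite-width) tangent kernel: Gram matrix of the
    per-input weight gradients, approximated by central differences."""
    grads = []
    for x in xs:
        g = np.zeros_like(w)
        for i in range(w.size):
            d = np.zeros_like(w)
            d[i] = eps
            g[i] = (f(w + d, x) - f(w - d, x)) / (2 * eps)
        grads.append(g)
    G = np.stack(grads)        # rows are df(x; w)/dw for each input x
    return G @ G.T             # Theta(x, x') = <grad(x), grad(x')>

def mlp(w, x, hidden=16, din=2):
    """A tiny one-hidden-layer network on a flat weight vector (illustrative)."""
    W1 = w[:din * hidden].reshape(din, hidden)
    b1 = w[din * hidden:din * hidden + hidden]
    W2 = w[din * hidden + hidden:]
    return float(np.tanh(x @ W1 + b1) @ W2)

rng = np.random.default_rng(0)
w0 = rng.normal(0, 1, 2 * 16 + 16 + 16)
xs = [rng.normal(size=2) for _ in range(4)]
theta = empirical_ntk(mlp, w0, xs)
```

The mean of the limiting GP in Theorem 4.2, Θ(x, X) Θ(X, X)^{-1} Y, can then be evaluated with such a kernel matrix on held-out inputs.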

Below we state an informal statement of our convergence result.

Theorem 4.1.

(Informal) Consider the least squares regression task (4), where f(x; w) is a neural network whose minimum layer width l goes to infinity. Then, under the following assumptions:

  (i) Θ(X, X) is full rank, i.e., λ_min(Θ) > 0;

  (ii) the norm of every input x is smaller than 1: ‖x‖ < 1;

  (iii) the stationary points of all local losses coincide: ∇ℓ(w) = 0 implies ∇ℓ_i(w) = 0 for all clients i;

  (iv) the total number of data points in one communication round is a constant;

when we use Fed-ensemble and local clients train via gradient descent with stepsize η, the training error associated with each mode decreases exponentially in the number of communication rounds, provided the learning rate is smaller than a threshold.

Remark 4.1.

Assumptions (i) and (ii) are standard for theoretical development in NTK. Assumption (iii) can be derived directly from the B-local dissimilarity condition in [fedprox]. It is effectively an overparametrization condition: it says that if the gradient of the loss evaluated on the entire union dataset is zero, the gradient of the loss evaluated on each local dataset is also zero. Here we note that recent works [fedpd, feddyn] have tried to provide FL algorithms that work well when this assumption does not hold. As aforementioned, such algorithms can be utilized within our ensembling framework. Assumption (iv) is added only for simplicity: it can be removed if we choose a stepsize according to the number of datapoints in each round.

We would like to note that after writing this paper, we found that [flntk] also shows convergence of the training loss in FL under a kernel regime. Their analysis, however, is limited to a special form of two-layer ReLU-activated networks with the top layer fixed, while we study general multilayer networks. Moreover, that work is mainly concerned with a theoretical understanding of FL under NTK regimes, while our overarching goal is to propose an algorithm aimed at ensembling, Fed-ensemble, and to motivate its use through NTK.

More importantly, and beyond convergence of the training loss, we can analytically calculate the limiting solution of sufficiently overparametrized neural networks. The following theorem shows that the models in the ensemble converge to independent samples from a Gaussian process:

Theorem 4.2.

(Informal) If Fed-ensemble in Algorithm 1 is used to train for the regression task (1) and (4), then after sufficiently many communication rounds, the functions f(x; w_k), k = 1, …, K, can be regarded as independent samples from a GP(μ(x), Σ(x)), with mean μ(x) = Θ(x, X) Θ(X, X)^{-1} Y, the neural tangent kernel regression predictor, and a variance Σ(x) that depends on the initialization covariance K_0, where X denotes the entire dataset ∪_i D_i and Y its labels.

Remark 4.2.

The result in Theorem 4.2 is illustrated in Fig. 1. The central result is that training K modes with Fed-ensemble will lead to predictions f(x; w_1), …, f(x; w_K), all of which are samples from the same posterior GP. The mean of this GP is the exact result of kernel regression using the neural tangent kernel Θ, while the variance depends on the initialization. Hence, via Fed-ensemble one is able to obtain multiple samples from the posterior predictive distribution. This result is similar to the simple sample-then-optimize method of [matthews2017sample], which shows that for a Bayesian linear model, sampling from the prior followed by deterministic gradient descent on the squared loss provides posterior samples given normal priors.

Figure 1: An illustration of an ensemble of 3 samples from the posterior distribution.

A direct consequence of the result above is given in the corollary below.

Corollary 4.1.

Let ε be a positive constant. If the assumptions in Theorem 4.1 are satisfied, then when we train with Fed-ensemble with K modes, after sufficiently many iterations we have:

P( | (1/K) Σ_{k=1}^K f(x; w_k) − μ(x) | ≥ ε ) ≤ Σ(x) / (K ε²),

where μ(x) is the maximum a posteriori solution of the limiting Gaussian process obtained in Theorem 4.2. This corollary is obtained by simply combining the Chebyshev inequality with Theorem 4.2, and shows that since the variance shrinks at the rate 1/K, averaging over multiple models gets closer to μ(x).
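The 1/K rate is easy to verify numerically: averaging K independent samples from a Gaussian posterior shrinks the variance of the average by a factor of K, which is exactly what tightens the Chebyshev bound. The posterior mean and standard deviation below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.0   # assumed posterior mean / std at one test input

for K in [1, 4, 16, 64]:
    # each of 10000 trials averages K independent posterior samples,
    # mimicking the Fed-ensemble prediction with K modes
    avgs = rng.normal(mu, sigma, size=(10000, K)).mean(axis=1)
    # the empirical variance of the ensemble average tracks sigma**2 / K,
    # so P(|avg - mu| >= eps) <= sigma**2 / (K * eps**2) tightens with K
    print(f"K={K:2d}  empirical var={avgs.var():.4f}  sigma^2/K={sigma**2 / K:.4f}")
```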

We finally note that there is a gap between the neural tangent kernel approximation and actual neural networks [cntk]. Yet this analysis still serves to highlight the generalization advantages of FL inference with multiple modes. Moreover, experiments show that Fed-ensemble works well beyond the kernel regime assumed in Theorem 4.1.

5 Experiments

In this section we provide empirical evaluations of Fed-ensemble on five representative datasets of varying sizes. We start with a toy example to explain the bias-variance decomposition, then move to realistic tasks. We note that since ensembling is an approach yet to be fully investigated in FL, we dedicate many experiments to a proof of concept.

A simple toy example with kernels

We start with a toy example on kernel methods that illustrates the benefits of using multiple modes and reinforces the key conclusions from Theorem 4.2. We create 50 clients and generate the data of each client following a noisy sine function with additive observation noise, where each client has a unique parameter sampled from a random distribution. On each client we sample 2 input points uniformly. We use the linear model f(x; w) = φ(x)ᵀw, where φ is a radial basis function feature map. In this basis function, the 100 centers are uniformly randomly sampled parameters that remain constant during training. Note that the expectation of the generated function over clients is a sine function.
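A runnable sketch of this toy study follows. All concrete constants (the client-specific phase distribution, noise level, sampling ranges, learning rate, number of local steps, and K = 5 modes) are our assumptions, since the exact values are not recoverable from the text; only the structure (50 clients, 2 points each, 100 fixed RBF features, fedavg-style training per mode) mirrors the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100
centers = rng.uniform(-np.pi, np.pi, M)               # fixed random RBF centers
phi = lambda x: np.exp(-(np.asarray(x)[:, None] - centers) ** 2)

# 50 clients, 2 points each, with an assumed client-specific phase shift
clients = []
for _ in range(50):
    delta = rng.normal(0, 0.3)
    x = rng.uniform(-np.pi, np.pi, 2)
    y = np.sin(x + delta) + 0.1 * rng.standard_normal(2)
    clients.append((phi(x), y))

def train_mode(rounds=300, local_steps=5, lr=0.02, clients_per_round=10):
    """One mode: fedavg-style rounds of local gradient descent on squared loss."""
    w = rng.normal(0, 1, M)                           # random init -> mode diversity
    for _ in range(rounds):
        picked = rng.choice(len(clients), clients_per_round, replace=False)
        updates = []
        for c in picked:
            A, y = clients[c]
            wc = w.copy()
            for _ in range(local_steps):
                wc -= lr * A.T @ (A @ wc - y) / len(y)   # local squared-loss step
            updates.append(wc)
        w = np.mean(updates, axis=0)                  # simple server averaging
    return w

modes = [train_mode() for _ in range(5)]              # K = 5 modes
x_test = np.linspace(-np.pi, np.pi, 50)
pred = np.mean([phi(x_test) @ w for w in modes], axis=0)   # ensemble average
```

The spread of the individual curves `phi(x_test) @ w` across modes plays the role of the green predictive interval in Fig. 2, while `pred` corresponds to the averaged prediction.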

(a) Prediction performance of Fed-ensemble.
(b) Prediction performance of Fedavg.
Figure 2: Linear model on the toy dataset. Dots represent datapoints from a subset of clients. The green area denotes the predictive interval obtained from the individual models estimated by Fed-ensemble. The "averaged" curve reports the final prediction obtained after model averaging in Fed-ensemble.

We report our predictive results in Table 1 and Fig. 2. Despite the diversity of individual mode predictions (green area), upon averaging, the result becomes more accurate than a fully trained single mode (Fig. 2 (a, b)). Hence, Fed-ensemble is able to average out the noise introduced by individual modes. As a more quantitative measurement, in Table 1 we vary K and calculate the bias-variance decomposition in (3) in each case. As seen in the table, the variance of Fed-ensemble is efficiently reduced compared with a single-mode approach such as FedAvg.

          K=1 (FedAvg)  K=2     K=10    K=20    K=40
Bias      0.109         0.117   0.112   0.113   0.112
Variance  0.0496        0.0115  0.0063  0.0045  0.0042

Table 1: Bias-variance decomposition on the toy regression model. We fix the training data and run each algorithm 100 times from random initializations. We take the average of the 100 models as f̄ in (3). Bias and variance are calculated accordingly. When we increase K from 1 to 40, the bias remains at almost the same level, while the variance decays at a rate slightly slower than 1/K. Variance ceases to decrease once K is larger than 20. This is intuitively understandable: when the variance is already very low, a larger dataset and more communication rounds are needed to decrease it further.

Experimental setup

In our evaluation, we show that Fed-ensemble outperforms its counterparts across two popular but different settings: (i) Homogeneous setting: data are randomly shuffled and then uniformly distributed to clients; (ii) Heterogeneous setting: data are distributed in a non-i.i.d. manner: we first sort the images by label and then assign them to different clients so that each client has images from exactly 2 labels. We experiment with five different datasets of varying sizes:


  • MNIST: a popular image classification dataset with 55k training samples across 10 categories.

  • CIFAR10: a dataset containing 60k images from 10 classes.

  • CIFAR100: a dataset with the similar images of CIFAR10 but categorized into 100 classes.

  • Shakespeare: the complete text of William Shakespeare with 3M characters for next-word prediction.

  • OpenImage: a real-world image dataset with 1.1M images of 600 classes from 13k image uploaders [kuznetsova2020open]. We use the realistic distribution of client data in FedScale benchmark [fedscale].
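The heterogeneous split described in the setup (sort images by label, then assign shards so that each client holds images from exactly 2 labels) can be sketched as follows; the shard construction is one common way to realize such a partition, not necessarily the paper's exact procedure:

```python
import numpy as np

def label_sorted_split(labels, num_clients, labels_per_client=2, seed=0):
    """Non-i.i.d. partition: sort indices by label, slice into shards, and
    give each client `labels_per_client` shards (so ~2 labels per client)."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")         # indices sorted by label
    num_shards = num_clients * labels_per_client
    shards = np.array_split(order, num_shards)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_shards)                # deal shards out at random
    return [np.concatenate([shards[perm[c * labels_per_client + j]]
                            for j in range(labels_per_client)])
            for c in range(num_clients)]
```

When shard boundaries align with label boundaries (as for balanced datasets like CIFAR-10), every client ends up with at most 2 distinct labels.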

In our experiments, we use a fixed number of modes K by default, except in the sensitivity analysis, where we vary K. Initial learning rates for clients are chosen based on the existing literature for each method.

We compare our model with Fedavg, Fedprox, Fedbe, and Fedbe without knowledge distillation (Fedbe-nokd), where we use random sampling to replace knowledge distillation. Some entries in the Fedbe/Fedbe-nokd columns are missing either because the method is impractical (a dataset on the central server does not exist) or because we could not fine-tune the hyperparameters to achieve reasonable performance.

MNIST dataset.

We train a 2-layer convolutional neural network to classify the digits. Results in Table 2 show that Fed-ensemble outperforms all benchmarked single-mode FL algorithms. This confirms the effectiveness of ensembling over single-mode methods in improving generalization. Fedbe turns out to perform closely to single-mode algorithms eventually. We conjecture that this happens because the diversity of local models is lost in the knowledge distillation step. To show the diversity of models in the ensemble, we plot the projection of the loss surface onto a plane spanned by 3 modes from the ensemble at the end of training in Fig. 3. Fig. 3 shows that the modes have rich diversity: different modes correspond to different local minima with a high loss barrier on the line connecting them.

Figure 3: Projection of loss surface to a plane spanned by 3 modes of the ensemble trained by Fed-ensemble. Color represents the logarithm of cross-entropy loss on the entire training set.

In the non-i.i.d. setting, the performance gap between Fed-ensemble and single-mode algorithms becomes larger compared with the i.i.d. setting. This shows that Fed-ensemble can better fit more heterogeneous data distributions compared with Fedavg and Fedprox. This result is indeed expected, as ensembling excels in stabilizing predictions [multiswag].

CIFAR/Shakespeare/OpenImage dataset.

We use ResNet [resnet] for CIFAR, a character-level LSTM model for the sequence prediction task, and ShuffleNet [shufflenet] and MobileNet [mobilenet] for OpenImage. Results in Table 2 show that Fed-ensemble can indeed improve the generalization performance. Further details on the experimental setup can be found in the Appendix.

Testing acc (%)          Fedavg      Fedprox     Fedbe  Fedbe-nokd  Fed-ensemble
MNIST-iid                -           -           -      -           97.75
MNIST-noniid             -           -           -      -           95.44
CIFAR10                  -           -           -      -           71.18
CIFAR100                 -           -           -      -           58.09
Shakespeare              -           -           -      -           62.49
OpenImage (MobileNet)    51.85±0.17  52.93±0.14  -      -           53.92±0.19
OpenImage (ShuffleNet)   53.98±0.16  54.42±0.22  -      -           55.75±0.25

Table 2: Testing accuracy of models trained by different FL algorithms on five datasets. Fedbe-nokd denotes Fedbe without knowledge distillation.

Effect of non-i.i.d client data distribution

N.o.C. Fedavg Fedprox Fed-ensemble Gap
2 66.19 66.87 71.45 4.58
4 83.90 84.40 86.12 1.72
6 85.90 86.10 87.14 1.04
8 85.90 86.06 87.33 1.27
10 86.52 86.77 87.94 1.17

Table 3: Sensitivity analysis with different assigned N.o.C on CIFAR-10. Gap denotes the difference in testing accuracy between Fed-ensemble and Fedprox.

We change the number of classes (N.o.C.) assigned to each client from 10 to 2 on the CIFAR-10 dataset. Conceivably, when each client has fewer categories, the data distribution is more heterogeneous. The results are shown in Table 3. As expected, the performance of all algorithms degrades with such heterogeneity. Furthermore, Fed-ensemble outperforms its counterparts, and the performance gap becomes more pronounced as heterogeneity increases (i.e., smaller N.o.C.). This highlights the ability of Fed-ensemble to improve generalization specifically with heterogeneous clients.

Effect of the number of modes K.

Since the number of modes, K, in the ensemble is an important hyperparameter, we choose different values of K and test the performance on MNIST. We vary K from 3 to 80. The results are shown in Table 4. Besides the testing accuracy of ensemble predictions, we also calculate the accuracy and the entropy of the predictive distribution of each mode in the ensemble. As shown in Table 4, as K increases from 3 to 40, the ensemble prediction accuracy increases as a result of variance reduction. However, when K is very large, entropy increases and mode accuracy drops slightly, suggesting that model predictions are less certain and accurate. The reason is that as K grows, the number of clients, and hence datapoints, assigned to each mode significantly drops. This in turn decreases learning accuracy, specifically when datasets are relatively small and a limited budget exists for communication rounds.

             K=3    K=5    K=20   K=40   K=80
Test acc     -      -      -      -      -
Acc max      95.50  95.60  95.57  95.53  94.84
Acc min      94.97  95.00  94.13  93.54  92.74
Avg entropy  0.16   0.15   0.16   0.18   0.20

Table 4: Sensitivity to the number of modes K on MNIST. Acc max/min is the maximum/minimum testing accuracy across all modes in the ensemble. Avg entropy is the average entropy of the empirical predictive distribution across all modes.

6 Conclusion & potential extensions

This paper proposes Fed-ensemble: an ensembling approach for FL which learns K modes and obtains predictions via model averaging. We show, both theoretically and empirically, that Fed-ensemble is efficient in reducing variance and achieving better generalization performance in FL compared to single-mode approaches. Beyond ensembling, Fed-ensemble may find value in meta-learning, where modes act as multiple learned initializations and clients download the ones with optimal performance based on some training performance metric. This may be a route worthy of investigation. Another potential extension: instead of using random permutations, one may send specific modes to individual clients based on training loss or gradient norm from previous communication rounds. The idea is to increase assignments for the modes with the weakest performance, or to assign such modes to clients where they performed poorly, in an attempt to improve the worst-case performance. Here it is interesting to understand whether the theoretical guarantees still hold in such settings.


7 Additional Experimental Details

In this section we articulate experimental details and present round-to-accuracy figures to better visualize the training process. In all experiments, we decay learning rates by a constant factor of 0.99 after 10 communication rounds. On the image datasets, we use stochastic gradient descent (SGD) as the client optimizer, while on the Shakespeare dataset, we train using Adam. Every set of experiments can be done on a single Tesla V100 GPU.

Fig. 4 shows the round-to-accuracy of several algorithms on the MNIST dataset. Fed-ensemble outperforms all single-mode federated algorithms as well as Fedbe.

(a) MNIST iid.
(b) MNIST noniid.
Figure 4: Training process on MNIST

Fig. 5 shows the results on CIFAR10. Fed-ensemble has better generalization performance from the very beginning.

(a) CIFAR10 iid.
(b) CIFAR10 noniid.
Figure 5: Training process on CIFAR10
(a) Training process on CIFAR100
(b) Training process on Shakespeare
Figure 6: Larger datasets

Fig 6(a) plots round-to-accuracy on CIFAR100. Also, there is a clear gap between Fed-ensemble and FedAvg/FedProx.

The comparison between Fed-ensemble and single-mode algorithms is even more conspicuous on Shakespeare. The next-word-prediction model we use has 2 LSTM layers, each with 256 hidden states. Fig. 6(b) shows that FedAvg and FedProx suffer from severe overfitting when the number of communication rounds exceeds 20, while Fed-ensemble continues to improve. Fedbe is unstable on this dataset.

On the OpenImage dataset, we simulate a realistic FL deployment at Google [google-fl]: we select 130 clients to participate in training, but only the updates from the first 100 clients to complete are aggregated, in order to mitigate stragglers.
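A minimal sketch of this over-selection strategy (the function name, the `(finish_time, weights)` representation, and uniform averaging are our assumptions, not the paper's code):

```python
def aggregate_first_completed(client_updates, n_selected=130, n_keep=100):
    """Over-selection for straggler mitigation: `n_selected` clients are
    asked to train, but only the first `n_keep` updates to arrive are
    averaged, so slow clients never block a round.
    `client_updates` is a list of (finish_time, weight_vector) pairs."""
    assert len(client_updates) == n_selected
    # Sort by completion time and keep the fastest n_keep responders.
    kept = sorted(client_updates, key=lambda p: p[0])[:n_keep]
    dim = len(kept[0][1])
    # Uniform average of the kept weight vectors.
    return [sum(w[d] for _, w in kept) / n_keep for d in range(dim)]
```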

8 Additional Details on Deriving the Variational Objective

As discussed in Sec. 3.2, the objective is to minimize the KL-divergence between the variational and true posterior distributions:

$$\min_{q} \; \mathrm{KL}\big(q(w)\,\|\,p(w\mid\mathcal{D})\big).$$

In this section, we use a Gaussian mixture as the variational distribution and then take the small-variance limit of the result. Given the variational family in (6), $q(w) = \frac{1}{K}\sum_{k=1}^{K}\mathcal{N}(w;\mu_k,\sigma^2 I)$, the KL-divergence can be decomposed by the law of conditional expectation:

$$\mathrm{KL}\big(q\,\|\,p(\cdot\mid\mathcal{D})\big) = \mathbb{E}_{k\sim \mathrm{Uniform}(K)}\,\mathbb{E}_{w\sim\mathcal{N}(\mu_k,\sigma^2 I)}\big[\log q(w) - \log p(w\mid\mathcal{D})\big],$$

where $\mathrm{Uniform}(K)$ denotes a uniform distribution on $\{1,\dots,K\}$. When $\sigma$ is very small, different mode centers ($\mu_k$) are well-separated such that $\mathcal{N}(w;\mu_{k'},\sigma^2 I)$ is close to zero outside a small vicinity of $\mu_{k'}$. Thus,

$$\mathbb{E}_{w\sim\mathcal{N}(\mu_k,\sigma^2 I)}\big[\log q(w)\big] \approx -\log K + \mathbb{E}_{w\sim\mathcal{N}(\mu_k,\sigma^2 I)}\big[\log \mathcal{N}(w;\mu_k,\sigma^2 I)\big] = c(\sigma),$$

where $c(\sigma)$ is a constant. Thus this term is independent of $\mu_k$. Now taking a first-order Taylor expansion of $\log p(w\mid\mathcal{D})$ around $\mu_k$, we have

$$\log p(w\mid\mathcal{D}) = \log p(\mu_k\mid\mathcal{D}) + \nabla \log p(\mu_k\mid\mathcal{D})^{\top}(w-\mu_k) + O\big(\|w-\mu_k\|^2\big).$$

After taking expectation on $w\sim\mathcal{N}(\mu_k,\sigma^2 I)$, we have $\mathbb{E}\big[\log p(w\mid\mathcal{D})\big] \to \log p(\mu_k\mid\mathcal{D})$ in the small-$\sigma$ limit. As a result, by conditional expectation, the KL-divergence reduces to:

$$\mathrm{KL}\big(q\,\|\,p(\cdot\mid\mathcal{D})\big) = \frac{1}{K}\sum_{k=1}^{K}\big[-\log p(\mathcal{D}\mid\mu_k) - \log p(\mu_k)\big] + c(\sigma), \qquad (8)$$

where $c(\sigma)$ is a function of $\sigma$ and $p(\mu_k)$ is the prior p.d.f. evaluated at $\mu_k$, obtained from Bayes' rule $p(\mu_k\mid\mathcal{D}) \propto p(\mathcal{D}\mid\mu_k)\,p(\mu_k)$.

In (8), $\log p(\mu_k)$ acts as a regularizer on the negative log-likelihood loss $-\log p(\mathcal{D}\mid\mu_k)$. When the prior is Gaussian centered at $0$, the prior term simply becomes $\ell_2$ regularization.
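As a sketch, assuming a zero-mean isotropic Gaussian prior, the per-mode objective in (8) is just the data loss plus $\ell_2$ regularization (the helper name and signature are illustrative):

```python
def mode_objective(nll, mu, prior_std=1.0):
    """Reduced variational objective for a single mode center mu:
    the negative log-likelihood -log p(D | mu), assumed precomputed,
    plus the Gaussian-prior term, which (up to an additive constant
    independent of mu) is plain l2 regularization."""
    l2 = sum(m * m for m in mu) / (2.0 * prior_std ** 2)
    return nll + l2
```

Minimizing (8) thus decouples into K independent, l2-regularized training problems, one per mode.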

9 Proof of Theorems 4.1 and 4.2

In this section, we discuss the training of linear models and of neural networks in the sufficiently overparametrized regime, and prove Theorems 4.1 and 4.2. The section is organized as follows: in Sec. 9.1, we introduce needed notation and prove some helper lemmas. In Sec. 9.2, we describe the dynamics of training a linear model with Fed-ensemble. In Sec. 9.3, we prove global convergence of the training loss of Fed-ensemble, and show that the neural tangent kernel stays fixed during training of Fed-ensemble in the infinite-width limit; this result is a refined version of Theorem 4.1. In Sec. 9.4 we show that the training can indeed be well approximated by a kernel method in the extreme-width limit. Combining Sec. 9.4 with Sec. 9.3, we can then prove Theorem 4.2.

9.1 Notation and Some Lemmas

In this section we introduce some notation, followed by a few helper lemmas.

For an operator $A$, we use $\|A\|_{op}$ to denote its operator norm, i.e., the largest eigenvalue of $(A^{*}A)^{1/2}$. If $A$ is a matrix, $\|A\|_{F}$ is its Frobenius norm. Generally we have $\|A\|_{op} \le \|A\|_{F}$. $\lambda_{\min}(A)$ is the smallest eigenvalue of $A$.

The updates of the different modes in the ensemble described by Algorithm 1 in the main text are decoupled, so we can monitor the training of each mode separately and drop the subscript $k$. We focus on one mode and use $w$ to denote the weight vector of this mode. We define $e$ to be an $N$-dimensional error vector of the training data on all clients; its $n$-th component is $e_n = f(x_n; w) - y_n$. Notice that $e$ contains the datapoints from all clients. Similarly, we can define $e^{(i)}$ to be the error vector of the $i$-th client; $e$ is thus the concatenation of all $e^{(i)}$. Also, we can define a projection matrix $P^{(i)}$ that selects the coordinates of client $i$:

$$e^{(i)} = P^{(i)} e.$$

It is straightforward to verify that $\sum_{i} P^{(i)\top} P^{(i)} = I$. As introduced in the main text, we use $w_t$ to denote the weight vector after $t$ iterations of local gradient descent. Suppose in each communication round every client performs $\tau$ steps of gradient descent; then $w_{q,j}$ naturally denotes the weight vector at the beginning of the $j$-th communication round of the $q$-th age. We use $J(w)$ to denote the gradient matrix, whose $(n, l)$-th element is $\partial f(x_n; w) / \partial w_l$. Similarly, $J^{(i)}(w)$ is the gradient matrix on client $i$, whose $(n, l)$-th element is $\partial f(x^{(i)}_n; w) / \partial w_l$.
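The bookkeeping above can be sketched concretely; client block sizes and helper names are illustrative:

```python
def make_projection(client_sizes, i):
    """Build the 0/1 projection matrix that extracts client i's block
    from the concatenated N-dimensional error vector e, where client
    blocks are laid out consecutively with the given sizes."""
    n_total = sum(client_sizes)
    start = sum(client_sizes[:i])
    rows = client_sizes[i]
    return [[1.0 if c == start + r else 0.0 for c in range(n_total)]
            for r in range(rows)]

def apply(mat, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]
```

With `client_sizes=[2, 3]` and `e=[1, 2, 3, 4, 5]`, the second client's projection extracts the last three coordinates.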

We use $\eta$ to denote the constant stepsize of gradient descent in local client training. For the NTK parametrization, the stepsize should decrease as the network width increases, hence we rewrite $\eta$ as $\eta = \eta_0 / m$, where $\eta_0$ is a new constant independent of the neural network width $m$. We use $S_{q,j}$ to denote the subset of all clients participating in training in the $j$-th communication round of the $q$-th age. We will assume that the number of datapoints on all clients that participate in training in one communication round is the same. This assumption is roughly satisfied in our experiments, and with some modifications of the weighting parameters in Algorithm 1, the equal-datapoint assumption can be relaxed.

The following lemma is proved as Lemma 1 in [ntkgaussian]:

Lemma 9.1.

(Local Lipschitzness of the Jacobian.) Suppose that (i) the training set is consistent: $x_n \ne x_{n'}$ for any $n \ne n'$; (ii) the activation function $\sigma$ of the neural nets satisfies $|\sigma(0)| < \infty$, $\|\sigma'\|_{\infty} < \infty$, and $\sigma'$ is Lipschitz; and (iii) $\|x\|_2 \le 1$ for all inputs $x$. Then there is a constant $K > 0$ such that for every $C > 0$, with high probability over random initialization the following holds:

$$\frac{1}{\sqrt{m}}\,\big\| J(w; x_n) - J(\tilde{w}; x_n) \big\|_2 \le K \|w - \tilde{w}\|_2$$

for every $n = 1, 2, \dots, N$, and also:

$$\frac{1}{\sqrt{m}}\,\big\| J(w) \big\|_F \le K, \quad \text{where } w, \tilde{w} \in B\big(w_0, C m^{-1/2}\big),$$

i.e., for all weights in a ball of radius $C m^{-1/2}$ around the initialization $w_0$.

Lemma 9.1 gives sufficient conditions for the Lipschitz continuity of the Jacobian. In the following derivations, we use the result of Lemma 9.1 directly.

In the proof, we extensively use Taylor expansions in the product of the local stepsize and the number of local training epochs, $\eta_0 \tau$. As we will show, the first-order term yields the desired result. We bound the second-order term and neglect higher-order terms by incorporating them into an $O(\eta_0^2 \tau^2)$ residual. We abuse this notation a little: sometimes $O(\eta_0^2 \tau^2)$ denotes a constant containing orders higher than $\eta_0 \tau$, sometimes a vector whose norm is $O(\eta_0^2 \tau^2)$, and sometimes an operator whose operator norm is $O(\eta_0^2 \tau^2)$. The exact meaning should be clear from context. The Taylor expansion is valid when $\eta_0 \tau$ is small enough, and extensive experimental results confirm its adequacy.

Lemma 9.2.

For $K$ operators $A_1$, $A_2$, …, $A_K$, if their operator norms are bounded by $\epsilon_1, \epsilon_2, \dots, \epsilon_K$, respectively, we have:

$$e^{A_1} e^{A_2} \cdots e^{A_K} = \exp\Big(\sum_{k=1}^{K} A_k\Big) + \frac{1}{2} \sum_{k < k'} [A_k, A_{k'}] + O\Big(\big(\textstyle\sum_{k} \epsilon_k\big)^3\Big),$$

where $[A, B]$ is defined as the commutator of operators $A$ and $B$: $[A, B] = AB - BA$. Here $O\big((\sum_k \epsilon_k)^3\big)$ denotes some operator whose operator norm is of that order.

Lemma 9.2 connects a product of exponentials to the exponential of a sum. It is an extended version of the Baker-Campbell-Hausdorff formula; we nevertheless give a simple proof at the end of this section for completeness.

Lemma 9.3 bounds the difference between the average of exponentials and the exponential of the average:

Lemma 9.3.

For $K$ operators $A_1$, $A_2$, …, $A_K$ whose operator norms are bounded by $\epsilon$, and $K$ non-negative values $p_1, \dots, p_K$ satisfying $\sum_{k} p_k = 1$, we have:

$$\sum_{k=1}^{K} p_k e^{A_k} = \exp\Big(\sum_{k=1}^{K} p_k A_k\Big) + O(\epsilon^2).$$

The proof is straightforward: Taylor-expand both sides and retain all but the lowest-order terms in the residual. For an operator $A$, its exponential is defined as:

$$e^{A} = \sum_{n=0}^{\infty} \frac{A^n}{n!} = I + A + R(A), \qquad (11)$$

where $R(A)$ is defined as:

$$R(A) = \sum_{n=2}^{\infty} \frac{A^n}{n!}. \qquad (12)$$

When $A$'s operator norm is bounded by $\epsilon$, we can also bound the operator norm of $R(A)$ as:

$$\|R(A)\|_{op} \le \sum_{n=2}^{\infty} \frac{\epsilon^n}{n!} = e^{\epsilon} - 1 - \epsilon = O(\epsilon^2).$$

Then we can rewrite the two sides of the equation using $R(\cdot)$ as below:

$$\sum_{k} p_k e^{A_k} = I + \sum_{k} p_k A_k + \sum_{k} p_k R(A_k), \qquad \exp\Big(\sum_{k} p_k A_k\Big) = I + \sum_{k} p_k A_k + R\Big(\sum_{k} p_k A_k\Big).$$

Since we have shown the $R(\cdot)$ terms are bounded, the last term on each side is $O(\epsilon^2)$, and the claim follows.
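Lemma 9.3 can be sanity-checked numerically. The sketch below uses a truncated-Taylor matrix exponential and two illustrative 2x2 operators scaled by epsilon (not operators from the paper); the gap between the average of exponentials and the exponential of the average should shrink quadratically in epsilon:

```python
def mat_mul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_comb(A, B, sa=1.0, sb=1.0):
    """Linear combination sa*A + sb*B."""
    return [[sa * a + sb * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def expm(A, terms=30):
    """Matrix exponential via its rapidly converging Taylor series."""
    n = len(A)
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in out]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, A)]
        out = mat_comb(out, term)
    return out

def lemma93_gap(eps, p=0.5):
    """Entrywise gap between p*e^{A1} + (1-p)*e^{A2} and e^{p*A1 + (1-p)*A2}."""
    A1 = [[0.0, eps], [0.0, 0.0]]
    A2 = [[0.0, 0.0], [eps, 0.0]]
    lhs = mat_comb(expm(A1), expm(A2), p, 1.0 - p)
    rhs = expm(mat_comb(A1, A2, p, 1.0 - p))
    return max(abs(a - b) for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
```

Halving epsilon shrinks the gap by roughly a factor of 4, consistent with the $O(\epsilon^2)$ bound.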

Proof of Lemma 9.2

We adopt the notation from (11) and (12) in the proof of Lemma 9.3, and write $\epsilon = \sum_{k} \epsilon_k$. Then we have:

$$\prod_{k=1}^{K} e^{A_k} = \prod_{k=1}^{K} \big(I + A_k + R(A_k)\big) = I + \sum_{k} A_k + \sum_{k < k'} A_k A_{k'} + \sum_{k} R(A_k) + O(\epsilon^3),$$

while

$$\exp\Big(\sum_{k} A_k\Big) = I + \sum_{k} A_k + \frac{1}{2} \Big(\sum_{k} A_k\Big)^2 + O(\epsilon^3) = I + \sum_{k} A_k + \frac{1}{2} \sum_{k} A_k^2 + \frac{1}{2} \sum_{k \ne k'} A_k A_{k'} + O(\epsilon^3).$$

Since $R(A_k) = \frac{1}{2} A_k^2 + O(\epsilon_k^3)$, subtracting the two expansions gives

$$\prod_{k=1}^{K} e^{A_k} - \exp\Big(\sum_{k} A_k\Big) = \sum_{k < k'} A_k A_{k'} - \frac{1}{2} \sum_{k \ne k'} A_k A_{k'} + O(\epsilon^3) = \frac{1}{2} \sum_{k < k'} [A_k, A_{k'}] + O(\epsilon^3),$$

which proves the lemma.
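The commutator correction in Lemma 9.2 can also be checked numerically on a pair of illustrative nilpotent 2x2 operators, for which $e^{A} = I + A$ exactly and $e^{A_1 + A_2}$ has a closed form (the matrices are our choice, not from the paper):

```python
import math

def exp_offdiag(a, b):
    """Closed-form exponential of [[0, a], [b, 0]], using A^2 = a*b*I."""
    s = math.sqrt(a * b)
    c = math.cosh(s)
    sh = math.sinh(s) / s if s else 1.0
    return [[c, sh * a], [sh * b, c]]

def lemma92_residual(eps, corrected=True):
    """Residual of Lemma 9.2 for A1 = [[0, eps], [0, 0]], A2 = [[0, 0], [eps, 0]].
    Both are nilpotent, so e^{A1} e^{A2} = (I + A1)(I + A2) exactly."""
    prod = [[1.0 + eps * eps, eps], [eps, 1.0]]   # (I + A1)(I + A2)
    rhs = exp_offdiag(eps, eps)                   # e^{A1 + A2}
    if corrected:
        # Commutator term: 0.5*[A1, A2] = [[eps^2/2, 0], [0, -eps^2/2]].
        rhs = [[rhs[0][0] + eps * eps / 2, rhs[0][1]],
               [rhs[1][0], rhs[1][1] - eps * eps / 2]]
    return max(abs(p - r) for rp, rr in zip(prod, rhs) for p, r in zip(rp, rr))
```

Without the correction the residual scales like $\epsilon^2$; with it, like $\epsilon^3$ (doubling epsilon multiplies the residual by roughly 8).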

9.2 Training of Linearized Neural Networks

The derivation in this section is inspired by the kernel regression model from [ntk]. Here $\kappa$ can be any positive definite symmetric kernel; later we will specify it to be the neural tangent kernel. We define the Gram operator of client $i$ to be a map from a function $f$ into a function:

$$\big(T^{(i)} f\big)(x) = \frac{1}{N} \sum_{n \in \text{client } i} \kappa(x, x_n)\, f(x_n),$$

and also, the Gram operator of the entire dataset is defined as:

$$\big(T f\big)(x) = \frac{1}{N} \sum_{n=1}^{N} \kappa(x, x_n)\, f(x_n) = \sum_{i} \big(T^{(i)} f\big)(x).$$

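A sketch of the empirical Gram operator acting on function values at the training points; the RBF kernel is an illustrative stand-in for the NTK, and the 1/N normalization follows the convention above:

```python
import math

def gram_apply(kernel, xs, fvals, subset=None):
    """Apply the empirical Gram operator (T f)(x) = (1/N) * sum_j k(x, x_j) f(x_j)
    at the training points xs. Summing only over `subset` (one client's
    indices) gives the per-client operator T_i; note the normalization
    stays 1/N, the full dataset size."""
    n = len(xs)
    idx = range(n) if subset is None else subset
    return [sum(kernel(x, xs[j]) * fvals[j] for j in idx) / n for x in xs]

def rbf(x, y, gamma=1.0):
    """Illustrative positive definite kernel (RBF), standing in for the NTK."""
    return math.exp(-gamma * (x - y) ** 2)
```

Summing the per-client operators over a partition of the indices recovers the full-dataset operator, matching $T = \sum_i T^{(i)}$.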
At the beginning of the $j$-th communication round of age $q$, the server model $f_{q,j}$ is sent to all clients in some selected strata. As discussed before, we use $S_{q,j}$ to denote the set of clients that download this mode in the $j$-th communication round of age $q$. Since client $i$ uses gradient descent to minimize its local objective, we approximate the dynamics with gradient flow:

$$\frac{d w^{(i)}_t}{d t} = -\eta\, J^{(i)}\big(w^{(i)}_t\big)^{\top} e^{(i)}_t,$$

where $w^{(i)}_t$ is the local weight vector on client $i$ starting from $w^{(i)}_0 = w_{q,j}$.

Since the kernel does not evolve over time, after $\tau$ epochs, the updated function on client $i$ is:

$$f^{(i)}_{\tau} - f^{*} = e^{-\eta_0 \tau T^{(i)}} \big(f_{q,j} - f^{*}\big),$$

where $f^{*}$ is the ground-truth function satisfying $f^{*}(x_n) = y_n$ for every $n$, i.e., it achieves zero training error on every client. After $\tau$ epochs, all clients send their updated models back to the server, and the server uses the average $\frac{1}{|S_{q,j}|} \sum_{i \in S_{q,j}} w^{(i)}_{\tau}$ to calculate the new $w_{q,j+1}$. Since we are training a linear model:

$$f_{q,j+1} - f^{*} = \frac{1}{|S_{q,j}|} \sum_{i \in S_{q,j}} \big(f^{(i)}_{\tau} - f^{*}\big).$$

Plugging in the equation of $f^{(i)}_{\tau}$, we can use Lemma 9.3 to derive: