Federated Learning (FL) is a form of distributed learning where training is done by edge devices (also called clients) using their local data, without sharing that data with the central coordinator (also known as the server) or with other devices. The only information shared between the clients and the server is the update to the global model after local client training, which helps preserve the privacy of each client's local dataset. Additionally, training on the edge devices does not require datasets to be communicated from the clients to the server, lowering communication costs. The single-model federated learning problem has been widely studied from various perspectives, including optimizing the frequency of communication between the clients and the server , designing client selection algorithms for the setting where only a subset of devices trains the model in each iteration , and measuring the effect of partial device participation on the convergence rate .
Closest to this work, in , the possibility of using federated learning to train multiple models simultaneously using the same set of edge devices has been explored; we refer to this as multi-model FL. In the setting considered in , each device can train only one model at a time. The algorithmic challenge is to determine which model each client trains in any given round. Simulations illustrate that multiple models can indeed be trained simultaneously without a sharp drop in accuracy by splitting the clients into subsets in each round and using one subset to train each of the models. One key limitation of  is the lack of analytical performance guarantees for the proposed algorithms. In this work, we address this limitation of  by proposing algorithms for multi-model FL with provable performance guarantees.
1.1 Our Contributions
In this work, we focus on the task of training models simultaneously using a shared set of clients. We propose two variants of the FedAvg algorithm  for the multi-model setting, with provable convergence guarantees. The first variant, called Multi-FedAvg-Random (MFA-Rand), partitions clients into groups in each round and matches the groups to the models in a uniformly randomized fashion. Under the second variant, called Multi-FedAvg-Round Robin (MFA-RR), time is divided into frames such that each frame consists of rounds. At the beginning of each frame, clients are partitioned into groups uniformly at random. Each group then trains the models in the rounds of the frame in a round-robin manner.
The error of a candidate multi-model federated learning algorithm  for each model is defined as the distance between the model's global weights at round and its minimizer. Our analytical results show that when the objective function is strongly convex and smooth, an upper bound on the error of MFA-Rand decays as , while the error of MFA-RR decays as . The latter result holds when local stochastic gradient descent at the clients transitions to full gradient descent over time, and allows MFA-RR to be considerably faster than FedCluster, an algorithm similar to MFA-RR.
Further, we study the performance gain in multi-model FL over single-model training under MFA-Rand. We show via theoretical analysis that training multiple models simultaneously can be more efficient than training them sequentially, i.e., training only one model at a time on the same set of clients.
Via synthetic simulations, we show that MFA-RR typically outperforms MFA-Rand. Intuitively, this is because under MFA-RR each client trains every model in each frame, while this is not necessarily the case under MFA-Rand. Our data-driven experiments show that training multiple models simultaneously is advantageous over training them sequentially.
1.2 Related Work
The federated learning framework and the FedAvg algorithm were first introduced in . The convergence of FedAvg has been studied extensively under convexity assumptions [15, 19, 27] and in non-convex settings [15, 20]. When data is homogeneous across clients and all clients participate in each round, FedAvg is known as LocalSGD, which is easier to analyse because of these assumptions. LocalSGD has been proved to converge with strictly less communication in , and to converge in non-convex settings [8, 28, 25].
FedProx  handles data heterogeneity among clients and converges in the setting where the data across clients is non-iid. The key difference between FedAvg and FedProx is that the latter adds an additional proximal term to the loss function. However, the analysis of FedProx requires the proximal term to be present and, therefore, does not serve as a proof of convergence for FedAvg.
Personalised FL is an area where multiple local models are trained instead of a single global model. The purpose of having multiple models is for the local models to reflect the specific characteristics of their local data. The works [13, 9, 1] use a combination of global and local models. Although there are multiple models, the underlying objective is the same; this differentiates personalised FL from our setting, which involves multiple unrelated models.
Clustered Federated Learning, proposed in , alternates between performing FedAvg and bi-partitioning the client set until convergence is reached. The idea is to first perform vanilla FL, then recursively bi-partition the set of clients into two clusters and train within the clusters until a more refined stopping criterion is met; the second step is a post-processing technique.  studies distributed SGD in heterogeneous networks, where distributed SGD is performed within sub-networks in a decentralized manner, and model averaging happens periodically among neighbouring sub-networks.
FedCluster, proposed in , is very similar to the second algorithm we propose in this work. In , convergence is guaranteed in a non-convex setting. However, the advantages of a convex loss function and the effect of the data sample size are not explored in that work. Further, the clusters there are fixed throughout the learning process.
In , client distribution among models is approached as an optimization problem and a heuristic algorithm is proposed. However, neither an optimal algorithm nor any convergence guarantees are provided.
We consider the setting where models are trained simultaneously in a federated fashion. The server maintains a global version of each of the models, and the clients have local datasets (possibly heterogeneous) for every model. The total number of clients available for training is denoted by . We consider full device participation, with the set of clients divided equally among the models in each round. In every round, each client is assigned exactly one model to train. Each client receives the current global weights of the model it is assigned and performs iterations of stochastic gradient descent locally at a fixed learning rate of (which can change across rounds indexed by ) using the local dataset. It then sends the local update for its assigned model back to the server. The server then averages the received weight updates and uses the result as the global model for the next round.
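To make the round structure concrete, the following is a minimal sketch of one communication round. The helper names are hypothetical, and a least-squares local loss stands in for the paper's generic local objective:

```python
import numpy as np

def run_round(global_models, client_data, assignment, local_steps, lr, rng):
    """One communication round of multi-model FL (illustrative sketch).

    global_models: dict model_id -> global weight vector
    client_data:   dict client_id -> dict model_id -> (X, y) local dataset
    assignment:    dict client_id -> model_id (one model per client per round)
    """
    updates = {m: [] for m in global_models}
    for client, model_id in assignment.items():
        w = global_models[model_id].copy()    # client receives global weights
        X, y = client_data[client][model_id]
        for _ in range(local_steps):
            i = rng.integers(len(y))              # one-sample SGD step
            grad = 2 * (X[i] @ w - y[i]) * X[i]   # least-squares gradient
            w = w - lr * grad
        updates[model_id].append(w)           # send local update to the server
    # server averages the local weights received for each model
    return {m: np.mean(ws, axis=0) for m, ws in updates.items()}
```

This sketch assumes every model receives at least one client in the round, consistent with the equal division of clients among models described above.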
|Notation|Description|
|---|---|
||Number of models|
||Set of all clients|
||Total number of clients =|
||Client selection algorithm|
|trainModel|Model assigned to client|
|globalModel|Global weights of model|
|localModel|Weights of model of client|
The global loss function of the model is denoted by , and the local loss function of the model for the client is denoted by . For simplicity, we assume that, for every model, each client has the same number of data samples, . We therefore have the following global objective function for model ,
Assumption 1: All are -strongly convex.
Assumption 2: All are -smooth.
Assumption 3: Stochastic gradients are bounded: .
Assumption 4: For each client k, , where is the loss function of the data point of the client's model. Then, for each ,
for some constants and .
2.1 Performance Metric
The error of a candidate multi-model federated learning algorithm  for a model is the distance between the model's global weights at round , denoted by , and the minimizer of , denoted by . Formally,
We consider two variants of the Multi-FedAvg algorithm proposed in . For convenience, we assume that is an integer multiple of .
3.1 Multi-FedAvg-Random (MFA-Rand)
The first variant partitions the set of clients into equal-sized subsets in every round . The subsets are created uniformly at random, independent of all past and future choices. The subsets are then matched to the models, with the matching chosen uniformly at random.
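This per-round assignment can be sketched as follows (hypothetical function name; the shuffle produces the uniform partition and the permutation produces the uniform matching):

```python
import numpy as np

def mfa_rand_assignment(clients, num_models, rng):
    """MFA-Rand: each round, partition clients into equal-sized subsets
    uniformly at random, then match subsets to models with a uniformly
    random permutation."""
    clients = np.array(clients)
    rng.shuffle(clients)                       # uniform random partition
    groups = np.split(clients, num_models)     # requires len % num_models == 0
    matching = rng.permutation(num_models)     # random subset-to-model matching
    return {int(m): list(groups[i]) for i, m in enumerate(matching)}
```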
|Round|Model 1|Model 2|Model 3|
3.2 Multi-FedAvg-Round Robin (MFA-RR)
The second variant partitions the set of clients into equal-sized subsets once every rounds. The subsets are created uniformly at random, independent of all past and future choices. We refer to the block of rounds during which the partition remains unchanged as a frame. Within each frame, each of the subsets is mapped to each model exactly once in a round-robin manner. Specifically, let the subsets created at the beginning of frame be denoted by for . Then, in the round of frame , for , is matched to model .
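The frame-wise schedule can be sketched as follows (hypothetical names; mapping subset i to model (i + r) mod M in round r is one concrete realization of the round-robin rule above):

```python
import numpy as np

def mfa_rr_schedule(clients, num_models, rng):
    """MFA-RR: partition the clients once per frame; within the frame of
    num_models rounds, subset i trains model (i + r) % num_models in
    round r, so each subset meets each model exactly once."""
    clients = np.array(clients)
    rng.shuffle(clients)                       # uniform partition, once per frame
    groups = np.split(clients, num_models)     # requires len % num_models == 0
    frame = []
    for r in range(num_models):
        frame.append({(i + r) % num_models: list(groups[i])
                      for i in range(num_models)})
    return frame  # list of per-round {model: clients} assignments
```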
|Round|Model 1|Model 2|Model 3|
An important difference between the two algorithms is that under MFA-RR, each client-model pair is used exactly once in each frame consisting of rounds, whereas under MFA-Rand, a specific client-model pair is not matched in consecutive time-slots with probability .
4 Convergence of MFA-Rand and MFA-RR
In this section, we present our analytical results for MFA-Rand and MFA-RR.
The result follows from the convergence of FedAvg with partial device participation in . Note that and influence the lower bound on the number of iterations, , required to reach a given accuracy. Increasing the number of models, , increases , which increases .
Here, we require that the SGD iterations at the clients use data samples whose size converges sub-linearly to the full dataset size. For convergence, the only requirement is that be proportional to the above-defined .
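As one hypothetical instance of such a schedule (not the specific one used in our analysis), a mini-batch growing like the square root of the round index eventually reaches the full local dataset, at which point local SGD becomes full gradient descent:

```python
import math

def sample_size(t, full_size, b0=4):
    """Hypothetical schedule: the mini-batch grows like sqrt(t) until it
    reaches the full local dataset, so client-side SGD transitions to
    full gradient descent over time. b0 is an assumed base batch size."""
    return min(full_size, b0 * math.isqrt(t + 1))
```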
We observe that , and influence the lower bound on the number of iterations, , required to reach a given accuracy. Increasing the number of models, , increases , which increases .
A special case of Theorem 4.2 is when we employ full gradient descent instead of SGD at clients. Naturally, in this case, .
MFA-RR, when viewed from the perspective of one of the models, is very similar to FedCluster. However, there are some key differences between the analytical results.
First, FedCluster assumes that SGD at each client has a fixed variance bound for any sample size (along with Assumption 3). This differs from our Assumption 4. When Assumption 4 is coupled with Assumption 3, we get a sample-size-dependent bound on the variance; a smaller variance is naturally expected for a larger data sample. Therefore, our assumption is less restrictive.
Second, the effect of an increasing sample size (or the full sample) is not studied in . We also examine the effect of strong convexity on the speed of convergence. The convergence result from  is as follows:
If we apply the strong convexity assumption to this result and use that , we get
Applying the Cauchy–Schwarz inequality on the LHS, we get
Finally, we have
for any sampling strategy. With an increasing sample size (or full sample size), we can obtain convergence. This is a significant improvement over the convergence result of FedCluster.
5 Advantage of Multi-Model FL over single model FL
We quantify the advantage of multi-model FL over single-model FL by defining the gain of a candidate multi-model federated learning algorithm over FedAvg , which trains only one model at a time.
Let be the number of rounds needed by one of the models using FedAvg to reach an accuracy level (distance of the model's current weights from its minimizer) of . We assume that all models are of similar complexity, so we expect each model to reach the required accuracy in roughly the same number of rounds. Therefore, the cumulative number of rounds needed to ensure that all models reach an accuracy level of using FedAvg is . Further, let the number of rounds needed to reach an accuracy level of for all models under be denoted by . We define the gain of algorithm for a given as
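The gain defined above reduces to a one-line computation (hypothetical function name): sequential single-model training costs the number of models times the per-model round count, and this is divided by the rounds the multi-model algorithm needs for all models to reach the same accuracy.

```python
def gain(rounds_single_model, num_models, rounds_multi):
    """Gain of a multi-model algorithm over sequential FedAvg:
    cumulative rounds of sequential single-model FedAvg divided by the
    rounds the multi-model algorithm needs for the same accuracy."""
    return num_models * rounds_single_model / rounds_multi
```

For instance, if single-model FedAvg needs 100 rounds per model and a 3-model algorithm reaches the same accuracy for all models in 150 rounds, the gain is 2.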
Note that FedAvg and use the same number of clients in each round, thus the comparison is fair. Further, we use the bounds in Theorem 4.1 and Theorem 4.2 as proxies for calculating for MFA-Rand and MFA-RR respectively.
When and , the following holds for the gain of -model MFA-Rand over running FedAvg times
for all and for when .
We find that the gain increases with up to , after which we have the case. At , each model is trained by only one client, which is too few, especially when is large.
For the case of , Theorem 5.1 provides a lower bound on . However, this case is rarely used in practice: one of the main advantages of FL is the lower communication cost due to local training, and this benefit is not utilised when .
6 Simulations in strongly convex setting
6.1 Simulation Framework
We take inspiration from the framework presented in , where a quadratic loss function, which is strongly convex, is minimized in a federated setting. We run the MFA-Rand and MFA-RR algorithms and compare their performance in this strongly convex setting. In addition, we measure the gain of MFA-Rand and MFA-RR over sequentially running FedAvg in this setting.
The global loss function is
where , , and . We have,
Here the pair represents the client. Following is the definition of
where is a matrix whose element is 1 and all other elements are 0. is defined as follows:
and is defined as follows
We finally define the local loss function of the client as
which satisfies our problem statement .
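The following is only a generic stand-in for this setup (not the exact matrices above): each client holds a strongly convex quadratic local loss, the global loss is their average, and the global minimizer is available in closed form, which is what lets us track the exact distance to the optimum in the simulations.

```python
import numpy as np

def make_quadratic_problem(num_clients, dim, rng):
    """Illustrative stand-in: client k holds a local loss
    f_k(w) = 0.5 * w^T A_k w - b_k^T w, and the global loss is their
    average. Adding 0.5*I to each A_k makes every f_k strongly convex."""
    A, b = [], []
    for _ in range(num_clients):
        M = rng.normal(size=(dim, dim))
        A.append(M.T @ M / dim + 0.5 * np.eye(dim))  # PSD + mu*I, mu = 0.5
        b.append(rng.normal(size=dim))
    A_bar, b_bar = np.mean(A, axis=0), np.mean(b, axis=0)
    w_star = np.linalg.solve(A_bar, b_bar)           # global minimizer
    return A, b, w_star
```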
Although we are studying multi-model learning, we simulate the minimization of a single global loss function, because these simulations are carried out from the perspective of one of the models. From this perspective, -model MFA-Rand boils down to sampling clients independently every round, while -model MFA-RR remains the same (going over the subsets in frame ). Furthermore, the gain simulations here assume that all models are of the same complexity.
6.2 Comparison of MFA-Rand and MFA-RR
We consider the scenario where , and . We take , meaning 5 local SGD iterations at clients. We track the log distance of the current global loss from the global loss minimum, that is
for 1000 rounds. We consider both constant and decaying learning rate for and .
As one can observe in Fig. 0(a) and Fig. 1(a), MFA-Rand and MFA-RR have similar mean performance. However, Fig. 0(b) and Fig. 1(b) reveal that the randomness in MFA-Rand is considerably higher than that in MFA-RR, showing the latter to be more reliable.
It is evident from Fig. 2(a) and Fig. 3(a) that MFA-RR, on average, performs better than MFA-Rand when is high. Again, Fig. 2(b) and Fig. 3(b) show that MFA-Rand has considerably higher variance than MFA-RR.
In this set of simulations, each client performs full gradient descent. While the analytical upper bounds on the errors suggest an order-wise difference between MFA-RR and MFA-Rand, we do not observe as significant a difference between the two algorithms. This is likely because our analysis of MFA-RR exploits the fact that each client performs full gradient descent, while the analysis of MFA-Rand adapted from  does not.
6.3 Gain vs
We test with clients for for and . The gain vs plots in Fig. 5 show that the gain increases with for both MFA-Rand and MFA-RR.
7 Data Driven Experiments on Gain
We use the following datasets for these experiments. The learning task in Synthetic(1,1) is multi-class logistic regression classification of feature vectors. Synthetic(1,1)-A involves 60-dimensional feature vectors classified into 5 classes, while Synthetic(1,1)-B involves 30-dimensional feature vectors classified into 10 classes. The CelebA dataset involves binary classification of face images based on a certain facial attribute (for example, blond hair, smiling, etc.) using convolutional neural networks (CNNs). The dataset has many options for the facial attribute.
The multi-model FL framework for training multiple unrelated models simultaneously was first introduced in our previous work . We use the same framework for these experiments. We first find the gain vs trend for Synthetic(1,1)-A, Synthetic(1,1)-B and CelebA. Then, we simulate real-world scenarios where each of the models is a different learning task.
7.1 Gain vs
Here, instead of assigning a different task to each model, all models share the same underlying task. The framework, however, treats the models as independent of each other. This ensures that the models are of equal complexity.
We plot gain vs for two kinds of scenarios. First, when all clients are used in a round; Theorem 5.1 assumes this scenario, and we call it full device participation. Second, when only a sample of the entire set of clients is selected to be used in the round (and then distributed among the models); we call this partial device participation, as a client has a non-zero probability of being idle during a round.
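The partial-participation sampling step can be sketched as follows (hypothetical names; the sampled clients are split equally among the models, and the rest stay idle for the round):

```python
import numpy as np

def partial_participation_assignment(clients, num_models, sample_size, rng):
    """Sample a subset of clients for the round (without replacement),
    then split the sample equally among the models; each sampled client
    trains exactly one model, and unsampled clients stay idle."""
    sampled = rng.choice(np.array(clients), size=sample_size, replace=False)
    groups = np.split(sampled, num_models)   # needs sample_size % num_models == 0
    return {m: list(groups[m]) for m in range(num_models)}
```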
7.1.1 Full device participation:
For Synthetic(1,1)-A, we have clients and . During single model FedAvg, we get 51% training accuracy and 51% test accuracy at the end of 70 rounds.
For Synthetic(1,1)-B, we have clients and . During single-model FedAvg, we get 42.7% training accuracy and 42.8% test accuracy at the end of 70 rounds.
For CelebA, we have 96 clients and . We get 79% training accuracy and 68% test accuracy at the end of 75 rounds of single model FedAvg.
For full device participation, we observe from Fig. 6 that the gain increases with for both the Synthetic(1,1) and CelebA datasets, with the trend being sub-linear in nature.
7.1.2 Partial device participation:
For Synthetic(1,1)-A, we have clients (out of which 32 are sampled every round) and . During single model FedAvg, we get 61.1% training accuracy and 61.3% test accuracy at the end of 200 rounds.
For Synthetic(1,1)-B, we have clients (out of which 32 are sampled every round) and . During single model FedAvg, we get 58.4% training accuracy and 57.7% test accuracy at the end of 200 rounds.
For CelebA, we have 96 clients (out of which 24 are sampled every round) and . We get 78% training accuracy and 71.5% test accuracy at the end of 75 rounds of single model FedAvg.
When there is partial device participation, for both datasets, we observe in Fig. 7 that the gain increases with for the most part, while decreasing at some instances. Although there are some dips, the gain is always found to be more than 1.
It is important to note that the learning task for the CelebA dataset involves CNNs, making the problem non-convex. This, however, does not severely impact the gain, which still always increases with under full device participation.
Although Theorem 5.1 assumes full device participation, we see the benefit of multi-model FL in the partial device participation scenario as well. For all three datasets, the gain is always significantly greater than 1.
7.2 Real-world Scenarios
We consider two types of real-world examples, one involving models that are similar in some aspect and the other involving completely different models. In these experiments, denotes the number of rounds for which single-model FedAvg was run for each model. Further, denotes the number of rounds of multi-model FL after which each model reaches an accuracy that is at least that achieved with rounds of single-model FedAvg.
7.2.1 Similar models:
Based on the values of and from Table 4, we have the following for the training and testing cases.
Gain in training =
Gain in testing =
|Facial attribute|Train Accuracy|Test Accuracy|
7.2.2 Completely different models:
In the second, we perform a mixed-model experiment where one model is logistic regression (Synthetic(1,1) with 60-dimensional vectors classified into 5 classes) and the other is a CNN (binary classification of face images based on the presence of eyeglasses).
Based on and from Table 5, we get the following values of gain for the training and testing cases:
Gain in training =
Gain in testing =
|Model|Train Accuracy|Test Accuracy|
In this work, we focus on the problem of using federated learning to train multiple independent models simultaneously using a shared pool of clients. We propose two variants of the widely studied FedAvg algorithm for the multi-model setting, called MFA-Rand and MFA-RR, and show their convergence. In the case of MFA-RR, we show that an increasing data sample size (for client-side SGD iterations) greatly improves the speed of convergence .
Further, we propose a performance metric to assess the advantage of multi-model FL over single-model FL. We characterize conditions under which running MFA-Rand for
models simultaneously is advantageous over running single-model FedAvg for each model sequentially. We perform experiments in strongly convex and convex settings to corroborate our analytical results. By running experiments in a non-convex setting, we see the benefits of multi-model FL in deep learning. We also run experiments beyond the scope of the proposed setting, namely the partial device participation experiments and the real-world scenarios; here too we see an advantage in training multiple models simultaneously.
Further extensions to this work include theoretical analysis of partial device participation scenarios, and convergence guarantees, if any, for unbiased client selection algorithms  in multi-model FL.
-  (2020) Federated residual learning. arXiv preprint arXiv:2003.12880. Cited by: §1.2.
-  (1996) Neuro-dynamic programming. Athena Scientific. Cited by: §2.
-  (2022) Multi-model federated learning. In 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 779–783. Cited by: §1.2, §1, §3, §7, §8.
-  (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: §7.
-  (2020) Multi-level local sgd for heterogeneous hierarchical networks. arXiv preprint arXiv:2007.13819. Cited by: §1.2.
-  (2020) FedCluster: boosting the convergence of federated learning via cluster-cycling. In 2020 IEEE International Conference on Big Data (Big Data), Vol. , pp. 5017–5026. External Links: Cited by: §1.1, §1.2, Remark 2.
-  (2020) Client selection in federated learning: convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243. Cited by: §2.
-  Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD Thesis. Cited by: §1.2.
-  (2020) Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461. Cited by: §1.2.
-  Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp. 1764–1773. Cited by: §1.2.
-  (2012) Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing 34 (3), pp. A1380–A1405. Cited by: §2.
-  (2019) Convergence rate of incremental gradient and incremental newton methods. SIAM Journal on Optimization 29 (4), pp. 2542–2565. Cited by: §0.A.3.3, §0.A.3.
-  (2020) Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516. Cited by: §1.2.
-  (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. Cited by: §1.
-  (2020) A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. Cited by: §1.2.
-  (2021) An efficient multi-model training algorithm for federated learning. In 2021 IEEE Global Communications Conference (GLOBECOM), Vol. , pp. 1–6. External Links: Cited by: §1.2.
-  (2020) Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, pp. 429–450. Cited by: §1.2, §7.
-  (2019) Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497. Cited by: §7.
-  (2019) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189. Cited by: §0.A.1.1, §0.A.2, §1.2, §1, §2, §4, §6.1, Remark 4.
-  (2019) Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126 5. Cited by: §1.2.
-  Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738. Cited by: §7.
-  (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §1.1, §1.2, §5.
-  Clustered federated learning: model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems 32 (8), pp. 3710–3722. Cited by: §1.2.
-  (2018) Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767. Cited by: §1.2.
-  (2018) Cooperative sgd: a unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576. Cited by: §1.2.
-  (2019) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1, pp. 212–229. External Links: Cited by: §1.
-  (2020) Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems 33, pp. 6281–6292. Cited by: §1.2.
-  (2017) On the convergence properties of a -step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012. Cited by: §1.2.
Appendix 0.A Appendix
0.a.1 Method of Analysis
We analyze the multi-model algorithms from the perspective of one of the models. Proving convergence for one of the models is sufficient, as Assumptions 1, 2, 3 and 4 hold for all the models. To that end, below we describe MFA-Rand and MFA-RR from the perspective of one of the models.
In addition, we drop the time index (or frame index) of the set of subsets in the analysis of MFA-Rand (or MFA-RR), because we analyse MFA-Rand over one round and MFA-RR over one frame, during which the set of subsets remains fixed.
From the perspective of one of the models, MFA-Rand is equivalent to sampling clients out of clients every round. We can, therefore, refer to the analysis of single-model FedAvg with partial device participation in .
A model goes over each of the client-subsets (created at the start of ) exactly once during to . This means that a model goes over each one of exactly once during that frame.
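From one model's perspective, a frame of MFA-RR can thus be sketched as follows (illustrative helper names; full-batch local gradient descent is assumed, as in the special case where SGD has transitioned to full GD):

```python
import numpy as np

def frame_update(w, subsets, grad_fn, local_steps, lr):
    """One MFA-RR frame from a single model's perspective: in each round
    of the frame a different client subset runs full-batch GD locally,
    and the server averages the returned weights. This is a cyclic,
    FedCluster-like pass over all clients within the frame."""
    for subset in subsets:                # one round per subset
        local_ws = []
        for client in subset:
            w_k = w.copy()
            for _ in range(local_steps):
                w_k = w_k - lr * grad_fn(client, w_k)  # full local gradient
            local_ws.append(w_k)
        w = np.mean(local_ws, axis=0)     # server average for this round
    return w
```

With strongly convex local losses, each frame contracts the distance to the global minimizer, which is the behaviour the convergence analysis below tracks.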
0.a.2 Convergence of MFA-Rand
In , the convergence of FedAvg for partial device participation is stated as,
where is the upper bound on the variance of for any size of . We can then get its value by setting in Lemma 1.
Here, is the number of clients selected per round. Since we have , we substitute this in the expression for in , giving us
One important thing to note is that is constant during the iterations in round . However, has a decreasing step size even during the iterations. This is why a factor of 4 is absent in inequality 14 when compared to its counterpart in .
Using the Cauchy–Schwarz inequality, we have
We therefore have,
0.a.3 Convergence of MFA-RR
We start by introducing some new notation used only in this proof. We first drop the subscript indicating the model number, as we need to prove convergence for only one of the models. Below are the notations used frequently in the proof; some are adopted from .
: frame number
: round in current frame
Local iteration: stochastic gradient descent iteration at local device
Global iteration: stochastic centralized gradient descent iteration (virtual)
: subset of clients to be used in the round of a frame (this may differ across frames but we analyse MFA-RR over a single frame and hence, do not index it by frame number)
: global weight (subscript dropped) at
: global weight vector (virtual) of centralized full GD iteration from
: local weight vector of local SGD iteration of client. Since a client is used by a model exactly once in a frame, there is no need to index this variable by the round number.
: learning rate for all rounds in frame.
The local update rule is
where . Therefore, the update at the client is
The global update rule involves summing the weight updates from the clients in and multiplying the sum by a factor of .
Over one frame, the above expression evaluates to
Now we will compare this with global iterations of centralized GD,
We define error as,
And we define as
We now track the expected distance between and . Subtracting on both sides of the above equation and taking expectation of norm, we get