1 Introduction
Federated Learning (FL) is a form of distributed learning in which training is done by edge devices (also called clients) using their local data, without sharing that data with the central coordinator (also known as the server) or with other devices. The only information shared between the clients and the server is the update to the global model after local client training, which helps preserve the privacy of the local client datasets. Additionally, since training happens on the edge devices, datasets need not be communicated from the clients to the server, lowering communication costs. The single-model federated learning problem has been widely studied from various perspectives, including optimizing the frequency of communication between the clients and the server [26], designing client selection algorithms for the setting where only a subset of devices trains the model in each iteration [14], and measuring the effect of partial device participation on the convergence rate [19].
Closest to this work, [3] explores the possibility of using federated learning to train multiple models simultaneously using the same set of edge devices; we refer to this as multi-model FL. In the setting considered in [3], each device can train only one model at a time. The algorithmic challenge is to determine which model each client trains in any given round. Simulations illustrate that multiple models can indeed be trained simultaneously without a sharp drop in accuracy by splitting the clients into subsets in each round and using one subset to train each of the models. One key limitation of [3] is the lack of analytical performance guarantees for the proposed algorithms. In this work, we address this limitation by proposing algorithms for multi-model FL with provable performance guarantees.
1.1 Our Contributions
In this work, we focus on the task of training multiple models simultaneously using a shared set of clients. We propose two variants of the FedAvg algorithm [22] for the multi-model setting, with provable convergence guarantees. The first variant, called MultiFedAvgRandom (MFARand), partitions clients into groups in each round and matches the groups to the models in a uniformly randomized fashion. Under the second variant, called MultiFedAvgRoundRobin (MFARR), time is divided into frames such that each frame consists of multiple rounds. At the beginning of each frame, clients are partitioned into groups uniformly at random. Each group then trains the models in the rounds of the frame in a round-robin manner.
The error of a candidate multi-model federated learning algorithm, for each model, is defined as the distance between that model's global weights at a given round and its minimizer. Our analytical results show that when the objective function is strongly convex and smooth, upper bounds on the error decay with the number of rounds for both MFARand and MFARR, with MFARR enjoying the faster rate. The latter result holds when local stochastic gradient descent at the clients transitions to full gradient descent over time, and allows MFARR to be considerably faster than FedCluster [6], an algorithm similar to MFARR. Further, we study the performance gain of multi-model FL over single-model training under MFARand. We show via theoretical analysis that training multiple models simultaneously can be more efficient than training them sequentially, i.e., training only one model at a time on the same set of clients.
Via synthetic simulations, we show that MFARR typically outperforms MFARand. Intuitively, this is because under MFARR each client trains each model in every frame, while this is not necessarily the case under MFARand. Our data-driven experiments demonstrate that training multiple models simultaneously is advantageous over training them sequentially.
1.2 Related Work
The federated learning framework and the FedAvg algorithm were first introduced in [22]. The convergence of FedAvg has been studied extensively both under convexity assumptions [15, 19, 27] and in the non-convex setting [15, 20]. When data is homogeneous across clients and all clients participate in each round, FedAvg is known as LocalSGD, which is easier to analyse because of these assumptions. LocalSGD has been proved to converge with strictly less communication in [24], and to converge in the non-convex setting as well [8, 28, 25].
FedProx [17] handles data heterogeneity among clients and converges in the setting where the data across clients is non-iid. The key difference between FedAvg and FedProx is that the latter adds an additional proximal term to the loss function. However, the analysis of FedProx requires the proximal term to be present and therefore does not serve as a proof of convergence for FedAvg.
Personalised FL is an area where multiple local models are trained instead of a single global model. The purpose of having multiple models [10] is to have the local models reflect the specific characteristics of their data. The works in [13, 9, 1] use a combination of global and local models. Although there are multiple models, the underlying objective is the same. This differentiates personalised FL from our setting, which involves multiple unrelated models.
Clustered Federated Learning, proposed in [23], performs FedAvg and bi-partitions the client set in turn until convergence is reached. The idea is to first perform vanilla FL, then recursively bi-partition the set of clients into two clusters and perform FL within the clusters until a more refined stopping criterion is met; the second step is a post-processing technique. [5] studies distributed SGD in heterogeneous networks, where distributed SGD is performed within sub-networks in a decentralized manner, and model averaging happens periodically among neighbouring sub-networks.
FedCluster, proposed in [6], is very similar to the second algorithm we propose in this work. In [6], convergence is guaranteed in a non-convex setting. However, the advantages of a convex loss function and the effect of the data sample size are not explored in that work. Further, the clusters there are fixed throughout the learning process.
Training multiple models in a federated setting has also been explored in [16]. However, the setting and the model training methodology are different from ours [3]. In [16], the distribution of clients among models is approached as an optimization problem and a heuristic algorithm is proposed; however, neither an optimal algorithm nor convergence guarantees are provided.
2 Setting
We consider the setting where multiple models are trained simultaneously in a federated fashion. The server has a global version of each of the models, and the clients have local datasets (possibly heterogeneous) for every model. The total number of clients available for training is denoted by . We consider full device participation, with the set of clients divided equally among the models in each round. In every round, each client is assigned exactly one model to train. Each client receives the current global weights of the model it is assigned and performs iterations of stochastic gradient descent locally at a fixed learning rate of (which can change across rounds indexed by ) using its local dataset. It then sends the local update for its model back to the server. The server then averages the received weight updates to obtain the global model for the next round.
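As a rough sketch of one communication round described above (a hypothetical Python illustration, not the paper's implementation; `multi_model_fl_round`, `local_train`, and `assignment` are names we introduce here):

```python
import numpy as np

def multi_model_fl_round(global_models, local_train, assignment):
    """One synchronous round of multi-model FL (illustrative sketch).

    global_models: list of per-model global weight vectors
    local_train:   callable (client, weights) -> locally updated weights,
                   standing in for the client's local SGD iterations
    assignment:    dict mapping each client to the single model it trains
    """
    updates = {m: [] for m in range(len(global_models))}
    for client, m in assignment.items():
        # each client trains exactly one model on its local data
        updates[m].append(local_train(client, global_models[m].copy()))
    # the server averages the received local weights per model
    return [np.mean(updates[m], axis=0) if updates[m] else global_models[m]
            for m in range(len(global_models))]
```

Under full device participation, every model receives updates from an equal-sized group of clients in every round.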
Algorithm 1 details the training process of multi-model FL with any client selection algorithm . Table 1 lists the variables used in the training process and their purpose.
Variable Name  Description 

round number  
Number of Models  
Set of all clients  
Total number of clients =  
Client selection algorithm  
trainModel[]  Model assigned to client 
globalModel[]  Global weights of model 
localModel[]  Weights of model of client 
The global loss function of the model is denoted by , and the local loss function of the model for the client is denoted by . For simplicity, we assume that, for every model, each client has the same number of data samples, . We therefore have the following global objective function for each model,
(1) 
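With equal local dataset sizes, the global objective in (1) reduces to the uniform average of the clients' local losses. A minimal numeric sketch of this reduction (`global_loss` is a name we introduce; the local losses are passed in as callables):

```python
import numpy as np

def global_loss(local_losses, w):
    """Global objective of one model under equal data sizes:
    the uniform average of the clients' local losses F_k(w)."""
    return float(np.mean([F_k(w) for F_k in local_losses]))
```

For instance, with two quadratic local losses (w - 1)^2 and (w + 1)^2, the global objective is minimized at w = 0 even though neither local loss is.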
Additionally, . We make some standard assumptions on the local loss function, also used in [7, 19]. These are,
Assumption 1
All are strongly convex.
Assumption 2
All are smooth.
Assumption 3
Stochastic gradients are bounded: .
Assumption 4
For each client k, , where is the loss function of the data point of the client’s model. Then for each ,
for some constants and .
2.1 Performance Metric
The error of a candidate multi-model federated learning algorithm is the distance between the model's global weights at round , denoted by , and the minimizer of , denoted by . Formally,
(2) 
3 Algorithms
We consider two variants of the MultiFedAvg algorithm proposed in [3]. For convenience, we assume that is an integral multiple of .
3.1 MultiFedAvgRandom (MFARand)
The first variant partitions the set of clients into equal-sized subsets in every round . The subsets are created uniformly at random, independent of all past and future choices, and are then matched to the models with the matching chosen uniformly at random.
Algorithm 2 details the sub-process of MFARand invoked when the client-model assignment step (step 6) runs in Algorithm 1. An example involving 3 models over 6 rounds is worked out in Table 2.
Round  Model 1  Model 2  Model 3 

1  
2  
3  
4  
5  
6 
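The per-round assignment rule of MFARand described above can be sketched as follows (an illustrative Python rendering, not the paper's Algorithm 2; `mfa_rand_assignment` is a name we introduce):

```python
import random

def mfa_rand_assignment(clients, num_models, rng=random):
    """MFARand assignment for one round: partition the clients into
    equal-sized subsets uniformly at random, then match subsets to
    models with a uniformly random matching."""
    assert len(clients) % num_models == 0
    shuffled = list(clients)
    rng.shuffle(shuffled)                      # random partition
    size = len(shuffled) // num_models
    groups = [shuffled[j * size:(j + 1) * size] for j in range(num_models)]
    models = list(range(num_models))
    rng.shuffle(models)                        # random matching of groups to models
    return {m: g for m, g in zip(models, groups)}
```

Each call is independent of all past and future calls, matching the independence across rounds assumed in the analysis.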
3.2 MultiFedAvgRound Robin (MFARR)
The second variant partitions the set of clients into equal-sized subsets once every rounds. The subsets are created uniformly at random, independent of all past and future choices. We refer to the block of rounds during which the partition remains unchanged as a frame. Within each frame, each of the subsets is mapped to each model exactly once in a round-robin manner. Specifically, let the subsets created at the beginning of a frame be denoted by for . Then, in the round in the frame, for , is matched to model .
Algorithm 3 details the sub-process of MFARR invoked when the client-model assignment step (step 6) runs in Algorithm 1. An example involving 3 models over 6 rounds is worked out in Table 3.
Round  Model 1  Model 2  Model 3 

1  
2  
3  
4  
5  
6 
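The frame-based schedule of MFARR described above can be sketched as follows (an illustrative Python rendering, not the paper's Algorithm 3; `mfa_rr_schedule` and the `(j + r) mod M` round-robin rule are our reading of the text):

```python
import random

def mfa_rr_schedule(clients, num_models, num_frames, rng=random):
    """MFARR schedule: repartition the clients once per frame; within a
    frame of num_models rounds, subset j trains model (j + r) mod M in
    round r, so every subset meets every model exactly once per frame."""
    n = len(clients)
    assert n % num_models == 0
    size = n // num_models
    schedule = []                              # one dict per round: model -> subset
    for _ in range(num_frames):
        shuffled = list(clients)
        rng.shuffle(shuffled)                  # new random partition each frame
        subsets = [shuffled[j * size:(j + 1) * size] for j in range(num_models)]
        for r in range(num_models):
            schedule.append({(j + r) % num_models: subsets[j]
                             for j in range(num_models)})
    return schedule
```

A consequence of this design is that, within any single frame, every client trains every model exactly once.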
Remark 1
An important difference between the two algorithms is that under MFARR, each client-model pair is used exactly once in each frame, whereas under MFARand, a specific client-model pair can remain unmatched over consecutive timeslots with nonzero probability.

4 Convergence of MFARand and MFARR
In this section, we present our analytical results for MFARand and MFARR.
The result follows from the convergence of FedAvg with partial device participation in [19]. Note that and influence the lower bound on the number of iterations, , required to reach a certain accuracy. Increasing the number of models, , increases , which in turn increases .
Theorem 4.2
Here, we require that the data sample used in the SGD iterations at the clients converges sublinearly to the full dataset. For convergence, the only requirement is for to be proportional to the above-defined .
We observe that , , and influence the lower bound on the number of iterations, , required to reach a certain accuracy. Increasing the number of models, , increases , which in turn increases .
A special case of Theorem 4.2 is when we employ full gradient descent instead of SGD at clients. Naturally, in this case, .
Corollary 1
Remark 2
MFARR, when viewed from the perspective of one of the models, is very similar to FedCluster. However, there are some key differences between the analytical results.
First, FedCluster assumes that the SGD at each client has a fixed bounded variance for any sample size (along with Assumption 3). This is different from our Assumption 4: when Assumption 4 is coupled with Assumption 3, we get a sample-size-dependent bound on the variance. A smaller variance is naturally expected for a larger data sample, so our assumption is less restrictive. Second, the effect of an increasing sample size (or the full sample) is not studied in [6]. We also see the effect of strong convexity on the speed of convergence. The convergence result from [6] is as follows,
If we apply the strong convexity assumption to this result and use that , we get
Applying the Cauchy-Schwarz inequality to the LHS, we get,
Finally, we have
for any sampling strategy. With an increasing sample size (or the full sample), we obtain convergence. This is a significant improvement over the convergence result of FedCluster.
5 Advantage of MultiModel FL over single model FL
We quantify the advantage of multi-model FL over single-model FL by defining the gain of a candidate multi-model federated learning algorithm over FedAvg [22], which trains only one model at a time.
Let be the number of rounds needed by one of the models using FedAvg to reach an accuracy level (distance of the model's current weights from its minimizer) of . We assume that all models are of similar complexity, so each model reaches the required accuracy in roughly the same number of rounds. Therefore, the cumulative number of rounds needed to ensure all models reach an accuracy level of using FedAvg is . Further, let the number of rounds needed to reach an accuracy level of for all models under be denoted by . We define the gain of algorithm for a given as
(3) 
Note that FedAvg and use the same number of clients in each round; thus the comparison is fair. Further, we use the bounds in Theorems 4.1 and 4.2 as proxies for calculating for MFARand and MFARR, respectively.
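One consistent reading of the gain definition above (an assumption on our part, since Eq. (3) is not reproduced here) is the ratio of the cumulative sequential round count to the multi-model round count; `gain` and its arguments are names we introduce:

```python
def gain(t_fedavg, t_multi, num_models):
    """Gain of a multi-model algorithm over sequential FedAvg, read off
    the surrounding definition: sequential training of all models needs
    num_models * t_fedavg rounds in total, while the multi-model
    algorithm needs t_multi rounds for all models to reach the same
    accuracy level."""
    return num_models * t_fedavg / t_multi
```

Under this reading, a gain above 1 means that training the models simultaneously is cheaper, in total rounds, than training them one after another.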
Theorem 5.1
When and , the following holds for the gain of model MFARand over running FedAvg times
for all and for when .
We see that the gain increases with up to , after which we are in the case. At , each model is trained by only one client, which is too few, especially when is large.
For the case of , Theorem 5.1 puts a lower bound on . However, this case is rarely used in practice: one of the main advantages of FL is the lower communication cost due to local training, and this benefit is not utilised when .
6 Simulations in strongly convex setting
6.1 Simulation Framework
We take inspiration from the framework presented in [19], where a quadratic loss function, which is strongly convex, is minimized in a federated setting. We employ the MFARand and MFARR algorithms and compare their performance in this strongly convex setting. In addition, we measure the gain of MFARand and MFARR over sequentially running FedAvg in this setting.
The global loss function is
(4) 
where , , and . We have,
(5) 
(6) 
Here the pair represents the client. Following is the definition of
(7) 
where is a matrix in which the element is 1 and all other elements are 0. is defined as follows
and is defined as follows
(8) 
We finally define local loss function of the client as
(9) 
which satisfies our problem statement .
Remark 3
We simulate the minimization of a single global loss function even though we are studying multi-model learning. The reason is that these simulations are done from the perspective of one of the models. Therefore, -model MFARand boils down to sampling clients independently every round, while -model MFARR remains the same (going over the subsets in each frame). Furthermore, the gain simulations here assume that all models are of the same complexity.
6.2 Comparison of MFARand and MFARR
We consider the scenario where , and . We take , meaning 5 local SGD iterations at clients. We track the log distance of the current global loss from the global loss minimum, that is
(10) 
for 1000 rounds. We consider both constant and decaying learning rate for and .
As one can observe in Fig. 0(a) and Fig. 1(a), MFARand and MFARR have similar mean performance. However, Fig. 0(b) and Fig. 1(b) reveal that the randomness involved in MFARand is considerably higher than that in MFARR, showing the latter to be more reliable.
It is evident from Fig. 2(a) and Fig. 3(a) that MFARR, on average, performs better than MFARand when is high. Again, Fig. 2(b) and Fig. 3(b) show that MFARand has considerably higher variance than MFARR.
Remark 4
In this set of simulations, each client performs full gradient descent. While the analytical upper bounds on the errors suggest an order-wise difference in the performance of MFARR and MFARand, we do not observe as significant a difference between the two algorithms. This is likely because our analysis of MFARR exploits the fact that each client performs full gradient descent, while the analysis of MFARand adapted from [19] does not.
6.3 Gain vs
We test with clients for and . The gain vs plots in Fig. 5 show that the gain increases with for both MFARand and MFARR.
7 Data Driven Experiments on Gain
We use the Synthetic(1,1) [17, 18] and CelebA [4, 21] datasets for these experiments. The learning task in Synthetic(1,1) is multi-class logistic regression classification of feature vectors. Synthetic(1,1)A involves 60-dimensional feature vectors classified into 5 classes, while Synthetic(1,1)B involves 30-dimensional feature vectors classified into 10 classes. The CelebA dataset involves binary classification of face images based on a certain facial attribute (for example, blond hair, smiling, etc.) using convolutional neural networks (CNNs). The dataset has many options for the facial attribute.
The multi-model FL framework for training multiple unrelated models simultaneously was first introduced in our previous work [3]. We use the same framework for these experiments. We first find the gain vs trend for Synthetic(1,1)A, Synthetic(1,1)B, and CelebA. Then, we simulate real-world scenarios where each of the models is a different learning task.
7.1 Gain vs
Here, instead of using different tasks as the models, all models are the same underlying task. The framework, however, treats the models as independent of each other. This ensures that the models are of equal complexity.
We plot gain vs for two kinds of scenarios. First, when all clients are used in every round; Theorem 5.1 assumes this scenario, and we call it full device participation. Second, when only a sample of the entire set of clients is selected in each round (and then distributed among the models); we call this partial device participation, as a client has a nonzero probability of being idle during a round.
7.1.1 Full device participation:
For Synthetic(1,1)A, we have clients and . With single-model FedAvg, we get 51% training accuracy and 51% test accuracy at the end of 70 rounds.
For Synthetic(1,1)B, we have clients and . With single-model FedAvg, we get 42.7% training accuracy and 42.8% test accuracy at the end of 70 rounds.
For CelebA, we have 96 clients and . We get 79% training accuracy and 68% test accuracy at the end of 75 rounds of single model FedAvg.
For full device participation, we observe from Fig. 6 that the gain increases with for both the Synthetic(1,1) and CelebA datasets, with the trend being sublinear in nature.
7.1.2 Partial device participation:
For Synthetic(1,1)A, we have clients (out of which 32 are sampled every round) and . During single model FedAvg, we get 61.1% training accuracy and 61.3% test accuracy at the end of 200 rounds.
For Synthetic(1,1)B, we have clients (out of which 32 are sampled every round) and . During single model FedAvg, we get 58.4% training accuracy and 57.7% test accuracy at the end of 200 rounds.
For CelebA, we have 96 clients (out of which 24 are sampled every round) and . We get 78% training accuracy and 71.5% test accuracy at the end of 75 rounds of single model FedAvg.
With partial device participation, for all three datasets, we observe in Fig. 7 that the gain increases with for the most part, while decreasing at some instances. Although there are some dips, the gain is always found to be more than 1.
Remark 5
It is important to note that the learning task for the CelebA dataset involves CNNs, making the problem non-convex in nature. This, however, does not severely impact the gain, as we still observe the gain to always increase with for full device participation.
Remark 6
Although Theorem 5.1 assumes full device participation, we see the benefit of multi-model FL in the partial device participation scenario as well. For all three datasets, the gain is always found to be significantly greater than 1.
7.2 Realworld Scenarios
We perform two types of real-world experiments, one involving models that are similar in some aspect and the other involving completely different models. In these experiments, denotes the number of rounds for which single-model FedAvg was run for each model. Further, denotes the number of rounds of multi-model FL after which each model achieves an accuracy that is at least what was achieved with rounds of single-model FedAvg.
7.2.1 Similar models:
The first experiment tests Theorem 5.1, where each of the models is a binary image classification based on a unique facial attribute, using the CelebA dataset. Table 4 shows the results of our experiment.
Based on the values of and from Table 4, we have the following for the training and testing cases.

Gain in training =

Gain in testing =
Facial attribute  Train Accuracy  Test Accuracy  

for classification  
Eyeglasses  74.7  84.6  69.1  71.9 
Baldness  74.0  74.5  66.8  69.4 
Goatee  74.3  83.5  64.7  68.7 
Wearing Necklace  73.3  80.3  66.2  72.6 
Smiling  72.0  78.7  76.0  79.7 
Moustache  74.1  82.1  65.4  72.1 
Male  77.1  85.0  58.7  63.5 
Wearing Lipstick  75.4  83.8  65.9  72.7 
Double Chin  75.3  83.9  63.9  69.3 
7.2.2 Completely different models:
In the second experiment, we run a mixed-model setup where one model is logistic regression (Synthetic(1,1) with 60-dimensional vectors into 5 classes) and the other model is a CNN (binary classification of face images based on the presence of eyeglasses).
Based on and from Table 5, we get the following values of gain for the training and testing cases,

Gain in training =

Gain in testing =
Model  Train Accuracy  Test Accuracy  

Type  
Logistic Regression  51.9  52.4  52.8  55.6 
Convolutional NN  86.7  87.2  75.8  77.5 
8 Conclusions
In this work, we focus on the problem of using Federated Learning to train multiple independent models simultaneously using a shared pool of clients. We propose two variants of the widely studied FedAvg algorithm for the multi-model setting, called MFARand and MFARR, and show their convergence. In the case of MFARR, we show that an increasing data sample size (for the client-side SGD iterations) greatly improves the speed of convergence.
Further, we propose a performance metric to assess the advantage of multi-model FL over single-model FL. We characterize conditions under which running MFARand for multiple models simultaneously is advantageous over running single-model FedAvg for each model sequentially. We perform experiments in strongly convex and convex settings to corroborate our analytical results. By running experiments in a non-convex setting, we see the benefits of multi-model FL in deep learning. We also run experiments that are outside the scope of the proposed setting, namely the partial device participation experiments and the real-world scenarios. Here too, we see an advantage in training multiple models simultaneously.
Further extensions of this work include the theoretical analysis of partial device participation scenarios, and convergence guarantees, if any, for unbiased client selection algorithms [3] in multi-model FL.
References
 [1] (2020) Federated residual learning. arXiv preprint arXiv:2003.12880. Cited by: §1.2.
 [2] (1996) Neurodynamic programming. Athena Scientific. Cited by: §2.
 [3] (2022) Multimodel federated learning. In 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 779–783. Cited by: §1.2, §1, §3, §7, §8.
 [4] (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: §7.
 [5] (2020) Multilevel local sgd for heterogeneous hierarchical networks. arXiv preprint arXiv:2007.13819. Cited by: §1.2.
 [6] (2020) FedCluster: boosting the convergence of federated learning via cluster-cycling. In 2020 IEEE International Conference on Big Data (Big Data), pp. 5017–5026. External Links: Document Cited by: §1.1, §1.2, Remark 2.
 [7] (2020) Client selection in federated learning: convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243. Cited by: §2.
 [8] (2015) Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD Thesis. Cited by: §1.2.
 [9] (2020) Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461. Cited by: §1.2.
 [10] (2019) Semi-cyclic stochastic gradient descent. In International Conference on Machine Learning, pp. 1764–1773. Cited by: §1.2.
 [11] (2012) Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing 34 (3), pp. A1380–A1405. Cited by: §2.
 [12] (2019) Convergence rate of incremental gradient and incremental newton methods. SIAM Journal on Optimization 29 (4), pp. 2542–2565. Cited by: §0.A.3.3, §0.A.3.
 [13] (2020) Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516. Cited by: §1.2.
 [14] (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. Cited by: §1.
 [15] (2020) A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. Cited by: §1.2.
 [16] (2021) An efficient multimodel training algorithm for federated learning. In 2021 IEEE Global Communications Conference (GLOBECOM), Vol. , pp. 1–6. External Links: Document Cited by: §1.2.
 [17] (2020) Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, pp. 429–450. Cited by: §1.2, §7.
 [18] (2019) Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497. Cited by: §7.
 [19] (2019) On the convergence of fedavg on noniid data. arXiv preprint arXiv:1907.02189. Cited by: §0.A.1.1, §0.A.2, §1.2, §1, §2, §4, §6.1, Remark 4.
 [20] (2019) Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126 5. Cited by: §1.2.
 [21] (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: §7.
 [22] (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §1.1, §1.2, §5.
 [23] (2020) Clustered federated learning: model-agnostic distributed multi-task optimization under privacy constraints. IEEE transactions on neural networks and learning systems 32 (8), pp. 3710–3722. Cited by: §1.2.
 [24] (2018) Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767. Cited by: §1.2.
 [25] (2018) Cooperative sgd: a unified framework for the design and analysis of communicationefficient sgd algorithms. arXiv preprint arXiv:1808.07576. Cited by: §1.2.
 [26] (2019) Adaptive communication strategies to achieve the best errorruntime tradeoff in localupdate sgd. In Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1, pp. 212–229. External Links: Link Cited by: §1.
 [27] (2020) Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems 33, pp. 6281–6292. Cited by: §1.2.
 [28] (2017) On the convergence properties of a step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012. Cited by: §1.2.
Appendix 0.A Appendix
0.a.1 Method of Analysis
We analyze the multi-model algorithms from the perspective of one of the models. Proving convergence for one of the models is enough, as Assumptions 1, 2, 3 and 4 hold for all the models. To that end, below are MFARand and MFARR from the perspective of one of the models.
In addition, we drop the time index (or frame index) of the set of subsets during the analysis of MFARand (or MFARR). This is because we analyse MFARand over one round and MFARR over one frame, during which the set of subsets remains fixed.
0.a.1.1 MFARand:
From the perspective of one of the models, MFARand is equivalent to sampling clients out of clients every round. We can therefore refer to the analysis of single-model FedAvg with partial device participation in [19].
0.a.1.2 MFARR:
A model goes over each of the client subsets (created at the start of ) exactly once during to . This means that a model goes over each one of exactly once during that frame.
0.a.2 Convergence of MFARand
In [19], the convergence of FedAvg for partial device participation is stated as,
(11) 
where,
(12) 
where,
(13) 
where is the upper bound on the variance of for any size of . We can then get its value by setting in Lemma 1.
Here, is the number of clients selected per round. Since we have , we substitute that in the expression of in [19], giving us
(14) 
One important thing to note is that is constant during the iterations in round . However, [19] uses a decreasing step size even during these iterations. This is why a factor of 4 is absent in inequality 14 compared to its counterpart in [19].
Using the Cauchy-Schwarz inequality, we have,
(15) 
We therefore have,
(16) 
0.a.3 Convergence of MFARR
We start by introducing some new notation, used only for the purposes of the proof. We first drop the subscript indicating the model number, as we need to prove convergence for only one of the models. Below are the notations used frequently in the proof; some are adopted from [12].

: frame number

: round in current frame

Local iteration: stochastic gradient descent iteration at local device

Global iteration: stochastic centralized gradient descent iteration (virtual)

: subset of clients to be used in the round of a frame (this may differ across frames but we analyse MFARR over a single frame and hence, do not index it by frame number)

: global weight (subscript dropped) at

: global weight vector (virtual) of centralized full GD iteration from

: local weight vector of local SGD iteration of client. Since a client is used by a model exactly once in a frame, there is no need to index this variable by the round number.

: learning rate for all rounds in frame.
The local update rule is
(17) 
where . Therefore, the update at the client is
(18) 
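The local update of Eqs. (17)-(18) can be sketched as follows (an illustrative helper we introduce, not the paper's code; `stoch_grad` stands in for the stochastic gradient on a sampled mini-batch, or the full gradient once the sample grows to the full dataset):

```python
import numpy as np

def local_update(w, stoch_grad, lr, num_local_iters):
    """A client's local update: starting from the received global
    weights w, take num_local_iters (stochastic) gradient steps at
    learning rate lr and return the new local weights."""
    w = np.array(w, dtype=float)
    for _ in range(num_local_iters):
        w = w - lr * stoch_grad(w)   # one local (S)GD iteration
    return w
```

The difference between the returned weights and the received weights is the update the client sends back to the server.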
The global update rule involves summing the weight updates from the clients in and multiplying the sum by a factor of .
(19) 
Over one frame, the above expression evaluates to
(20) 
Now we will compare this with global iterations of centralized GD,
(21) 
where and,
(22) 
So,
(23) 
We define error as,
(24)  
(25)  
(26) 
And we define as
(27) 
Therefore,
(28) 
We now track the expected distance between and . Subtracting on both sides of the above equation and taking expectation of norm, we get
(29) 
We state Lemmas 1, 2, 3 and 4 and use them to prove Theorem 4.2. The proofs of the lemmas can be found after the proof of Theorem 4.2.