One of the biggest challenges in Federated Learning (FL) is that client devices often have drastically different computation and communication resources for local updates. To this end, recent research efforts have focused on training heterogeneous local models obtained by pruning a shared global model. Despite empirical success, theoretical guarantees on convergence remain an open question. In this paper, we present a unifying framework for heterogeneous FL algorithms with arbitrary adaptive online model pruning and provide a general convergence analysis. In particular, we prove that under certain sufficient conditions and on both IID and non-IID data, these algorithms converge to a stationary point of standard FL for general smooth cost functions, with a convergence rate of O(1/√Q). Moreover, we illuminate two key factors impacting convergence: pruning-induced noise and a minimum coverage index, advocating a joint design of local pruning masks for efficient training.
Federated Learning (FL) allows distributed clients to collaborate and train a centralized global model without the transmission of local data. In practice, mobile and edge devices that are equipped with drastically different computation and communication capabilities are becoming the dominant source for FL [lim2020federated]. This has prompted significant recent attention to a family of FL algorithms focusing on training heterogeneous local models (often obtained through pruning a global model). It includes algorithms like HeteroFL [diao2021heterofl] that employ heterogeneous local models with fixed structures, algorithms utilizing pretrained local models like [frankle2019lottery], as well as algorithms like PruneFL [jiang2020model] that update local models adaptively during training. However, the success of these algorithms has only been demonstrated empirically (e.g., [diao2021heterofl, jiang2020model]). Unlike standard FL that has received rigorous theoretical analysis [wang2018cooperative, bonawitz2019towards, yu2019parallel, convergenceNoniid], the convergence of heterogeneous FL with adaptive online model pruning is still an open question. Little is known about whether such algorithms converge to a solution of standard FL.
To answer these questions, in this paper we present a unifying framework for heterogeneous FL algorithms with arbitrary adaptive online model pruning and provide a general convergence analysis. There have been many existing efforts to establish convergence guarantees for FL algorithms, such as the popular FedAvg [fedavg], on both IID and non-IID data distributions (throughout this paper, "non-IID data" means that the data among local clients are not independent and identically distributed), but all rely on the assumption that a single uniform model structure is shared by all client devices. By considering arbitrary pruning strategies in our framework, we formally establish the convergence conditions for a general family of FL algorithms with both (i) heterogeneous local models to accommodate different resource constraints on client devices and (ii) time-varying local models to continuously refine pruning results during training. We prove that these FL algorithms with arbitrary pruning strategies satisfying certain sufficient conditions can indeed converge (at a speed of O(1/√Q), where Q is the number of communication rounds) to a stationary point of standard FL for general smooth cost functions.
To the best of our knowledge, this is the first convergence analysis for heterogeneous FL with arbitrary adaptive online model pruning. The framework captures a number of existing FL algorithms as important special cases and provides a general convergence guarantee for them, including HeteroFL [diao2021heterofl] that employs fixed-structure local models, PruneFL [jiang2020model] that requires periodically training a full-size model, and S-GaP [ma2021effective] that can be viewed as a single-client version. Moreover, we show that the convergence gap is affected by both pruning-induced noise (modeled through a constant $\delta^2$) and a new notion of minimum coverage index $\Gamma_{\min}$ (i.e., any parameter in the global model is covered by at least $\Gamma_{\min}$ local models). In particular, it advocates a joint design of efficient local-model pruning strategies (e.g., leveraging [wen2016learning, li2016pruning, ciresan2011flexible]) for efficient training. Our results provide solid theoretical support for designing heterogeneous FL algorithms with efficient pruning strategies, while ensuring convergence similar to standard FL.
We carried out extensive experiments on two datasets, which suggest that for a given level of model sparsity, client models should also be designed to maximize the coverage index rather than only keeping the largest parameters through pruning. As an example, a federated learning network with 85% sparsity obtained via our design to maximize the coverage index achieves up to 8% improvement compared to the network generated by pruning with the identical model architecture, without incurring any additional computation overhead.
In summary, our paper makes the following key contributions:
We propose a unifying framework for heterogeneous FL with arbitrary adaptive online model pruning. It captures a number of existing algorithms (whose success has been empirically demonstrated) as special cases and enables convergence analysis.
The general convergence of these algorithms is established. On both IID and non-IID data, we prove that under standard assumptions and certain sufficient conditions on the pruning strategy, the algorithms converge to a stationary point of standard FL for smooth cost functions.
We further analyze the impact of key factors contributing to convergence and advocate a joint design of local pruning masks with respect to both the pruning-induced error and a new notion of minimum coverage index. The results are validated on the MNIST and CIFAR-10 datasets.
Standard Federated Learning. A standard Federated Learning problem considers distributed optimization over $N$ clients:

(1)    $\min_{x} \; F(x) \triangleq \sum_{n=1}^{N} p_n F_n(x), \quad \text{with} \quad F_n(x) = \mathbb{E}_{\xi \sim D_n}\big[\ell(x; \xi)\big].$

Here $x$ is the set of trainable weights/parameters, $F_n(x)$ is a cost function defined on the local data set $D_n$ with respect to a user-specified loss function $\ell$, and $p_n$ is the weight for the $n$-th client such that $p_n \ge 0$ and $\sum_{n=1}^{N} p_n = 1$. The FL procedure, e.g., FedAvg [fedavg], typically consists of a sequence of stochastic gradient descent steps performed distributedly on each local objective, followed by a central step collecting the workers' updated local parameters and computing an aggregated global parameter. For the $q$-th round of training, the central server first broadcasts the latest global model parameters $x_q$ to the clients $n = 1, \ldots, N$, who perform local updates as follows:

$x_q^{n,t} = x_q^{n,t-1} - \gamma \nabla F_n(x_q^{n,t-1}, \xi_q^{n,t-1}), \quad \text{with } x_q^{n,0} = x_q,$

where $\gamma$ is the local learning rate. After all available clients have concluded their local updates (in $T$ epochs), the server aggregates the parameters from them and generates the new global model for the next round, i.e., $x_{q+1} = \sum_{n=1}^{N} p_n x_q^{n,T}$. The formulation captures FL with both IID and non-IID data distributions.
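To make the update rule concrete, below is a minimal NumPy sketch of a few FedAvg rounds under this notation; the quadratic local objectives, client count, and sampling scheme are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T, gamma = 4, 10, 5, 0.1          # clients, model size, local epochs, learning rate
p = np.full(N, 1.0 / N)                 # aggregation weights p_n, summing to 1
A = [rng.standard_normal((20, d)) for _ in range(N)]   # toy local data (illustrative)
b = [rng.standard_normal(20) for _ in range(N)]

def local_grad(x, n):
    """Stochastic gradient of a toy least-squares objective F_n on one sampled batch."""
    idx = rng.choice(20, size=5, replace=False)
    return A[n][idx].T @ (A[n][idx] @ x - b[n][idx]) / 5

x_global = np.zeros(d)
for q in range(50):                     # communication rounds
    local_models = []
    for n in range(N):                  # each client starts from the broadcast global model
        x = x_global.copy()
        for t in range(T):              # T steps of local SGD
            x -= gamma * local_grad(x, n)
        local_models.append(x)
    x_global = sum(p[n] * local_models[n] for n in range(N))   # server aggregation
```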
Model Pruning. Model pruning via weights and connections pruning is a promising method for enabling efficient neural networks: it sets a proportion of weights and biases to zero, reducing both computation and memory usage. Most works on weight pruning require three phases of training: a pre-training phase, a pruning-to-sparse phase, and a fine-tuning phase. Consider a neural network with parameters $x$ and input data $D$. The pruning process takes $x$ as input and generates a new model $\hat{x} = m \odot x$, where $m \in \{0, 1\}^{|x|}$ is a binary mask denoting which parameters are set to zero and $\odot$ denotes element-wise multiplication. The pruning mask is computed from a certain pruning policy, e.g., layer-wise parameter pruning that removes weights below a certain percentile, or neuron pruning that removes neurons with small average weights. We use $\hat{x} = m \odot x$ to denote the pruned model, which has a reduced model size and is more efficient for communication and training.

Federated Averaging and Communication-Efficient FL. FedAvg [fedavg] is considered the first and the most commonly used federated learning algorithm, where in each round of training local clients train using their own data, with their parameters averaged at the central server. FedAvg reduces communication costs by training clients for multiple epochs locally. Several works have shown the convergence of FedAvg under different settings with both homogeneous (IID) data [wang2018cooperative, woodworth2018graph] and heterogeneous (non-IID) data [convergenceNoniid, bonawitz2019towards, yu2019parallel], even with partial client participation. Specifically, [yu2019parallel] demonstrated that local SGD achieves convergence for non-convex optimization, and [convergenceNoniid] established the convergence rate of FedAvg for strongly convex problems, where Q is the number of SGD iterations and N is the number of participating clients. Several works [karimireddy2020scaffold, wang2019adaptive, wang2019adaptive22] have been proposed to further reduce communication costs. One direction is to use data compression such as quantization [konevcny2016federated, bonawitz2019towards, mao2021communication, yao2021fedhm], sketching [alistarh2017qsgd, ivkin2019communication], split learning [thapa2020splitfed], and learning with gradient sparsity [han2020adaptive]. This line of work does not consider computation efficiency.
Neural Network Pruning and Sparsification. Neural network pruning is a popular approach to reducing the computation costs of a neural network. A magnitude-based prune-from-dense methodology [han2015learning, guo2016dynamic, yu2018nisp, liu2018rethinking, real2019regularized] is widely used, where weights smaller than a certain preset threshold are removed from the network. In addition, there are one-shot pruning at initialization [lee2018snip], iterative pruning approaches [zhu2017prune, narang2017exploring], and adaptive pruning approaches [lin2020dynamic, ma2021effective] that allow the network to grow and prune. In [frankle2019lottery, morcos2019one], a "lottery ticket hypothesis" was proposed: with an optimal substructure of the neural network acquired by weight pruning, directly training a pruned model can reach results similar to pruning a pretrained network. The other direction is sparse mask exploration [bellec2017deep, mostafa2019parameter, evci2020rigging], where sparsity in the neural network is maintained during the training process while the set of active weights is explored based on random or heuristic methods. [frankle2019lottery, mostafa2019parameter] empirically observed that training models with static sparse parameters converges to a solution with higher loss than models with dynamic sparse training. Note that efficient sparse matrix multiplication sometimes requires special libraries or hardware, e.g., the sparse tensor cores in the NVIDIA A100 GPU, to achieve an actual reduction in memory footprint and computational resources.
Efficient FL with Heterogeneous Neural Networks. Several works have been proposed to reduce both computation and communication costs, including approaches that utilize lossy compression and dropout techniques [caldas2018expanding, xu2019elfish]. Although early works mainly assume that all local models share the same architecture as the global model [li2020federated], recent works have empirically demonstrated that federated learning with heterogeneous client models, saving both computation and communication, is feasible. PruneFL [jiang2020model] proposed an approach with adaptive parameter pruning during federated learning. [li2021fedmask] proposed federated learning with personalized and structured sparse masks. HeteroFL [diao2021heterofl] proposed to generate heterogeneous local models as subnets of the global network by picking the leading contiguous parameters layer-wise with the help of a proposed static batch normalization, while [li2021hermes] finds small subnetworks by applying structured pruning. Despite their empirical success, these methods lack theoretical convergence guarantees even in convex optimization settings. We rigorously analyze the convergence of heterogeneous FL under arbitrary adaptive online model pruning and establish the conditions for converging to a stationary point of standard FL with general smooth cost functions. The theoretical results in this paper not only illuminate key convergence properties but also provide solid support for designing adaptive pruning strategies in heterogeneous FL algorithms.
In this paper, we focus on a family of FL algorithms that leverage adaptive online model pruning to train heterogeneous local models on distributed clients. By considering arbitrary pruning strategies in our formulation, we relax a number of key limitations in standard FL: (i) pruning masks are allowed to be time-varying, enabling online adjustment of pruned local models during the entire training process; (ii) the pruning strategies may vary across clients, making it possible to optimize the pruned local models with respect to individual clients' heterogeneous computing resources and network conditions. More precisely, we use a series of masks $m_q^n \in \{0, 1\}^{|x|}$ to model an adaptive online pruning strategy that may change the pruning mask for any round $q$ and any client $n$. Let $x_q$ denote the global model at the beginning of round $q$ and $\odot$ be the element-wise product. Thus, $m_q^n \odot x_q$ defines the trainable parameters of the pruned local model for client $n$ in round $q$. (While a pruned local model has a smaller number of parameters than the global model, we adopt the convention of writing it as $m_q^n \odot x_q$ with an element-wise product: only parameters corresponding to a 1-value in the mask are accessible and trainable in the local model.)
Here, we describe one round (say the $q$-th) of the algorithm. First, the central server employs a pruning function to prune the latest global model and broadcasts the resulting local models to the clients:

(2)    $x_q^{n,0} = m_q^n \odot x_q, \quad \forall n = 1, \ldots, N.$

Each client $n$ then trains the pruned local model by performing updates for $t = 1, \ldots, T$:

(3)    $x_q^{n,t} = x_q^{n,t-1} - \gamma \, m_q^n \odot \nabla F_n(x_q^{n,t-1}, \xi_q^{n,t-1}),$

where $\gamma$ is the learning rate and $\xi_q^{n,t-1}$ are independent samples uniformly drawn from the local data $D_n$. We note that $\nabla F_n(x_q^{n,t-1}, \xi_q^{n,t-1})$ is a local stochastic gradient evaluated using only the local parameters in $x_q^{n,t-1}$ (due to pruning) and that only locally trainable parameters are updated by the stochastic gradient (due to the element-wise product with mask $m_q^n$).

Finally, the central server aggregates the local models and produces an updated global model $x_{q+1}$. Due to the use of arbitrary pruning masks in this paper, global parameters are broadcast to and updated at different subsets of clients. To this end, we partition the global model into disjoint regions, such that the parameters of region $k$, denoted by $x_q^{(k)}$, are included in, and only in, the same subset of local models. Let $\mathcal{N}_q^{(k)}$ be the set of clients whose local models contain the parameters of region $k$ in round $q$ (clearly $\mathcal{N}_q^{(k)}$ is determined by the pruning masks, since $m_q^{n,(k)} = \mathbf{1}$ for $n \in \mathcal{N}_q^{(k)}$ and $m_q^{n,(k)} = \mathbf{0}$ otherwise). The global model update of region $k$ is performed by aggregating the local models at clients $n \in \mathcal{N}_q^{(k)}$, i.e.,

(4)    $x_{q+1}^{(k)} = \frac{1}{|\mathcal{N}_q^{(k)}|} \sum_{n \in \mathcal{N}_q^{(k)}} x_q^{n,T,(k)}.$
We summarize the algorithm in Algorithm 1.
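As a companion to Algorithm 1, the following is a minimal sketch of one round of the framework in Eqs. (2)-(4), assuming magnitude-based masks and placeholder local gradients; region-wise aggregation is implemented per coordinate by averaging only over the clients whose masks retain that coordinate. All sizes, pruning fractions, and gradient definitions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, T, gamma = 4, 12, 3, 0.05
x_global = rng.standard_normal(d)
prune_frac = [0.0, 0.25, 0.5, 0.5]      # per-client fraction of weights pruned (assumed)

def magnitude_mask(x, frac):
    """Binary mask keeping the (1 - frac) fraction of largest-magnitude entries."""
    m = np.ones_like(x)
    k = int(frac * x.size)
    if k > 0:
        m[np.argsort(np.abs(x))[:k]] = 0.0
    return m

def stoch_grad(x, n):                   # placeholder local stochastic gradient
    return x - (n + 1) + 0.1 * rng.standard_normal(x.size)

# Eq. (2): the server prunes the global model and broadcasts heterogeneous local models.
masks = [magnitude_mask(x_global, f) for f in prune_frac]
local_models = [m * x_global for m in masks]

# Eq. (3): masked local SGD -- only retained coordinates are updated.
for n in range(N):
    for t in range(T):
        local_models[n] -= gamma * masks[n] * stoch_grad(local_models[n], n)

# Eq. (4): per-coordinate (region-wise) averaging over the clients covering it.
coverage = np.sum(masks, axis=0)                       # |N_q^(k)| per coordinate
summed = np.sum([m * w for m, w in zip(masks, local_models)], axis=0)
x_global = np.where(coverage > 0, summed / np.maximum(coverage, 1), x_global)
```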
Remark 1. We hasten to note that our framework captures heterogeneous FL with arbitrary adaptive online pruning strategies, and so does our convergence analysis. It recovers many recently proposed FL algorithms as special cases with particular choices of masks, including HeteroFL [diao2021heterofl], which uses fixed masks over time; PruneFL [jiang2020model], which periodically trains a full-size local model; Prune-and-Grow [ma2021effective], which can be viewed as a single-client algorithm without parameter aggregation; and FedAvg [fedavg], which employs full-size local models at all clients. Our unifying framework provides solid support for incorporating arbitrary model pruning strategies (such as weight or neuron pruning, CNN pruning, and sparsification) into heterogeneous FL algorithms. Our analysis establishes general conditions for any heterogeneous FL with arbitrary adaptive online pruning to converge to standard FL.
We make the following assumptions on the local cost functions and gradients. Assumption 1 is standard. Assumption 2 follows from [ma2021effective] and implies that the noise introduced by pruning is relatively small and bounded. Assumptions 3 and 4 are standard for FL convergence analysis, following [zhang2013communication, stich2018local, yu2019parallel, convergenceNoniid], and assume the stochastic gradients to be bounded and unbiased.
Assumption 1 (Smoothness). Cost functions $F_1, \ldots, F_N$ are all $L$-smooth: for any $x, y$, we assume that there exists $L > 0$ such that

(5)    $\|\nabla F_n(x) - \nabla F_n(y)\| \le L \|x - y\|, \quad \forall n.$

Assumption 2 (Pruning-induced Noise). We assume that for some $\delta^2 \in [0, 1)$ and any $q, n$, the pruning-induced error is bounded by

(6)    $\|x_q - m_q^n \odot x_q\|^2 \le \delta^2 \|x_q\|^2.$

Assumption 3 (Bounded Gradient). The expected squared norm of stochastic gradients is bounded uniformly, i.e., for a constant $G > 0$ and any $q, n, t$:

(7)    $\mathbb{E}\|\nabla F_n(x_q^{n,t}, \xi_q^{n,t})\|^2 \le G^2.$

Assumption 4 (Gradient Noise for IID data). Under the IID data distribution, for any $q, n, t$, we assume that

(8)    $\mathbb{E}[\nabla F_n(x, \xi_q^{n,t})] = \nabla F(x),$
(9)    $\mathbb{E}\|\nabla F_n(x, \xi_q^{n,t}) - \nabla F(x)\|^2 \le \sigma^2,$

for a constant $\sigma^2 > 0$ and independent samples $\xi_q^{n,t}$.
We now analyze heterogeneous FL under arbitrary adaptive online pruning. To the best of our knowledge, this is the first proof that shows general convergence of this family of algorithms to a stationary point of standard FL (in Section 2.1) with smooth cost functions. We first show convergence for IID data distributions and then, by replacing Assumption 4 with a similar Assumption 5, show convergence for non-IID data distributions. We define an important value:

(10)    $\Gamma_{\min} = \min_{q, k} \; |\mathcal{N}_q^{(k)}|,$

referred to in this paper as the minimum covering index. Since $|\mathcal{N}_q^{(k)}|$ is the number of local models containing the parameters of region $k$, $\Gamma_{\min}$ measures the minimum occurrence of any parameter in the local models. Intuitively, if a parameter is never included in any local model, it is impossible for it to be updated. Thus, conditions based on the covering index are necessary for convergence to standard FL. Our analysis establishes sufficient conditions for convergence. All proofs are collected in the Appendix.
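Given the masks of one round, the minimum covering index is simply the smallest per-coordinate coverage count. A short check with illustrative masks:

```python
import numpy as np

masks = np.array([[1, 1, 0, 0],          # illustrative masks for 4 clients over 4 regions
                  [0, 1, 1, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 1]])
coverage = masks.sum(axis=0)             # |N_q^(k)| for every parameter region k
gamma_min = int(coverage.min())          # minimum covering index; here 2
assert gamma_min >= 1, "some parameters are never trained by any client"
```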
Theorem 1. Under Assumptions 1-4 and for arbitrary pruning satisfying $\Gamma_{\min} \ge 1$, heterogeneous FL with adaptive online pruning converges to a stationary point of standard FL at rate $O(1/\sqrt{Q})$, up to an error term that is proportional to the pruning-induced noise $\delta^2$ and the average squared model norm, where the constants in the bound depend on the initial model parameters and the gradient noise.
Remark 2. Theorem 1 shows convergence to a stationary point of standard FL as long as $\Gamma_{\min} \ge 1$ (albeit with some pruning-induced noise). The result is somewhat surprising, since $\Gamma_{\min} \ge 1$ only requires every parameter to be included in at least one local model (which is necessary for all parameters to be updated during training). But we show that this is also a sufficient condition for convergence to standard FL. Moreover, we establish a convergence rate of $O(1/\sqrt{Q})$ for arbitrary pruning strategies satisfying the condition.
Remark 3. Impact of pruning-induced noise. In Assumption 2, we assume the pruning-induced noise is relatively small and bounded with respect to the global model: $\|x_q - m_q^n \odot x_q\|^2 \le \delta^2 \|x_q\|^2$. This is satisfied in practice since most pruning strategies tend to eliminate weights/neurons that are insignificant, therefore keeping $\delta^2$ indeed small. We note that pruning incurs an error term in our convergence analysis, which is proportional to $\delta^2$ and the average model norm (averaged over the $Q$ rounds). It implies that more aggressive pruning in heterogeneous FL may lead to a larger error, deviating from standard FL at a speed quantified by $\delta^2$. We also note that this error is affected by both $\delta^2$ and $\Gamma_{\min}$.
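The pruning-induced noise of Assumption 2 can be estimated empirically for a given mask as the ratio $\|x - m \odot x\|^2 / \|x\|^2$. A small sketch with a random weight vector (the Gaussian weights and the 75% pruned fraction are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
frac = 0.75                               # fraction of weights removed (illustrative)
m = np.ones_like(x)
m[np.argsort(np.abs(x))[: int(frac * x.size)]] = 0.0

delta_sq = np.sum((x - m * x) ** 2) / np.sum(x ** 2)   # pruning-induced error ratio
print(f"empirical delta^2 = {delta_sq:.3f}")           # smaller than frac, since the
                                                       # removed weights have the smallest magnitudes
```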
Remark 4. Impact of minimum covering index $\Gamma_{\min}$. It turns out that the minimum number of occurrences of any parameter in the local models is a key factor deciding convergence. As $\Gamma_{\min}$ increases, both the constants and the convergence error decrease, being inversely proportional to $\Gamma_{\min}$. This result is a bit counterintuitive, since certain parameters should be small enough to ignore in pruning. However, recall that our analysis shows convergence of all parameters to a stationary point of standard FL (rather than of a subset of parameters, or to a random point). The more times a parameter is covered by local models, the sooner it gets updated and converges to the desired target. This is quantified in our analysis by showing that the error term due to pruning noise decreases at the rate of $1/\Gamma_{\min}$.
Remark 5. When the cost function is strongly convex (e.g., for softmax classifiers, logistic regression, and linear regression with $\ell_2$-regularization), a stationary point becomes the global optimum. Thus, Theorem 1 shows convergence to the global optimum of standard FL for strongly convex cost functions.

Remark 6. Theorem 1 inspires new designs of adaptive online pruning for heterogeneous FL. Since the convergence gap is affected by both the pruning-induced noise $\delta^2$ and the minimum covering index $\Gamma_{\min}$, we may want to design pruning masks that preserve the largest parameters while sufficiently covering all parameters across the different local models. The example shown in Figure 1 illustrates three alternative pruning strategies for the clients. It can be seen that to achieve the best performance in heterogeneous FL, pruning masks need to be optimized to mitigate noise and achieve a high covering index. Due to space limitations, optimal pruning mask design with respect to clients' resource constraints will be considered in future work. We present numerical examples with different pruning mask designs (with improved performance at both low and high pruning levels) in Section 5 to support this observation.
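As a toy illustration of this trade-off (with a purely hypothetical 4-region, 4-client layout), the two assignments below give every client the same sparsity, yet one leaves some regions uncovered ($\Gamma_{\min} = 0$, so those parameters are never trained) while the other covers every region at least twice:

```python
import numpy as np

# Two ways to assign 50%-sparse masks to 4 clients over 4 equal parameter regions.
greedy   = np.array([[1, 1, 0, 0]] * 4)                    # all clients keep the same "large" half
covering = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
                     [0, 0, 1, 1], [0, 0, 1, 1]])          # same sparsity, full coverage

for name, masks in [("greedy", greedy), ("covering", covering)]:
    print(name, "Gamma_min =", int(masks.sum(axis=0).min()))
# greedy   Gamma_min = 0  -> regions 3-4 are never updated
# covering Gamma_min = 2  -> every region is trained by at least two clients
```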
When the data distribution is non-IID, we need a stronger assumption to ensure that stochastic gradients computed on a subset of clients' datasets still provide an unbiased estimate for each parameter region. To this end, we replace Assumption 4 by a similar Assumption 5 for non-IID data.

Assumption 5 (Gradient Noise for non-IID data). Under the non-IID data distribution, we assume that for a constant $\sigma^2 > 0$ and any $q, k, t$, the stochastic gradients aggregated over the clients in $\mathcal{N}_q^{(k)}$ are unbiased for each region, i.e., $\mathbb{E}\big[\frac{1}{|\mathcal{N}_q^{(k)}|}\sum_{n \in \mathcal{N}_q^{(k)}} \nabla^{(k)} F_n(x, \xi_q^{n,t})\big] = \nabla^{(k)} F(x)$, with variance bounded by $\sigma^2$.
Theorem 2. Under Assumptions 1-3 and 5, heterogeneous FL with an arbitrary adaptive online pruning strategy satisfying $\Gamma_{\min} \ge 1$ converges to a stationary point of standard FL at rate $O(1/\sqrt{Q})$, up to an error term proportional to the pruning-induced noise $\delta^2$ and the average squared model norm; the constants differ from those in Theorem 1 but depend on the same quantities (the initial model parameters and the gradient noise).
Remark 7. With Assumption 5, the convergence under non-IID data distributions is very similar to that in Theorem 1, except for different constants. Thus, most remarks made for Theorem 1, including the convergence speed, pruning-induced noise, and pruning mask design, still apply. We notice that $\Gamma_{\min}$ no longer plays a role in the gradient-noise part of the convergence error. This is because the stochastic gradients computed by the different clients in $\mathcal{N}_q^{(k)}$ are now based on different datasets and only jointly provide an unbiased estimate, no longer resulting in smaller statistical noise through averaging.
Remark 8. We note that Assumption 5 can be satisfied in practice by jointly designing the pruning masks and the data partitions among the clients. For example, for four clients with local data $D_1, D_2, D_3, D_4$ respectively, we can design four pruning masks whose regions are arranged so that the clients covering each region jointly hold data representative of the global distribution; it is easy to show that such designs satisfy the gradient noise assumption for the non-IID data distribution. Due to space limitations, optimal pruning mask design based on data partitioning will be considered in future work. Nevertheless, we present numerical examples under non-IID data distributions with different pruning mask designs in Section 5. When the conditions in Theorem 2 are satisfied, we observe convergence and significant improvement in performance.
In this section, we evaluate different pruning techniques from state-of-the-art designs and verify the proposed theory under our unifying pruning framework using two datasets. Unless stated otherwise, the reported accuracy is averaged over three random seeds with the same randomly initialized starting model. We focus on three points in our experiments: (i) the general convergence of federated learning with heterogeneous models obtained by pruning; (ii) the impact of the minimum coverage index $\Gamma_{\min}$; and (iii) the impact of the pruning-induced noise $\delta^2$. The experimental results provide a comprehensive comparison among several pruning techniques, together with our modified designs, to verify the correctness of the theory.
We examine the theoretical results on two common image classification datasets, MNIST [minist] and CIFAR-10 [krizhevsky2009learning], among distributed workers with IID and non-IID data under a fixed participation ratio. For IID data, we follow the design of balanced MNIST by [convergenceNoniid] and similarly obtain a balanced CIFAR-10. For non-IID data, we obtain a balanced partition with a skewed label distribution, where the samples on each device cover at most two out of the ten possible classes.
To empirically verify the correctness of our theory, we pick FedAvg, which can be considered federated learning with full local models, and four other pruning techniques from state-of-the-art federated learning designs with heterogeneous models as baselines. (FullNets can be considered as FedAvg [fedavg] without any pruning. For notational simplicity, we use "WP" for weight pruning as used in PruneFL [jiang2020model], "NP" for neuron pruning as used in [shao2019privacy], "FS" for a fixed subnetwork as used in HeteroFL [diao2021heterofl], and "PT" for pruning with a pretrained mask as used in [frankle2019lottery].) Let $s$ denote the sparsity of a mask, e.g., $s = 0.75$ for a model with 25% of its weights pruned. Due to page limits, we show selected combinations over three pruning levels named L (Large), M (Medium), and S (Small): L, 60% of workers with the full model and 40% with a 75%-pruned model; M, 40% of workers with the full model and 60% with a 75%-pruned model; S, 10% of workers with the full model, 30% with a 75%-pruned model, and 60% with a 50%-pruned model.
For each round, 10 devices are randomly selected to run $T$ local steps of SGD. We evaluate the averaged model after each global aggregation on the corresponding global objective and show the global loss in Figure 2. We present the key model characteristics as well as the model accuracy after training on MNIST (IID and non-IID) and CIFAR-10 in Table 1 and Table 2. The FLOPs and Space columns stand for the amortized FLOPs of one local step and the memory space needed to store the parameters of one model, with their ratios representing the corresponding communication and computation savings compared with FedAvg, which uses a full-size model.
In the experiments, NP, FS, WP, and PT use the same architecture, but the latter two are trained with sparse parameters, while FS and NP are trained on actually reduced network sizes. To better exemplify and examine the results, we run all experiments on small model architectures: an MLP with a single hidden layer for MNIST and a LeNet-5-like network with two convolutional layers for CIFAR-10. Since some large DNN models have been shown to maintain their performance under a reasonable level of pruning, we use smaller networks to avoid the potential influence of very large networks, as well as other tricks and model characteristics specific to each framework. More details regarding the models and experiment design can be found in Appendix 2. Additional results, including other possible combinations, pruning details with analysis, and other experiment details, can be found in Appendix 3.
Table 1. Model characteristics and accuracy (%) on MNIST (IID and non-IID).

Model    | FLOPs   | Space | Ratio | Γ_min | Accuracy IID | Accuracy non-IID (Local) | Accuracy non-IID (Global)
FullNets | 158.8K  | 1.27M | 1.00  | 10    | 98.01        | 93.82                    | 93.59
WPL1     | 143.12K | 1.15M | 0.90  | 6     | 98.18        | 95.49                    | 95.15
NPL1     | 142.9K  | 1.14M | 0.90  | 6     | 97.97        | 93.82                    | 93.6
FSL1     | 142.9K  | 1.14M | 0.90  | 6     | 97.76        | 92.55                    | 92.33
*WPM1    | 135.5K  | 1.08M | 0.85  | 8     | 98.39        | 95.82                    | 95.48
WPM2     | 135.5K  | 1.08M | 0.85  | 4     | 97.51        | 89.29                    | 89.13
*NPM1    | 135.0K  | 1.08M | 0.85  | 8     | 97.86        | 92.42                    | 91.90
NPM2     | 135.0K  | 1.08M | 0.85  | 4     | 97.53        | 92.07                    | 91.70
FSM1     | 135.0K  | 1.08M | 0.85  | 4     | 97.62        | 92.33                    | 92.05
*WPS1    | 100.0K  | 0.80M | 0.63  | 5     | 95.32        | 81.64                    | 81.66
WPS2     | 100.0K  | 0.80M | 0.63  | 5     | 95.10        | 72.19                    | 71.64
*NPS1    | 91.3K   | 0.73M | 0.57  | 3     | 94.41        | 62.49                    | 61.96
NPS2     | 91.3K   | 0.73M | 0.57  | 3     | 95.21        | 60.54                    | 61.86
FSS1     | 91.3K   | 0.73M | 0.57  | 1     | 96.88        | 90.67                    | 90.73
Table 2. Model characteristics and accuracy (%) on CIFAR-10.

Model    | FLOPs  | Ratio | Space  | Ratio | Γ_min | Accuracy
FullNets | 653.8K | 1.00  | 512.8K | 1.00  | 10    | 53.63
WPL1     | 619.6K | 0.94  | 482.3K | 0.94  | 8     | 53.12
FSL1     | 619.6K | 0.94  | 476.3K | 0.93  | 8     | 53.08
WPM1     | 587.0K | 0.89  | 451.9K | 0.89  | 6     | 52.66
*WPM2    | 587.0K | 0.89  | 451.9K | 0.89  | 7     | 52.99
*WPM3    | 587.0K | 0.89  | 451.9K | 0.89  | 8     | 54.20
FSM1     | 585.5K | 0.85  | 440.0K | 0.89  | 6     | 51.87
WPS1     | 553.7K | 0.84  | 421.5K | 0.82  | 4     | 51.69
*WPS2    | 553.7K | 0.84  | 421.5K | 0.82  | 7     | 52.20
FSS1     | 551.4K | 0.84  | 403.5K | 0.78  | 4     | 50.96
Our theory suggests that, for a given pruning level, the global loss at convergence depends on the minimum coverage index hyperbolically (i.e., the error term scales as $1/\Gamma_{\min}$), as Theorem 1 indicates. Then, for a given pruning level, a higher minimum coverage index may lower the standing point of convergence, since the error bound contains a term inversely proportional to $\Gamma_{\min}$, which could potentially lead to better performance. Note that existing pruning techniques, and federated learning with heterogeneous models obtained by pruning, always discard the partition in which the parameters are smaller than a certain threshold determined by the pruning policy and pruning level.

To illustrate the importance of the minimum coverage index, we consider the parameters of a model argsorted according to a given pruning technique's policy; four sets $P_1, P_2, P_3, P_4$ are thereby generated, representing the highest 25% partition down to the lowest 25% partition, and a mask generated by an existing pruning technique for a 75%-sparsity model is then defined by $P_1 \cup P_2 \cup P_3$ (the lowest partition $P_4$ is discarded). It is easy to see that $\Gamma_{\min}$ is then directly determined by the number of models with the highest pruning levels, e.g., $\Gamma_{\min} = 4$ for experiment case M: 40% of workers with full models and 60% of workers with 75%-pruned models.
To verify the impact of the minimum coverage index, we propose a way to increase it without changing the network architecture or introducing extra computation overhead: we increase the usage of the parameters below the pruning threshold by letting some of the local models train on the otherwise-pruned partition.
As an example shown in Figure 1, for the model with code name *WPM1, 2 out of the 6 workers with 75%-pruned models use the regular weight-pruning mask, while the other four are assigned masks in which one of the regularly kept partitions is swapped for the otherwise-pruned partition $P_4$ (two workers per mask), so that $\Gamma_{\min} = 8$ is achieved. We denote such designs that maximize the minimum coverage index on top of current pruning techniques with a STAR (*) in the results. For detailed case settings and corresponding pruning techniques, see Appendix 2.
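A sketch of the partition-based mask construction and a coverage-maximizing (STAR-style) assignment in the spirit described above; the particular rotation of kept partitions and the 10-client split are illustrative assumptions, chosen only to reproduce a $\Gamma_{\min} = 4$ versus $\Gamma_{\min} = 8$ contrast rather than the exact configuration used in the experiments.

```python
import numpy as np

def quartile_partitions(x):
    """Split coordinate indices into four equal partitions P1..P4 by descending |x|."""
    order = np.argsort(-np.abs(x))
    return np.array_split(order, 4)

def mask_from_partitions(parts, keep, d):
    """Build a binary mask that keeps the listed partitions."""
    m = np.zeros(d)
    for i in keep:
        m[parts[i]] = 1.0
    return m

rng = np.random.default_rng(3)
x_global = rng.standard_normal(100)
P = quartile_partitions(x_global)
d = x_global.size

# Regular design: every pruned client keeps the same top partitions P1-P3, so the
# lowest partition P4 is only covered by the full models.
regular = [mask_from_partitions(P, [0, 1, 2], d) for _ in range(6)]

# Coverage-maximizing (STAR) design: pruned clients rotate which partition they drop,
# so every partition -- including P4 -- is also trained by several pruned clients.
star = [mask_from_partitions(P, keep, d)
        for keep in ([0, 1, 2], [0, 1, 2], [0, 1, 3], [0, 1, 3], [0, 2, 3], [0, 2, 3])]

full = [np.ones(d)] * 4                                     # four clients keep the full model
for name, pruned in [("regular", regular), ("STAR", star)]:
    gamma_min = int(np.sum(full + pruned, axis=0).min())
    print(name, "Gamma_min =", gamma_min)                   # regular: 4, STAR: 8 (illustrative)
```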
As shown in Figure 2(c), under an identical model setting with the same pruning level, pruning techniques with different minimum coverage indices show different convergence behavior; specifically, the design with the higher minimum coverage index reaches a solution with lower loss within the training round limit. It is also observed in Tables 1 and 2 that such designs reach higher accuracy on both IID and non-IID data, with the improvements being more significant for non-IID data. There are even cases where our design with lower communication and computation costs performs better than a regular design with higher costs, e.g., "*WPM1" over "WPL1" on both IID and non-IID data. More examples and results can be found in Appendix 3.
As suggested by our theory, another key factor that contributes to convergence is the pruning-induced noise $\delta^2$. When the client model is obtained by pruning or a sparse mask, the pruning-induced noise inevitably affects convergence and model accuracy. Given the same minimum coverage index, a smaller pruning-induced noise generally leads to a lower convergence point and potentially a more accurate model.

For this phenomenon, we focus on the Fixed Subnetwork method, which does not involve adaptive changes of the mask, and test higher pruning levels as shown in Figure 2(d), which confirms this trend. As shown in Figure 2(b), all selected pruning methods are affected by the change of pruning level. In Figure 2(f), model WP21 trains with a relatively steady trend, whereas model WP31 becomes unsteady, which could be due to the pruning mask changing before local convergence. This may suggest that, at high pruning levels and without a carefully designed pruning technique, using a fixed subnetwork may yield a more robust and accurate model.
Besides the verification of our theory, we have additionally noticed several phenomena, some of which confirm previous research while others may require further investigation for theoretical support.

In Figure 2(a), under a similar pruning level, PT converges much more slowly than the others with its pretrained mask, consistent with previous observations that models with static sparse parameters converge to a solution with higher loss than models with dynamic sparse training. It also suggests that it is unlikely to find such a lottery ticket (an optimal mask) within limited rounds of training, especially without a carefully designed algorithm.

Although a higher pruning level generally results in a higher training loss, different pruning techniques have different sensitivities towards it. For example, Fixed Subnetwork has a relatively low sensitivity to high pruning levels; this could be because using a static, contiguous mask avoids the situation where the pruning mask changes before local convergence, making it more stable across pruning levels.

Nevertheless, while in most cases our design for increasing the minimum coverage index delivers improvements in model accuracy and reductions in global loss, pruning-induced noise is another key factor to watch: especially at higher pruning levels, a design that merely focuses on increasing the minimum coverage index may not bring significant improvements.

Finally, in Appendix 4 we also show a synthetic special case in which the proposed necessary condition is not met, i.e., the local clients' masks do not jointly cover the whole model; in this situation the model does not learn a usable solution.
In this paper, we establish (for the first time) the sufficient conditions for FL with heterogeneous local models and arbitrary adaptive pruning to converge to a stationary point of standard FL, at a rate of $O(1/\sqrt{Q})$. The analysis applies to general smooth cost functions and recovers a number of important FL algorithms as special cases. It advocates designing pruning strategies with respect to both the minimum coverage index $\Gamma_{\min}$ and the pruning-induced noise $\delta^2$. We further empirically demonstrated the correctness of the theory and the performance of the proposed design. Our work provides a theoretical understanding of FL with heterogeneous clients and dynamic pruning, and presents valuable insights on FL algorithm design, which will be considered in future work.
We summarize the algorithm in a way that presents the convergence analysis more easily. We use a superscript $(k)$, as in $x^{(k)}$, $m^{(k)}$, and $\nabla^{(k)} F_n$, to denote the sub-vector of the parameters, mask, and gradient corresponding to region $k$. In each round $q$, the parameters of each region $k$ are contained in, and only in, a set of local models denoted by $\mathcal{N}_q^{(k)}$, implying that $m_q^{n,(k)} = \mathbf{1}$ for $n \in \mathcal{N}_q^{(k)}$ and $m_q^{n,(k)} = \mathbf{0}$ otherwise. We define $\Gamma_{\min} = \min_{q,k} |\mathcal{N}_q^{(k)}|$ as the minimum coverage index, since it denotes the minimum number of local models that contain the parameters of any region $k$. With a slight abuse of notation, we use $\nabla F_n(x)$ and $\nabla F_n(x, \xi)$ to denote the gradient and the stochastic gradient, respectively.

Assumption 1 (Smoothness). Cost functions $F_1, \ldots, F_N$ are all $L$-smooth: for any $x, y$, we assume that there exists $L > 0$ such that

(11)    $\|\nabla F_n(x) - \nabla F_n(y)\| \le L \|x - y\|, \quad \forall n.$

Assumption 2 (Pruning-induced Error). We assume that for some $\delta^2 \in [0, 1)$ and any $q, n$, the pruning-induced error is bounded by

(12)    $\|x_q - m_q^n \odot x_q\|^2 \le \delta^2 \|x_q\|^2.$

Assumption 3 (Bounded Gradient). The expected squared norm of stochastic gradients is bounded uniformly, i.e., for a constant $G > 0$ and any $q, n, t$:

(13)    $\mathbb{E}\|\nabla F_n(x_q^{n,t}, \xi_q^{n,t})\|^2 \le G^2.$

Assumption 4 (Gradient Noise for IID data). Under the IID data distribution, for any $q, n, t$, we assume that

(14)    $\mathbb{E}[\nabla F_n(x, \xi_q^{n,t})] = \nabla F(x),$
(15)    $\mathbb{E}\|\nabla F_n(x, \xi_q^{n,t}) - \nabla F(x)\|^2 \le \sigma^2,$

where $\sigma^2 > 0$ is a constant and $\xi_q^{n,t}$ are independent samples for different $q, n, t$.

Assumption 5 (Gradient Noise for non-IID data). Under the non-IID data distribution, we assume that for a constant $\sigma^2 > 0$ and any $q, k, t$:

(16)    $\mathbb{E}\Big[\frac{1}{|\mathcal{N}_q^{(k)}|}\sum_{n \in \mathcal{N}_q^{(k)}} \nabla^{(k)} F_n(x, \xi_q^{n,t})\Big] = \nabla^{(k)} F(x),$
(17)    $\mathbb{E}\Big\|\frac{1}{|\mathcal{N}_q^{(k)}|}\sum_{n \in \mathcal{N}_q^{(k)}} \nabla^{(k)} F_n(x, \xi_q^{n,t}) - \nabla^{(k)} F(x)\Big\|^2 \le \sigma^2.$
We now analyze the convergence of heterogeneous FL under adaptive online model pruning with respect to any pruning policy (and the resulting masks $m_q^n$) and prove the main theorems in this paper. We need to overcome a number of challenges, as follows:

We begin the proof by analyzing the change of the loss function in one round as the model goes from $x_q$ to $x_{q+1}$, i.e., $F(x_{q+1}) - F(x_q)$. Each round includes three major steps: pruning to obtain the heterogeneous local models $x_q^{n,0} = m_q^n \odot x_q$, training the local models in a distributed fashion to update $x_q^{n,t}$, and parameter aggregation to update the global model $x_{q+1}$.

Due to the use of heterogeneous local models, whose masks both vary over rounds and change across workers, we first characterize the difference between the local model $x_q^{n,t}$ at any epoch $t$ and the global model $x_q$ at the beginning of the current round. It is easy to see that this difference can be factorized into two parts: the pruning-induced error and the local training updates, which are analyzed in Lemma 1.

We then characterize the impact of heterogeneous local models on the global parameter update. Specifically, we use an ideal local gradient as a reference point and quantify the difference between the aggregated local gradients and the ideal gradient. This is presented in Lemma 2. We also quantify the norm of the difference between a gradient and a stochastic gradient (with respect to the global update step) using the gradient noise assumptions, in Lemma 3.

Since IID and non-IID data distributions in our model differ only in the gradient noise assumption (i.e., Assumption 4 versus Assumption 5), we present a unified proof for both cases. We explicitly distinguish IID and non-IID data distributions only when the two cases require different treatment (when the gradient noise assumptions are needed); otherwise, the derivations and proofs are identical for both cases.
We will begin by proving a number of lemmas and then use them for convergence analysis.
Lemma 1. Under Assumption 2 and Assumption 3, for any $q$, $n$, and $t \le T$, we have:

(18)    $\mathbb{E}\|x_q^{n,t} - x_q\|^2 \le 2\gamma^2 T^2 G^2 + 2\delta^2\,\mathbb{E}\|x_q\|^2.$
Proof. We note that $x_q$ is the global model at the beginning of the current round. We split the difference into two parts: changes due to local model training, $x_q^{n,t} - x_q^{n,0}$, and changes due to pruning, $x_q^{n,0} - x_q$. That is,

(19)    $\|x_q^{n,t} - x_q\|^2 = \|(x_q^{n,t} - x_q^{n,0}) + (x_q^{n,0} - x_q)\|^2 \le 2\|x_q^{n,t} - x_q^{n,0}\|^2 + 2\|x_q^{n,0} - x_q\|^2,$

where we used the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ in the last step.

For the first term in Eq. (19), we notice that $x_q^{n,t}$ is obtained from $x_q^{n,0}$ through $t$ epochs of local model updates on worker $n$. Using the local gradient updates from the algorithm, it is easy to see:

(20)    $\mathbb{E}\|x_q^{n,t} - x_q^{n,0}\|^2 = \gamma^2\,\mathbb{E}\Big\|\sum_{s=0}^{t-1} m_q^n \odot \nabla F_n(x_q^{n,s}, \xi_q^{n,s})\Big\|^2 \le \gamma^2 t \sum_{s=0}^{t-1} \mathbb{E}\big\|m_q^n \odot \nabla F_n(x_q^{n,s}, \xi_q^{n,s})\big\|^2 \le \gamma^2 T^2 G^2,$

where we use the fact that $\|\sum_{s=0}^{t-1} a_s\|^2 \le t \sum_{s=0}^{t-1} \|a_s\|^2$ in step 2 above, and the fact that $m_q^n$ is a binary mask (so that $\|m_q^n \odot g\| \le \|g\|$) in step 3 above, together with Assumption 3 for bounded gradients.

For the second term in Eq. (19), the difference is caused by model pruning using mask $m_q^n$ of worker $n$ in round $q$. We have

(21)    $\|x_q^{n,0} - x_q\|^2 = \|m_q^n \odot x_q - x_q\|^2 \le \delta^2 \|x_q\|^2,$

where we used the fact that $x_q^{n,0} = m_q^n \odot x_q$ in step 1 above, and Assumption 2 in step 2 above. Substituting Eqs. (20) and (21) into Eq. (19) and taking expectations yields the desired result. ∎
Lemma 2. Under Assumptions 1-3, for any $q$, we have:
(22) 
Recall that $|\mathcal{N}_q^{(k)}|$ is the number of local models containing the parameters of region $k$ in round $q$. The left-hand side of Eq. (22) denotes the difference between an average gradient of the heterogeneous models (through aggregation and over time) and an ideal gradient. The summation over $k$ adds up this difference over all regions, because the average gradient takes a different form in different regions.

From the inequality $\|\sum_{i=1}^{M} a_i\|^2 \le M \sum_{i=1}^{M} \|a_i\|^2$, we obtain an upper bound on the aggregated gradient difference. We use this inequality on the left-hand side of Eq. (22) to get:
(23) 
where we relax the inequality by replacing $|\mathcal{N}_q^{(k)}|$ with its smallest value $\Gamma_{\min}$ and changing the summation over $n \in \mathcal{N}_q^{(k)}$ to a summation over all workers in the second step. In the third step, we use the fact that the squared norm of a vector is equal to the sum of the squared norms of all its sub-vectors (i.e., over regions $k$). This allows us to consider the full gradient instead of its sub-vectors on different regions. Finally, the last step follows directly from $L$-smoothness in Assumption 1. Under Assumptions 2-3, we notice that the last step of Eq. (23) is further bounded by Lemma 1, which yields the desired result of this lemma after rearranging terms. ∎
Lemma 3. For the IID data distribution, under Assumption 4, for any $q$, we have:

For the non-IID data distribution, under Assumption 5, for any $q$, we have:

This lemma quantifies the squared norm of the difference between the gradient and the stochastic gradient in the global parameter update. We present results for both the IID and non-IID cases in this lemma, under Assumption 4 and Assumption 5, respectively.

We first consider IID data distributions. Since the samples $\xi_q^{n,t}$ are independent of each other for different $n$ and $t$, the differences between the stochastic gradients and their expectations are independent gradient noise terms, which have zero mean by Assumption 4. Using the fact that $\mathbb{E}\|\sum_i z_i\|^2 = \sum_i \mathbb{E}\|z_i\|^2$ for zero-mean and independent $z_i$, we get: