On the Convergence of Heterogeneous Federated Learning with Arbitrary Adaptive Online Model Pruning

01/27/2022
by   Hanhan Zhou, et al.
Tsinghua University

One of the biggest challenges in Federated Learning (FL) is that client devices often have drastically different computation and communication resources for local updates. To this end, recent research efforts have focused on training heterogeneous local models obtained by pruning a shared global model. Despite empirical success, theoretical guarantees on convergence remain an open question. In this paper, we present a unifying framework for heterogeneous FL algorithms with arbitrary adaptive online model pruning and provide a general convergence analysis. In particular, we prove that under certain sufficient conditions and on both IID and non-IID data, these algorithms converge to a stationary point of standard FL for general smooth cost functions, with a convergence rate of O(1/√(Q)). Moreover, we illuminate two key factors impacting convergence: pruning-induced noise and minimum coverage index, advocating a joint design of local pruning masks for efficient training.


1 Introduction

Federated Learning (FL) allows distributed clients to collaborate and train a centralized global model without the transmission of local data. In practice, mobile and edge devices that are equipped with drastically different computation and communication capabilities are becoming the dominant source for FL [lim2020federated]. This has prompted significant recent attention to a family of FL algorithms focusing on training heterogeneous local models (often obtained through pruning a global model). It includes algorithms like HeteroFL [diao2021heterofl] that employ heterogeneous local models with fixed structures, algorithms utilizing pre-trained local models like [frankle2019lottery], as well as algorithms like PruneFL [jiang2020model] that update local models adaptively during training. However, the success of these algorithms has only been demonstrated empirically (e.g., [diao2021heterofl, jiang2020model]). Unlike standard FL that has received rigorous theoretical analysis [wang2018cooperative, bonawitz2019towards, yu2019parallel, convergenceNoniid], the convergence of heterogeneous FL with adaptive online model pruning is still an open question. Little is known about whether such algorithms converge to a solution of standard FL.

To answer these questions, in this paper we present a unifying framework for heterogeneous FL algorithms with arbitrary adaptive online model pruning and provide a general convergence analysis. There have been many existing efforts establishing convergence guarantees for FL algorithms, such as the popular FedAvg [fedavg], on both IID and non-IID data distributions (throughout this paper, "non-IID data" means that the data among local clients are not independent and identically distributed), but all rely on the assumption that a single uniform model structure is used on all client devices. By considering arbitrary pruning strategies in our framework, we formally establish the convergence conditions for a general family of FL algorithms with both (i) heterogeneous local models to accommodate different resource constraints on client devices and (ii) time-varying local models to continuously refine pruning results during training. We prove that these FL algorithms with arbitrary pruning strategies satisfying certain sufficient conditions can indeed converge (at a speed of $O(1/\sqrt{Q})$, where $Q$ is the number of communication rounds) to a stationary point of standard FL for general smooth cost functions.

To the best of our knowledge, this is the first convergence analysis for heterogeneous FL with arbitrary adaptive online model pruning. The framework captures a number of existing FL algorithms as important special cases and provides a general convergence guarantee for them, including HeteroFL [diao2021heterofl] that employs fixed-structure local models, PruneFL [jiang2020model] that requires periodically training a full-size model, and S-GaP [ma2021effective] that can be viewed as a single-client version. Moreover, we show that the convergence gap is affected by both pruning-induced noise (modeled through a constant $\delta^2$) and a new notion of minimum coverage index $\Gamma_{\min}$ (i.e., every parameter in the global model is covered by at least $\Gamma_{\min}$ local models). In particular, it advocates a joint design of efficient local-model pruning strategies (e.g., leveraging [wen2016learning, li2016pruning, ciresan2011flexible]) for efficient training. Our results provide solid theoretical support for designing heterogeneous FL algorithms with efficient pruning strategies, while ensuring convergence similar to standard FL.

We carried out extensive experiments on two datasets, which suggest that for a given level of model sparsity, client models should also be designed to maximize the coverage index rather than only keeping the largest parameters through pruning. As an example, a federated learning network with 85% sparsity obtained via our design to maximize the coverage index achieves up to 8% improvement compared to the network generated by pruning with an identical model architecture, without posing any additional computation overhead.

In summary, our paper makes the following key contributions:

  • We propose a unifying framework for heterogeneous FL with arbitrary adaptive online model pruning. It captures a number of existing algorithms (whose success has been empirically demonstrated) as special cases and allows convergence analysis.

  • The general convergence of these algorithms is established. On both IID and non-IID data, we prove that under standard assumptions and certain sufficient conditions on the pruning strategy, the algorithms converge to a stationary point of standard FL for smooth cost functions.

  • We further analyze the impact of the key factors contributing to convergence and advocate a joint design of local pruning masks with respect to both the pruning-induced error and a notion of minimum coverage index. The results are validated on the MNIST and CIFAR-10 datasets.

2 Background

Standard Federated Learning A standard Federated Learning problem considers a distributed optimization for N clients:

$\min_{W} F(W) := \sum_{i=1}^{N} p_i F_i(W), \qquad F_i(W) = \mathbb{E}_{\xi \sim D_i} \big[ L_i(W; \xi) \big].$   (1)

Here $W$ is a set of trainable weights/parameters, $F_i(W)$ is a cost function defined on data set $D_i$ with respect to a user-specified loss function $L_i$, and $p_i$ is the weight for the $i$-th client such that $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$.

The FL procedure, e.g., FedAvg [fedavg], typically consists of a sequence of stochastic gradient descent steps performed distributedly on each local objective, followed by a central step collecting the workers' updated local parameters and computing an aggregated global parameter. For the $q$-th round of training, the central server first broadcasts the latest global model parameters $w^q$ to the clients, who perform local updates as follows:

$w_i^{q,t} = w_i^{q,t-1} - \gamma \nabla F_i(w_i^{q,t-1}, \xi_i^{q,t}), \quad t = 1, \ldots, T, \quad \text{with } w_i^{q,0} = w^q,$

where $\gamma$ is the local learning rate. After all available clients have concluded their local updates (in $T$ epochs), the server aggregates the parameters from them and generates the new global model for the next round, i.e., $w^{q+1} = \sum_{i=1}^{N} p_i w_i^{q,T}$.

The formulation captures FL with both IID and non-IID data distributions.
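To make the procedure concrete, the following is a minimal NumPy sketch of one FedAvg round under the notation above; the toy quadratic loss, `grad_fn`, and the per-client data are placeholders of our own, not the experimental setup used later in the paper.

import numpy as np

def fedavg_round(w_global, client_data, grad_fn, p, lr=0.1, epochs=5):
    """One FedAvg round: broadcast w^q, run local SGD, average with weights p_i."""
    local_ws = []
    for D_i in client_data:
        w = w_global.copy()                          # broadcast the global model
        for _ in range(epochs):                      # T local SGD steps
            xi = D_i[np.random.randint(len(D_i))]    # sample a local data point
            w = w - lr * grad_fn(w, xi)              # w^{q,t} = w^{q,t-1} - γ ∇F_i(w^{q,t-1}, ξ)
        local_ws.append(w)
    # central aggregation: w^{q+1} = Σ_i p_i w_i^{q,T}
    return sum(p_i * w_i for p_i, w_i in zip(p, local_ws))

# toy usage: each client holds one point c_i with F_i(w) = ||w - c_i||^2 / 2
client_data = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
grad_fn = lambda w, xi: w - xi                       # ∇F_i(w, ξ) for the toy loss
w = np.zeros(2)
for _ in range(50):
    w = fedavg_round(w, client_data, grad_fn, p=[0.5, 0.5])
print(w)                                             # approaches [0.5, 0.5], the global minimizer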

Model Pruning. Model pruning via weight and connection pruning is one of the promising methods to enable efficient neural networks: it sets a proportion of the weights and biases to zero and thus reduces both computation and memory usage. Most works on weight pruning require three phases of training: a pre-training phase, a pruning-to-sparse phase, and a fine-tuning phase. Consider a neural network $f(W; x)$ with parameters $W$ and input data $x$. The pruning process takes $W$ as input and generates a new model $W \odot m$, where $m$ is a binary mask denoting which parameters are set to zero and $\odot$ denotes element-wise multiplication. The pruning mask is computed from a certain pruning policy $P$, e.g., layer-wise parameter pruning that removes weights below a certain percentile, or neuron pruning that removes neurons with small average weights. We use $W \odot m$ to denote the pruned model, which has a reduced model size and is more efficient for communication and training.
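As a concrete illustration, here is a small NumPy sketch of magnitude-based weight pruning, one instance of the pruning policies described above; the density value and layer size are arbitrary choices for the example.

import numpy as np

def magnitude_mask(w, density=0.75):
    """Binary mask m keeping the `density` fraction of weights with largest magnitude."""
    k = max(1, int(round(density * w.size)))
    thresh = np.sort(np.abs(w).ravel())[-k]          # k-th largest magnitude
    return (np.abs(w) >= thresh).astype(w.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal(1000)                        # parameters of one layer
m = magnitude_mask(W, density=0.75)                  # prune the smallest 25% of weights
W_pruned = W * m                                     # pruned model W ⊙ m
print(int(m.sum()), "weights kept out of", W.size)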

3 Related Work

Federated Averaging and Communication-Efficient FL. FedAvg [fedavg] is considered the first and the most commonly used federated learning algorithm, where in each round of training local clients train on their own data and their parameters are averaged at the central server. FedAvg reduces communication costs by training clients locally for multiple epochs per round. Several works have shown the convergence of FedAvg under different settings, with both homogeneous (IID) data [wang2018cooperative, woodworth2018graph] and heterogeneous (non-IID) data [convergenceNoniid, bonawitz2019towards, yu2019parallel], even with partial client participation. Specifically, [yu2019parallel] demonstrated that LocalSGD achieves convergence for non-convex optimization, and [convergenceNoniid] established convergence rates for FedAvg on strongly convex problems in terms of Q, the number of SGD iterations, and N, the number of participating clients. Several works [karimireddy2020scaffold, wang2019adaptive, wang2019adaptive22] have been proposed to further reduce the communication costs. One direction is to use data compression such as quantization [konevcny2016federated, bonawitz2019towards, mao2021communication, yao2021fedhm], sketching [alistarh2017qsgd, ivkin2019communication], split learning [thapa2020splitfed], and learning with gradient sparsity [han2020adaptive]. This line of work does not consider computation efficiency.

Neural Network Pruning and Sparsification. Neural network pruning is a popular research topic for reducing the computation costs of a neural network. A magnitude-based prune-from-dense methodology [han2015learning, guo2016dynamic, yu2018nisp, liu2018rethinking, real2019regularized] is widely used, where weights smaller than a certain preset threshold are removed from the network. In addition, there are one-shot pruning at initialization [lee2018snip], iterative pruning approaches [zhu2017prune, narang2017exploring], and adaptive pruning approaches [lin2020dynamic, ma2021effective] that allow the network to grow and prune. In [frankle2019lottery, morcos2019one] a "lottery ticket hypothesis" was proposed: given an optimal substructure of the neural network acquired by weight pruning, directly training the pruned model can reach results similar to pruning a pre-trained network. Another direction is sparse mask exploration [bellec2017deep, mostafa2019parameter, evci2020rigging], where sparsity is maintained throughout the training process while the set of nonzero weights is explored based on random or heuristic methods. [frankle2019lottery, mostafa2019parameter] empirically observed that training models with static sparse parameters converges to a solution with higher loss than models with dynamic sparse training. Note that efficient sparse matrix multiplication sometimes requires special libraries or hardware, e.g., the sparse tensor cores in the NVIDIA A100 GPU, to achieve actual reductions in memory footprint and computational resources.

Efficient FL with Heterogeneous Neural Networks. Several works have been proposed to reduce both computation and communication costs, including approaches that utilize lossy compression and dropout techniques [caldas2018expanding, xu2019elfish]. Although early works mainly assume that all local models share the same architecture as the global model [li2020federated], recent works have empirically demonstrated that federated learning with heterogeneous client models to save both computation and communication is feasible. PruneFL [jiang2020model] proposed an approach with adaptive parameter pruning during federated learning. [li2021fedmask] proposed federated learning with personalized and structured sparse masks. HeteroFL [diao2021heterofl] proposed to generate heterogeneous local models as subnets of the global network by picking the leading continuous parameters layer-wise, with the help of a proposed static batch normalization, while [li2021hermes] finds small subnetworks by applying structured pruning. Despite their empirical success, these methods lack theoretical convergence guarantees even in convex optimization settings.

4 Our Main Results

We rigorously analyze the convergence of heterogeneous FL under arbitrary adaptive online model pruning and establish the conditions for converging to a stationary point of standard FL with general smooth cost functions. The theoretical results in this paper not only illuminate key convergence properties but also provide solid support for designing adaptive pruning strategies in heterogeneous FL algorithms.

4.1 FL under Arbitrary Adaptive Online Pruning

In this paper, we focus on a family of FL algorithms that leverage adaptive online model pruning to train heterogeneous local models on distributed clients. By considering arbitrary pruning strategies in our formulation, it relaxes a number of key limitations in standard FL: (i) Pruning masks are allowed to be time-varying, enabling online adjustment of pruned local models during the entire training process. (ii) The pruning strategies may vary for different clients, making it possible to optimize the pruned local models with respect to individual clients' heterogeneous computing resources and network conditions. More precisely, we use a series of masks $m_i^q \in \{0,1\}^{|W|}$ to model an adaptive online pruning strategy that may change the pruning mask for any round $q$ and any client $i$. Let $w^q$ denote the global model at the beginning of round $q$ and $\odot$ be the element-wise product. Thus, $w^q \odot m_i^q$ defines the trainable parameters of the pruned local model for client $i$ in round $q$. (While a pruned local model has a smaller number of parameters than the global model, we adopt the notation in [a, b] and use $w^q \odot m_i^q$ with an element-wise product to denote the pruned local model: only parameters corresponding to a 1-value in the mask are accessible and trainable in the local model.)

Here, we describe one round (say the $q$-th) of the algorithm. First, the central server employs a pruning function to prune the latest global model and broadcasts the resulting local models to the clients:

$w_i^{q,0} = w^q \odot m_i^q, \quad \forall i.$   (2)

Each client $i$ then trains the pruned local model by performing $T$ local updates, for $t = 1, \ldots, T$:

$w_i^{q,t} = w_i^{q,t-1} - \gamma \, m_i^q \odot \nabla F_i(w_i^{q,t-1}, \xi_i^{q,t}),$   (3)

where $\gamma$ is the learning rate and $\xi_i^{q,t}$ are independent samples uniformly drawn from local data $D_i$. We note that $\nabla F_i(w_i^{q,t-1}, \xi_i^{q,t})$ is a local stochastic gradient evaluated using only the local parameters in $w_i^{q,t-1}$ (due to pruning) and that only locally trainable parameters are updated by the stochastic gradient (due to the element-wise product with mask $m_i^q$).

Finally, the central server aggregates the local models and produces an updated global model $w^{q+1}$. Due to the use of arbitrary pruning masks in this paper, different global parameters are broadcast to and updated at different subsets of clients. To this end, we partition the global model into disjoint regions, such that the parameters of region $n$, denoted by $w^{(n)}$, are included in, and only in, the same subset of local models. Let $\mathcal{N}_n^q$ be the set of clients whose local models contain the parameters of region $n$ in round $q$ (clearly $\mathcal{N}_n^q$ is determined by the pruning masks, since $m_i^{q,(n)} = \mathbf{1}$ for $i \in \mathcal{N}_n^q$ and $m_i^{q,(n)} = \mathbf{0}$ otherwise). The global model update of region $n$ is performed by aggregating the local models at clients $\mathcal{N}_n^q$, i.e.,

$w^{q+1,(n)} = \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} w_i^{q,T,(n)}, \quad \forall n.$   (4)

We summarize the algorithm in Algorithm 1.

Remark 1. We hasten to note that our framework captures heterogeneous FL with arbitrary adaptive online pruning strategies, and so does our convergence analysis. It recovers many recently proposed FL algorithms as special cases of our framework with arbitrary masks, including HeteroFL [diao2021heterofl] that uses fixed masks over time, PruneFL [jiang2020model] that periodically trains a full-size local model, Prune-and-Grow [ma2021effective] that can be viewed as a single-client algorithm without parameter aggregation, as well as FedAvg [fedavg] that employs full-size local models at all clients. Our unifying framework provides solid support for incorporating arbitrary model pruning strategies (such as weight or neuron pruning, CNN pruning, and sparsification) into heterogeneous FL algorithms. Our analysis establishes general conditions for any heterogeneous FL with arbitrary adaptive online pruning to converge to standard FL.

Input: Local data $D_i$ on $N$ clients, pruning policy $P$.
Executes:
Initialize global model $w^0$
for round $q = 0, 1, \ldots, Q-1$ do
       for local workers $i = 1, \ldots, N$ (in parallel) do
             Generate mask $m_i^q$ from pruning policy $P$
             Prune $w_i^{q,0} = w^q \odot m_i^q$
             Update local models:
             for epoch $t = 1, \ldots, T$ do
                    $w_i^{q,t} = w_i^{q,t-1} - \gamma \, m_i^q \odot \nabla F_i(w_i^{q,t-1}, \xi_i^{q,t})$
             end for
       end for
       Update global model:
       for region $n$ do
             Find $\mathcal{N}_n^q = \{i : m_i^{q,(n)} = \mathbf{1}\}$
             Update $w^{q+1,(n)} = \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} w_i^{q,T,(n)}$
       end for
end for
Output $w^Q$
Algorithm 1 Our unifying framework.
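The sketch below gives a minimal NumPy version of one round of Algorithm 1 for flat parameter vectors; `local_grad` and the toy loss are placeholders of our own, and the region-wise aggregation of Eq. (4) is implemented coordinate-wise, which is equivalent since each coordinate belongs to exactly one region.

import numpy as np

def hetero_fl_round(w_global, masks, local_grad, lr=0.1, epochs=5):
    """One round of heterogeneous FL with arbitrary binary masks (sketch of Algorithm 1)."""
    local_ws = []
    for i, m in enumerate(masks):
        w = w_global * m                                  # prune: w_i^{q,0} = w^q ⊙ m_i^q
        for _ in range(epochs):
            w = w - lr * m * local_grad(i, w)             # masked local SGD step, Eq. (3)
        local_ws.append(w)

    coverage = np.stack(masks).sum(axis=0)                # |N_n^q| for each coordinate
    assert coverage.min() >= 1, "minimum coverage index must be at least 1"
    # Eq. (4): average each coordinate over the local models that contain it
    # (coordinates outside a client's mask stay zero, so the plain sum works)
    return np.stack(local_ws).sum(axis=0) / coverage

# toy usage: 3 clients, 4 parameters, every local loss F_i(w) = ||w - target||^2 / 2 (IID-like)
masks = [np.array([1., 1., 1., 1.]),                      # full model
         np.array([1., 1., 0., 0.]),
         np.array([0., 0., 1., 1.])]
target = np.array([1., 2., 3., 4.])
local_grad = lambda i, w: w - target
w = np.zeros(4)
for _ in range(200):
    w = hetero_fl_round(w, masks, local_grad)
print(w)                                                  # approaches [1, 2, 3, 4]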

4.2 Notations and Assumptions

We make the following assumptions on the cost functions $F_i$. Assumption 1 is standard. Assumption 2 follows from [ma2021effective] and implies that the noise introduced by pruning is relatively small and bounded. Assumptions 3 and 4 are standard for FL convergence analysis, following [zhang2013communication, stich2018local, yu2019parallel, convergenceNoniid], and assume the stochastic gradients to be bounded and unbiased.

Assumption 1.

(Smoothness). Cost functions $F_1, \ldots, F_N$ are all L-smooth: for any $i$ and any $w, w'$, we assume that there exists $L > 0$ such that

$\|\nabla F_i(w) - \nabla F_i(w')\| \le L \|w - w'\|.$   (5)

Assumption 2.

(Pruning-induced Noise). We assume that for some $\delta^2 \in [0, 1)$ and any $q, i$, the pruning-induced error is bounded by

$\|w^q \odot m_i^q - w^q\|^2 \le \delta^2 \|w^q\|^2.$   (6)

Assumption 3.

(Bounded Gradient). The expected squared norm of stochastic gradients is bounded uniformly, i.e., for a constant $G > 0$ and any $i, q, t$:

$\mathbb{E} \|\nabla F_i(w_i^{q,t}, \xi_i^{q,t})\|^2 \le G^2.$   (7)

Assumption 4.

(Gradient Noise for IID data). Under IID data distribution, for any $i, q, t$, we assume that

$\mathbb{E} [\nabla F_i(w_i^{q,t}, \xi_i^{q,t})] = \nabla F(w_i^{q,t}),$   (8)
$\mathbb{E} \|\nabla F_i(w_i^{q,t}, \xi_i^{q,t}) - \nabla F(w_i^{q,t})\|^2 \le \sigma^2,$   (9)

for a constant $\sigma^2 > 0$ and independent samples $\xi_i^{q,t}$.

4.3 Convergence Analysis

We now analyze heterogeneous FL under arbitrary adaptive online pruning. To the best of our knowledge, this is the first proof that shows general convergence for this family of algorithms to a stationary point of standard FL (in Section 2) with smooth cost functions. We will first show convergence for IID data distributions and then, by replacing Assumption 4 with a similar Assumption 5, show convergence for non-IID data distributions. We define an important value:

$\Gamma_{\min} = \min_{n, q} |\mathcal{N}_n^q|,$   (10)

referred to in this paper as the minimum covering index. Since $|\mathcal{N}_n^q|$ is the number of local models containing the parameters of region $n$, $\Gamma_{\min}$ measures the minimum occurrence of any parameter in the local models. Intuitively, if a parameter is never included in any local model, it is impossible for it to be updated. Thus, conditions based on the covering index are necessary for convergence to standard FL. Our analysis establishes sufficient conditions for convergence. All proofs are collected in the Appendix.
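For concreteness, the minimum covering index of a given set of masks can be computed as follows (a small NumPy helper; the toy masks are illustrative only).

import numpy as np

def min_coverage_index(masks):
    """Γ_min: the smallest number of local models that contain any single parameter."""
    coverage = np.stack(masks).sum(axis=0)    # per-parameter count |N_n^q|
    return int(coverage.min())

# toy example: 4 clients, 8 parameters
masks = [np.array([1, 1, 1, 1, 1, 1, 1, 1]),  # full model
         np.array([1, 1, 1, 1, 0, 0, 0, 0]),
         np.array([1, 1, 0, 0, 1, 1, 0, 0]),
         np.array([1, 0, 1, 0, 1, 0, 1, 0])]
print(min_coverage_index(masks))              # -> 1: the last parameter appears only in the full model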

Theorem 1.

Under Assumptions 1-4 and for arbitrary pruning satisfying $\Gamma_{\min} \ge 1$, heterogeneous FL with adaptive online pruning converges to a stationary point of standard FL at a rate of $O(1/\sqrt{Q})$, up to an error term induced by pruning, where the constants in the bound depend on the initial model parameters and the gradient noise.

Remark 2. Theorem 1 shows convergence to a stationary point of standard FL as long as $\Gamma_{\min} \ge 1$ (albeit with some pruning-induced noise). The result is a bit surprising, since $\Gamma_{\min} \ge 1$ only requires every parameter to be included in at least one local model (which is necessary for all parameters to be updated during training). But we show that this is also a sufficient condition for convergence to standard FL. Moreover, we establish a convergence rate of $O(1/\sqrt{Q})$ for arbitrary pruning strategies satisfying this condition.

Remark 3. Impact of pruning-induced noise. In Assumption 2, we assume that the pruning-induced noise is relatively small and bounded with respect to the global model: $\|w^q \odot m_i^q - w^q\|^2 \le \delta^2 \|w^q\|^2$. This is satisfied in practice since most pruning strategies tend to eliminate weights/neurons that are insignificant, thereby keeping $\delta^2$ small. We note that pruning incurs an error term in our convergence analysis, which is proportional to $\delta^2$ and the average model norm (averaged over the $Q$ rounds). It implies that more aggressive pruning in heterogeneous FL may lead to a larger error, deviating from standard FL at a speed quantified by $\delta^2$. We note that this error is affected by both $\delta^2$ and $\Gamma_{\min}$.
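A quick numerical illustration of this effect, using magnitude pruning on a random Gaussian weight vector (illustrative values only, not from the paper's experiments): the relative error $\delta^2$ stays small under mild pruning and grows quickly as pruning becomes more aggressive.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)                       # stand-in for a global model w^q
for density in (0.9, 0.75, 0.5, 0.25):                # fraction of weights kept
    k = int(density * w.size)
    thresh = np.sort(np.abs(w))[-k]
    m = (np.abs(w) >= thresh).astype(float)           # magnitude-based mask
    delta_sq = np.linalg.norm(w * m - w) ** 2 / np.linalg.norm(w) ** 2
    print(f"density {density:.2f}: pruning-induced relative error δ² ≈ {delta_sq:.3f}")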

Remark 4. Impact of minimum covering index $\Gamma_{\min}$. It turns out that the minimum number of occurrences of any parameter in the local models is a key factor deciding convergence. As $\Gamma_{\min}$ increases, both the constants and the convergence error are inversely proportional to $\Gamma_{\min}$. This result is a bit counter-intuitive, since certain parameters may seem small enough to ignore in pruning. However, recall that our analysis shows convergence of all parameters in $W$ to a stationary point of standard FL (rather than of a subset of parameters or to a random point). The more times a parameter is covered by local models, the sooner it gets updated and converges to the desired target. This is quantified in our analysis by showing that the error term due to pruning noise decreases at a rate of $1/\Gamma_{\min}$.

Remark 5. When the cost function is strongly convex (e.g., for softmax classifiers, logistic regression, and linear regression with $\ell_2$-regularization), a stationary point becomes the global optimum. Thus, Theorem 1 shows convergence to the global optimum of standard FL for strongly convex cost functions.

Remark 6. Theorem 1 inspires new designs of adaptive online pruning for heterogeneous FL. Since the convergence gap is affected by both the pruning-induced noise $\delta^2$ and the minimum covering index $\Gamma_{\min}$, we may want to design pruning masks that preserve the largest parameters while sufficiently covering all parameters across different local models. The example in Figure 1 illustrates three alternative pruning strategies for the clients. It can be seen that to achieve the best performance in heterogeneous FL, pruning masks need to be optimized to both mitigate noise and achieve a high covering index. Due to space limitations, optimal pruning mask design with respect to clients' resource constraints will be considered in future work. We present numerical examples with different pruning mask designs (with improved performance across pruning levels) in Section 5 to support this observation.

Figure 1: Illustration of our method. (a): Existing methods utilizing pruning always discard the parameters below the threshold. (b, c): Our method utilizes different partitions to obtain a higher $\Gamma_{\min}$. Note that FL with these three settings has nearly identical communication and computation costs.

When the data distribution is non-IID, we need a stronger assumption to ensure that stochastic gradients computed on a subset of clients' datasets still provide an unbiased estimate for each parameter region. To this end, we replace Assumption 4 with a similar Assumption 5 for non-IID data.

Assumption 5.

(Gradient Noise for non-IID data). Under non-IID data distribution, we assume that for a constant $\sigma^2 > 0$, any $q$ and region $n$, and any $w$, the stochastic gradients of the clients in $\mathcal{N}_n^q$ jointly provide an unbiased estimate of the corresponding region of the global gradient with bounded variance:

$\mathbb{E} \Big[ \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} \nabla F_i^{(n)}(w, \xi_i) \Big] = \nabla F^{(n)}(w),$
$\mathbb{E} \Big\| \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} \nabla F_i^{(n)}(w, \xi_i) - \nabla F^{(n)}(w) \Big\|^2 \le \sigma^2.$

Theorem 2.

Under Assumptions 1-3 and 5, heterogeneous FL with an arbitrary adaptive online pruning strategy satisfying $\Gamma_{\min} \ge 1$ converges to a stationary point of standard FL at a rate of $O(1/\sqrt{Q})$, up to an error term induced by pruning, where the constants are defined analogously to those in Theorem 1.

Remark 7. With Assumption 5, the convergence under non-IID data distributions is very similar to that in Theorem 1, except for different constants. Thus, most remarks made for Theorem 1, including those on convergence speed, pruning-induced noise, and pruning mask design, still apply. We notice that averaging over multiple local models no longer reduces the statistical noise in the convergence error. This is because the stochastic gradients computed by different clients in $\mathcal{N}_n^q$ are now based on different datasets and jointly provide an unbiased estimate, no longer resulting in smaller statistical noise.

Remark 8. We note that Assumption 5 can be satisfied in practice by jointly designing the pruning masks and the data partitions among the clients. For example, for clients holding different local datasets, we can design the pruning masks so that the clients covering each region hold data that together provide an unbiased gradient estimate for that region. It is easy to show that such designs satisfy the gradient noise assumption for non-IID data distributions. Due to space limitations, optimal pruning mask design based on data partitioning will be considered in future work. Nevertheless, we present numerical examples under non-IID data distributions with different pruning mask designs in Section 5. When the conditions in Theorem 2 are satisfied, we observe convergence and significant improvement in performance.

5 Experiments

5.1 Experiment settings

In this section we evaluate different pruning techniques from state-of-the-art designs and verify the proposed theory under our unifying pruning framework using two datasets. Unless stated otherwise, the reported accuracy is averaged over three random seeds with the same randomly initialized starting model. We focus on three points in our experiments: (i) the general convergence of federated learning with heterogeneous models obtained by pruning, (ii) the impact of the minimum coverage index $\Gamma_{\min}$, and (iii) the impact of the pruning-induced noise $\delta^2$. The experimental results provide a comprehensive comparison among several pruning techniques, with and without our new design, to verify the correctness of our theory.

We examine the theoretical results on two common image classification datasets, MNIST [minist] and CIFAR-10 [krizhevsky2009learning], among distributed workers with IID and non-IID data under a fixed participation ratio. For IID data, we follow the design of balanced MNIST by [convergenceNoniid], and similarly obtain a balanced CIFAR-10. For non-IID data, we obtain a balanced partition with skewed label distribution, where the samples on each device come from at most two out of the ten possible classes.
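The non-IID split can be generated with a standard shard-based label-skew partition, sketched below; this is a common construction consistent with the description above, not necessarily the exact script used for our experiments.

import numpy as np

def label_skew_partition(labels, n_clients=10, shards_per_client=2, seed=0):
    """Balanced label-skew split: sort indices by label, cut them into equal shards,
    and give each client `shards_per_client` shards, so each client sees at most
    that many of the ten classes."""
    rng = np.random.default_rng(seed)
    order = np.argsort(labels, kind="stable")                 # group sample indices by class
    shards = np.array_split(order, n_clients * shards_per_client)
    assignment = rng.permutation(n_clients * shards_per_client)
    return [np.concatenate([shards[s] for s in
                            assignment[i * shards_per_client:(i + 1) * shards_per_client]])
            for i in range(n_clients)]

# example with synthetic labels: 10 classes, 600 samples each
labels = np.repeat(np.arange(10), 600)
parts = label_skew_partition(labels)
print([len(p) for p in parts])                                # balanced: 600 samples per client
print([np.unique(labels[p]).size for p in parts])             # at most 2 classes per client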

5.2 Baselines and Test Case Notations

To empirically verify the correctness of our theory, we pick FedAvg, which can be considered federated learning with full local models, and four other pruning techniques from state-of-the-art federated learning designs with heterogeneous models as baselines. (FullNets can be considered as FedAvg [fedavg] without any pruning. For notational simplicity, we use "WP" for weight pruning as used in PruneFL [jiang2020model], "NP" for neuron pruning as used in [shao2019privacy], "FS" for fixed sub-network as used in HeteroFL [diao2021heterofl], and "PT" for pruning with a pre-trained mask as used in [frankle2019lottery].) Let $s$ be the sparsity of mask $m$; here $s$ denotes the fraction of weights retained, e.g., $s = 0.75$ for a model with 25% of its weights pruned, consistent with the FLOPs and space ratios reported for MNIST in Table 1. Due to page limits, we show selected combinations over three pruning levels named L (Large), M (Medium), and S (Small): L. 60% of workers with the full model and 40% with a 75% pruned model; M. 40% of workers with the full model and 60% with a 75% pruned model; S. 10% of workers with the full model, 30% with a 75% pruned model, and 60% with a 50% pruned model.
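For reference, the three pruning levels correspond to the following worker-to-model-size allocations among the 10 workers selected per round (here "density" is the fraction of weights a mask retains; the dictionary form is our own shorthand, not code from the paper).

# Fraction of weights retained per local model -> number of the 10 selected workers
PRUNING_LEVELS = {
    "L": {1.00: 6, 0.75: 4},           # 60% full models, 40% pruned to 75%
    "M": {1.00: 4, 0.75: 6},           # 40% full models, 60% pruned to 75%
    "S": {1.00: 1, 0.75: 3, 0.50: 6},  # 10% full, 30% at 75%, 60% at 50%
}

def densities_for_round(level):
    """Expand a level specification into one density per participating worker."""
    return [d for d, count in PRUNING_LEVELS[level].items() for _ in range(count)]

print(densities_for_round("S"))
# -> [1.0, 0.75, 0.75, 0.75, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]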

In each round, 10 devices are randomly selected to run a fixed number of local SGD steps. We evaluate the averaged model after each global aggregation on the corresponding global objective and show the global loss in Figure 2. We present the key model characteristics as well as the model accuracy after training on MNIST (IID and non-IID) and CIFAR-10 in Table 1 and Table 2. FLOPs and Space stand for the amortized FLOPs of one local step and the memory space needed to store the parameters of one model, with their ratios representing the corresponding communication and computation savings compared with FedAvg, which uses a full-size model.

In the experiments, NP, FS, WP, and PT use the same architecture, but the latter two (WP and PT) are trained with sparse parameters, while FS and NP are trained on an actually reduced network size. To better exemplify and examine the results, we run all experiments on small model architectures: an MLP with a single hidden layer for MNIST and a LeNet-5-like network with two convolutional layers for CIFAR-10. Since some large DNN models have been shown to maintain their performance under a reasonable level of pruning, we use smaller networks to avoid the potential influence of very large networks, as well as of additional tricks and model characteristics specific to each framework. More details regarding the models and experiment design can be found in Appendix 2.

More results, including other possible combinations, pruning details with analysis, and other experiment details, can be found in Appendix 3.

Figure 2: Selected experimental results on the MNIST dataset (IID and non-IID) under different pruning settings. (a) Different pruning techniques: at a similar pruning level, PT converges much more slowly than the others, as a lottery ticket for an optimal mask was not found within the limited number of training rounds. (b) Impact of pruning level: a higher pruning level generally leads to a higher loss. (c) Impact of $\Gamma_{\min}$: by applying our design to increase the coverage index, models with identical architecture reach a solution with lower loss for both selected pruning techniques, without additional computational overhead. (d) Impact of mask error: the relative error introduced by pruning is another key factor for convergence. (e, f) Similar findings are also observed under these schemes with heterogeneous (non-IID) data.
Model    | FLOPs   | Space | Ratio | Γmin | Accuracy (IID) | Accuracy (Non-IID, Local) | Accuracy (Non-IID, Global)
FullNets | 158.8K  | 1.27M | 1.00  | 10   | 98.01 | 93.82 | 93.59
WP-L1    | 143.12K | 1.15M | 0.90  | 6    | 98.18 | 95.49 | 95.15
NP-L1    | 142.9K  | 1.14M | 0.90  | 6    | 97.97 | 93.82 | 93.6
FS-L1    | 142.9K  | 1.14M | 0.90  | 6    | 97.76 | 92.55 | 92.33
*WP-M1   | 135.5K  | 1.08M | 0.85  | 8    | 98.39 | 95.82 | 95.48
WP-M2    | 135.5K  | 1.08M | 0.85  | 4    | 97.51 | 89.29 | 89.13
*NP-M1   | 135.0K  | 1.08M | 0.85  | 8    | 97.86 | 92.42 | 91.90
NP-M2    | 135.0K  | 1.08M | 0.85  | 4    | 97.53 | 92.07 | 91.70
FS-M1    | 135.0K  | 1.08M | 0.85  | 4    | 97.62 | 92.33 | 92.05
*WP-S1   | 100.0K  | 0.80M | 0.63  | 5    | 95.32 | 81.64 | 81.66
WP-S2    | 100.0K  | 0.80M | 0.63  | 5    | 95.10 | 72.19 | 71.64
*NP-S1   | 91.3K   | 0.73M | 0.57  | 3    | 94.41 | 62.49 | 61.96
NP-S2    | 91.3K   | 0.73M | 0.57  | 3    | 95.21 | 60.54 | 61.86
FS-S1    | 91.3K   | 0.73M | 0.57  | 1    | 96.88 | 90.67 | 90.73
Table 1: FL with Different Pruning Techniques on MNIST
Model    | FLOPs  | FLOPs Ratio | Space  | Space Ratio | Γmin | Accuracy
FullNets | 653.8K | 1.00        | 512.8K | 1.00        | 10   | 53.63
WP-L1    | 619.6K | 0.94        | 482.3K | 0.94        | 8    | 53.12
FS-L1    | 619.6K | 0.94        | 476.3K | 0.93        | 8    | 53.08
WP-M1    | 587.0K | 0.89        | 451.9K | 0.89        | 6    | 52.66
*WP-M2   | 587.0K | 0.89        | 451.9K | 0.89        | 7    | 52.99
*WP-M3   | 587.0K | 0.89        | 451.9K | 0.89        | 8    | 54.20
FS-M1    | 585.5K | 0.85        | 440.0K | 0.89        | 6    | 51.87
WP-S1    | 553.7K | 0.84        | 421.5K | 0.82        | 4    | 51.69
*WP-S2   | 553.7K | 0.84        | 421.5K | 0.82        | 7    | 52.20
FS-S1    | 551.4K | 0.84        | 403.5K | 0.78        | 4    | 50.96
Table 2: FL with Different Pruning Techniques on CIFAR-10 (IID)

5.3 Impact of the minimum coverage index

Our theory suggests that, for a given pruning level, the convergence bound decreases hyperbolically with the minimum coverage index, as Theorem 1 indicates. Hence, for a given pruning level, a higher minimum coverage index may lower the error floor at convergence, since the bound contains a term proportional to $1/\Gamma_{\min}$, which could potentially lead to better performance. Note that existing pruning techniques, and federated learning with heterogeneous models obtained by such pruning, always discard the partition whose parameters are smaller than a certain threshold determined by the pruning policy and pruning level.

To illustrate the importance of the minimum coverage index, consider the parameters of a model argsorted according to a given pruning policy and split into four sets, from the highest 25% partition to the lowest 25% partition, denoted $S_1, S_2, S_3, S_4$. A mask generated by an existing pruning technique for a 75%-sparsity model then keeps $S_1 \cup S_2 \cup S_3$. It is easy to see that $\Gamma_{\min}$ is then directly determined by the number of models with the highest pruning levels, e.g., for pruning level M (40% of workers with full models and 60% of workers with 75% pruned models), only the full models cover the lowest partition.

To verify the impact of the minimum coverage index, we propose a way to increase it without changing the network architecture or introducing extra computation overhead: we simply increase the usage of the parameters below the pruning threshold by letting some local models train with the otherwise pruned partitions.

As an example shown in Figure 1, for the model with code name *WP-M1, 2 out of the 6 pruned models use the regular 75% weight-pruning mask, while the other 4 are assigned masks of the same size that include the otherwise-discarded lowest partition in place of one of the retained partitions, so that $\Gamma_{\min} = 8$ is achieved (a sketch of this construction is given below). We denote such designs that maximize the minimum coverage index on top of current pruning techniques with a STAR (*) in the results. For detailed case settings and corresponding pruning techniques, see Appendix 2.
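The construction can be sketched as follows; the particular rotation of which quartile each mask drops is one illustrative choice of our own, and any assignment that covers $S_4$ equally well yields the same $\Gamma_{\min}$.

import numpy as np

def quartile_partitions(w):
    """Split parameter indices into S1..S4 by descending magnitude, 25% each."""
    order = np.argsort(-np.abs(w))
    return np.array_split(order, 4)

def mask_from_parts(parts, keep, size):
    m = np.zeros(size)
    for p in keep:
        m[parts[p]] = 1.0
    return m

w = np.random.randn(1000)
S = quartile_partitions(w)
full = np.ones_like(w)

# Regular WP at level M: 4 full models + 6 models keeping the top 75% (S1 ∪ S2 ∪ S3)
baseline = [full] * 4 + [mask_from_parts(S, (0, 1, 2), w.size)] * 6

# STAR(*) variant: same per-model density, but 4 of the pruned models trade one
# retained quartile for the otherwise-discarded S4, raising the coverage of S4
star = ([full] * 4
        + [mask_from_parts(S, (0, 1, 2), w.size)] * 2
        + [mask_from_parts(S, (0, 1, 3), w.size)] * 2
        + [mask_from_parts(S, (0, 2, 3), w.size)] * 2)

for name, masks in (("regular", baseline), ("star", star)):
    print(name, "Γ_min =", int(np.stack(masks).sum(axis=0).min()))
# regular Γ_min = 4, star Γ_min = 8, with identical per-model sparsity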

As shown in Figure 2(c), under identical model settings with the same pruning level, the two pruning techniques with different minimum coverage indices show different convergence behavior; specifically, the design with the higher minimum coverage index reaches a solution with lower loss within the training round limit. It is also observed in Table 1 and Table 2 that these designs reach higher accuracy with both IID and non-IID data, and for non-IID data the improvements are more significant. There are even cases where a setting under our design with lower communication and computation costs performs better than a regular design with higher costs, e.g., *WP-M1 over WP-L1 on both IID and non-IID data. More examples and results can be found in Appendix 3.

5.4 Impact of pruning-induced noise

As suggested by our theory, another key factor that contributes to convergence is the pruning-induced noise $\delta^2$. When a client model is obtained by pruning or a sparse mask, the pruning-induced noise inevitably affects convergence and model accuracy. Given the same minimum coverage index, a smaller pruning-induced noise generally leads to a lower convergence error and potentially a more accurate model.

To examine this phenomenon, we focus on the fixed sub-network (FS) method, which does not involve adaptive mask changes, and test higher pruning levels; the results in Figure 2(d) confirm this trend. As shown in Figure 2(b), all selected pruning methods are affected by the change of pruning level. In Figure 2(f), model WP2-1 trains with a relatively steady trend, while model WP3-1 becomes unsteady, which could be due to the pruning mask changing before local convergence. This may suggest that at high pruning levels, without a properly designed pruning technique, using a fixed sub-network may yield a more robust and accurate model.

5.5 More Discussions and Empirical Findings

Besides verifying our theory, we additionally noticed several phenomena, some of which confirm previous research while others may require further investigation for theoretical support.

In Figure 2(a), under a similar pruning level, PT converges much more slowly than the others with its pre-trained mask, consistent with previous works showing that models with static sparse parameters converge to a solution with higher loss than models with dynamic sparse training. It also suggests that it is unlikely to find such a lottery ticket for an optimal mask within a limited number of training rounds, especially without a carefully designed algorithm.

Although a higher pruning level generally results in a higher training loss, different pruning techniques have different sensitivity to it. For example, the fixed sub-network approach has relatively low sensitivity to high pruning levels; this could be because a static continuous mask avoids changing the pruning mask before local convergence, which makes it more stable across pruning levels.

Nevertheless, while in most cases our design of increasing the minimum coverage index delivers improvements in model accuracy and global loss, pruning-induced noise is another key factor to watch: at higher pruning levels, a design that merely focuses on increasing the minimum coverage index may not bring significant improvements.

Finally, in Appendix 4 we also show a synthetic special case where the proposed necessary condition is not met, i.e., the local clients' masks do not jointly cover the whole model; in this situation the model did not learn a usable solution.

6 Conclusion

In this paper, we establish (for the first time) sufficient conditions for FL with heterogeneous local models and arbitrary adaptive pruning to converge to a stationary point of standard FL, at a rate of $O(1/\sqrt{Q})$. The result applies to general smooth cost functions and recovers a number of important FL algorithms as special cases. The analysis advocates designing pruning strategies with respect to both the minimum coverage index $\Gamma_{\min}$ and the pruning-induced noise $\delta^2$. We further empirically demonstrated the correctness of the theory and the performance of the proposed design. Our work provides a theoretical understanding of FL with heterogeneous clients and dynamic pruning, and presents valuable insights for FL algorithm design, which will be further explored in future work.

References

Appendix A Proof of Theorems 1 and 2

a.1 Problem summary and notations

We summarize the algorithm in a way that presents the convergence analysis more easily. We use a superscript $(n)$, as in $w^{(n)}$, $m^{(n)}$, and $\nabla F^{(n)}$, to denote the sub-vector of the parameters, mask, and gradient corresponding to region $n$. In each round $q$, the parameters in region $n$ are contained in, and only in, a set of local models denoted by $\mathcal{N}_n^q$, implying that $m_i^{q,(n)} = \mathbf{1}$ for $i \in \mathcal{N}_n^q$ and $m_i^{q,(n)} = \mathbf{0}$ otherwise. We define $\Gamma_{\min} = \min_{n,q} |\mathcal{N}_n^q|$ as the minimum coverage index, since it denotes the minimum number of local models that contain the parameters of any region. With a slight abuse of notation, we use $\nabla F_i(w)$ and $\nabla F_i(w, \xi)$ to denote the gradient and the stochastic gradient, respectively.

Input: Local data $D_i$ on $N$ local workers, learning rate $\gamma$, pruning policy $P$, number of local epochs $T$, global model parameterized by $w^0$.
Executes:
Initialize $w^0$
for round $q = 0, 1, \ldots, Q-1$ do
       for local workers $i = 1, \ldots, N$ (in parallel) do
             Generate mask $m_i^q$ from pruning policy $P$
             Prune $w_i^{q,0} = w^q \odot m_i^q$
             Update local models:
             for epoch $t = 1, \ldots, T$ do
                    Update $w_i^{q,t} = w_i^{q,t-1} - \gamma \, m_i^q \odot \nabla F_i(w_i^{q,t-1}, \xi_i^{q,t})$
             end for
       end for
       Update global model:
       for region $n$ do
             Find $\mathcal{N}_n^q = \{i : m_i^{q,(n)} = \mathbf{1}\}$
             Update $w^{q+1,(n)} = \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} w_i^{q,T,(n)}$
       end for
end for
Output $w^Q$
Algorithm 2 Heterogeneous FL with adaptive online model pruning

a.2 Assumptions

Assumption 6.

(Smoothness). Cost functions $F_1, \ldots, F_N$ are all L-smooth: for any $i$ and any $w, w'$, we assume that there exists $L > 0$ such that

$\|\nabla F_i(w) - \nabla F_i(w')\| \le L \|w - w'\|.$   (11)

Assumption 7.

(Pruning-induced Error). We assume that for some $\delta^2 \in [0, 1)$ and any $q, i$, the pruning-induced error is bounded by

$\|w^q \odot m_i^q - w^q\|^2 \le \delta^2 \|w^q\|^2.$   (12)

Assumption 8.

(Bounded Gradient). The expected squared norm of stochastic gradients is bounded uniformly, i.e., for a constant $G > 0$ and any $i, q, t$:

$\mathbb{E} \|\nabla F_i(w_i^{q,t}, \xi_i^{q,t})\|^2 \le G^2.$   (13)

Assumption 9.

(Gradient Noise for IID data). Under IID data distribution, for any $i, q, t$, we assume that

$\mathbb{E} [\nabla F_i(w_i^{q,t}, \xi_i^{q,t})] = \nabla F(w_i^{q,t}),$   (14)
$\mathbb{E} \|\nabla F_i(w_i^{q,t}, \xi_i^{q,t}) - \nabla F(w_i^{q,t})\|^2 \le \sigma^2,$   (15)

where $\sigma^2 > 0$ is a constant and the samples $\xi_i^{q,t}$ are independent for different $i$ and $t$.

Assumption 10.

(Gradient Noise for non-IID data). Under non-IID data distribution, we assume that for a constant $\sigma^2 > 0$, any $q$ and region $n$, and any $w$:

$\mathbb{E} \Big[ \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} \nabla F_i^{(n)}(w, \xi_i) \Big] = \nabla F^{(n)}(w),$   (16)
$\mathbb{E} \Big\| \frac{1}{|\mathcal{N}_n^q|} \sum_{i \in \mathcal{N}_n^q} \nabla F_i^{(n)}(w, \xi_i) - \nabla F^{(n)}(w) \Big\|^2 \le \sigma^2.$   (17)

a.3 Convergence Analysis

We now analyze the convergence of heterogeneous FL under adaptive online model pruning with respect to any pruning policy $P$ (and the resulting masks $m_i^q$) and prove the main theorems in this paper. We need to overcome a number of challenges, as follows:

  • We begin the proof by analyzing the change of the loss function in one round as the model goes from $w^q$ to $w^{q+1}$, i.e., $\mathbb{E}[F(w^{q+1})] - \mathbb{E}[F(w^q)]$. One round includes three major steps: pruning to obtain the heterogeneous local models $w^q \odot m_i^q$, training the local models in a distributed fashion to update $w_i^{q,t}$, and parameter aggregation to update the global model $w^{q+1}$.

  • Due to the use of heterogeneous local models, whose masks both vary over rounds and differ across workers, we first characterize the difference between (i) the local model $w_i^{q,t}$ at any epoch $t$ and (ii) the global model $w^q$ at the beginning of the current round. It is easy to see that this difference can be factorized into two parts: the pruning-induced error and the local training updates, which are analyzed in Lemma 1.

  • We characterize the impact of heterogeneous local models on the global parameter update. Specifically, we use an ideal local gradient as a reference point and quantify the difference between the aggregated local gradients and the ideal gradient. This is presented in Lemma 2. We also quantify the norm difference between a gradient and a stochastic gradient (with respect to the global update step) using the gradient noise assumptions, in Lemma 3.

  • Since IID and non-IID data distributions in our model differ only in the gradient noise assumption (i.e., Assumption 4 versus Assumption 5), we present a unified proof for both cases. We explicitly distinguish IID and non-IID data distributions only where the two cases require different treatment (when the gradient noise assumptions are needed). Otherwise, the derivations and proofs are identical for both cases.

We will begin by proving a number of lemmas and then use them for convergence analysis.

Lemma 1.

Under Assumption 2 and Assumption 3, for any $q$, $i$, and $t \le T$, we have:

$\mathbb{E} \|w_i^{q,t} - w^q\|^2 \le 2 T^2 \gamma^2 G^2 + 2 \delta^2 \|w^q\|^2.$   (18)

Proof.

We note that $w^q$ is the global model at the beginning of the current round. We split the difference $w_i^{q,t} - w^q$ into two parts: the change due to local model training, $w_i^{q,t} - w_i^{q,0}$, and the change due to pruning, $w_i^{q,0} - w^q$. That is,

$\mathbb{E} \|w_i^{q,t} - w^q\|^2 \le 2\, \mathbb{E} \|w_i^{q,t} - w_i^{q,0}\|^2 + 2\, \mathbb{E} \|w_i^{q,0} - w^q\|^2,$   (19)

where we used the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ in the last step.

For the first term in Eq. (19), we notice that $w_i^{q,t}$ is obtained from $w_i^{q,0}$ through $t$ epochs of local model updates on worker $i$. Using the local gradient updates of the algorithm, it is easy to see:

$\mathbb{E} \|w_i^{q,t} - w_i^{q,0}\|^2 = \mathbb{E} \Big\| \sum_{s=1}^{t} \gamma \, m_i^q \odot \nabla F_i(w_i^{q,s-1}, \xi_i^{q,s}) \Big\|^2 \le t \gamma^2 \sum_{s=1}^{t} \mathbb{E} \big\| m_i^q \odot \nabla F_i(w_i^{q,s-1}, \xi_i^{q,s}) \big\|^2 \le T^2 \gamma^2 G^2,$   (20)

where we use the fact that $\|\sum_{s=1}^{t} a_s\|^2 \le t \sum_{s=1}^{t} \|a_s\|^2$ in step 2 above, and the fact that $m_i^q$ is a binary mask in step 3 above together with Assumption 3 for bounded gradients.

For the second term in Eq. (19), the difference results from model pruning using mask $m_i^q$ of worker $i$ in round $q$. We have

$\mathbb{E} \|w_i^{q,0} - w^q\|^2 = \mathbb{E} \|w^q \odot m_i^q - w^q\|^2 \le \delta^2 \|w^q\|^2,$   (21)

where we used the fact that $w_i^{q,0} = w^q \odot m_i^q$ in step 1 above, and Assumption 2 in step 2 above.

Plugging Eq. (20) and Eq. (21) into Eq. (19), we obtain the desired result. ∎

Lemma 2.

Under Assumptions 1-3, for any $q$, we have:

(22)

Proof.

Recall that $|\mathcal{N}_n^q|$ is the number of local models containing the parameters of region $n$ in round $q$. The left-hand side of Eq. (22) denotes the difference between an average gradient of the heterogeneous models (through aggregation and over time) and an ideal gradient. The summation over $n$ adds up this difference over all regions, because the average gradient takes a different form in different regions.

Using the inequality $\|\sum_{i=1}^{k} a_i\|^2 \le k \sum_{i=1}^{k} \|a_i\|^2$ on the left-hand side of Eq. (22), we get:

(23)

where we relax the inequality by choosing the smallest $|\mathcal{N}_n^q|$ (i.e., $\Gamma_{\min}$) and changing the summation over $\mathcal{N}_n^q$ to a summation over all workers in the second step. In the third step, we use the fact that the squared norm of a vector is equal to the sum of the squared norms of all its sub-vectors (i.e., over regions $n$). This allows us to consider the difference of full gradient vectors instead of their sub-vectors on different regions.

Finally, the last step follows directly from L-smoothness in Assumption 1. Under Assumptions 2-3, we notice that the last step of Eq. (23) is further bounded by Lemma 1, which yields the desired result of this lemma after re-arranging the terms. ∎

Lemma 3.

For IID data distribution, under Assumption 4, for any $q$ we have:

For non-IID data distribution, under Assumption 5, for any $q$ we have:

Proof.

This lemma quantifies the squared norm of the difference between the gradient and the stochastic gradient in the global parameter update. We present results for both the IID and non-IID cases under Assumption 4 and Assumption 5, respectively.

We first consider IID data distributions. Since all the samples $\xi_i^{q,t}$ are independent of each other for different $i$ and $t$, the differences between gradients and stochastic gradients, i.e., $\nabla F_i(w_i^{q,t}, \xi_i^{q,t}) - \nabla F(w_i^{q,t})$, are independent gradient noise terms. Due to Assumption 4, these gradient noise terms have zero mean. Using the fact that $\mathbb{E}\|\sum_i z_i\|^2 = \sum_i \mathbb{E}\|z_i\|^2$ for zero-mean and independent $z_i$, we get: