LotteryFL: Personalized and Communication-Efficient Federated Learning with Lottery Ticket Hypothesis on Non-IID Datasets

08/07/2020 · Ang Li et al., Duke University

Federated learning is a popular distributed machine learning paradigm with enhanced privacy. Its primary goal is to learn a global model that offers good performance for as many participants as possible. The technology is rapidly advancing, with many unsolved challenges, among which statistical heterogeneity (i.e., non-IID data) and communication efficiency are two critical ones that hinder the development of federated learning. In this work, we propose LotteryFL – a personalized and communication-efficient federated learning framework that exploits the Lottery Ticket hypothesis. In LotteryFL, each client learns a lottery ticket network (i.e., a subnetwork of the base model) by applying the Lottery Ticket hypothesis, and only these lottery ticket networks are communicated between the server and clients. Rather than learning a shared global model as in classic federated learning, each client learns a personalized model via LotteryFL; the communication cost can be significantly reduced due to the compact size of the lottery ticket networks. To support the training and evaluation of our framework, we construct non-IID datasets based on MNIST, CIFAR-10 and EMNIST by taking feature distribution skew, label distribution skew and quantity skew into consideration. Experiments on these non-IID datasets demonstrate that LotteryFL significantly outperforms existing solutions in terms of personalization and communication cost.


1 Introduction

Federated learning (FL) McMahan et al. (2017) is a popular distributed machine learning framework that enables a number of clients to train a shared global model collaboratively without transferring their local data. A central server coordinates the FL process, where each participating client communicates only the model parameters with the central server while keeping its local data private. In this way, FL overcomes privacy challenges and allows machine learning models to learn from decentralized data. FL has been applied to many practical applications where data is distributed across clients and too sensitive to be aggregated into a central repository. For example, FL has demonstrated strong performance for the next-word-prediction task on smartphones Hard et al. (2018).

The clients that participate in the FL process expect to obtain a shared global model that performs better than the models they could train individually. In practice, the distribution of data across clients is inherently non-IID (not independently and identically distributed). Such statistical heterogeneity makes it difficult to train a shared global model that generalizes well to all clients McMahan et al. (2017); Li et al. (2019). Many studies attempt to mitigate the statistical heterogeneity by performing personalization in FL, exploiting meta-learning Jiang et al. (2019); Khodak et al. (2019); Chen et al. (2018), multi-task learning Smith et al. (2017); Zantedeschi et al. (2019), transfer learning Wang et al. (2019); Mansour et al. (2020), etc. These approaches commonly require two separate steps: 1) training a global model collaboratively and 2) adapting the model to each client using its local data. Such a two-step process for personalization inevitably induces extra overhead. Communication cost is another major bottleneck of FL, as the communication links between the central server and the participating clients typically operate at low rates and can be expensive. Thus, a straightforward approach to alleviating the communication bottleneck is to compress the data communicated between the server and the clients. Common practices include sparsification Konečnỳ et al. (2016), quantization Alistarh et al. (2017), etc. However, very few efforts have addressed these two critical challenges simultaneously. The only possible exception is LG-FedAvg, proposed by Liang et al. (2020). However, LG-FedAvg is built on an unrealistic FL setting, where each client holds abundant training data (300 images/class of MNIST and 250 images/class of CIFAR-10). Such a condition implies that a client could already obtain good performance by training a model locally, which runs against the motivation of FL. Instead, in this work, we consider a more realistic and challenging scenario, where each client owns limited data (e.g., 5 images/class). As such, no client is able to train a local model with the desired performance.

Our work: We design LotteryFL – a personalized and communication-efficient FL framework that exploits the Lottery Ticket hypothesis Frankle and Carbin (2018). The Lottery Ticket hypothesis offers a simple procedure to discover Lottery Ticket Networks (LTNs), which are sparse subnetworks within a large base model. Surprisingly, the performance of these LTNs often exceeds that of the non-sparse base model given the same training effort. Inspired by this property, we propose to seek the LTN of each client during each communication round, and to communicate only the parameters of the LTNs between the clients and the server. After aggregating the clients' LTNs, the server distributes the updated parameters of the corresponding LTN to each client. Finally, a personalized model, instead of a shared global model, is learned at each client. Since the LTN is determined by pruning the base model using the local data of each client, the data-dependent features have already been incorporated into the LTN. Given the non-IID data distribution across clients, the LTNs of different clients may not significantly overlap with each other, so the personalization property of each LTN is retained after the aggregation is performed on the server. Additionally, due to the compact size of the LTN, the volume of model parameters that needs to be communicated is reduced, and the communication efficiency of FL is significantly improved accordingly.
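To make the pruning step concrete, below is a minimal PyTorch sketch of one magnitude-pruning round with weight rewinding, the core operation used to discover an LTN. This is our illustration under stated assumptions, not the authors' released code: `mask` and `init_weights` are assumed to be dicts keyed by parameter name, with `init_weights` captured once at initialization (e.g., `{n: p.detach().clone() for n, p in model.named_parameters()}`).

```python
import torch

def prune_step(model, mask, init_weights, rate=0.2):
    """One round of magnitude pruning with weight rewinding (hypothetical
    helper): drop the smallest `rate` fraction of each layer's surviving
    weights, then reset the survivors to their initial values."""
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue  # prune weight tensors only; leave biases intact
        m = mask[name]
        k = int(m.sum().item() * rate)  # number of surviving weights to drop
        if k == 0:
            continue
        surviving = param.data.abs()[m.bool()]    # magnitudes of kept weights
        threshold = surviving.kthvalue(k).values  # k-th smallest magnitude
        mask[name] = m * (param.data.abs() > threshold).float()
        # rewinding: surviving weights return to their original initialization
        param.data = init_weights[name] * mask[name]
    return mask
```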

It is worth noting that no benchmark datasets have yet been specifically designed to support research on FL under non-IID settings. Therefore, we also construct and publish several datasets (https://github.com/jeremy313/non-iid-dataset-for-personalized-federated-learning) to represent the characteristics of practical FL environments under non-IID settings. These non-IID datasets are constructed based on three classical datasets: MNIST LeCun et al. (1995), CIFAR-10 Krizhevsky et al. (2009), and EMNIST Cohen et al. (2017). To quantitatively evaluate the degree of non-IID data distribution across clients, we define a metric named Client-Wise Non-IID Index (CNI). Based on CNI, we explore the impact of non-IID distribution on model performance. Our key contributions can be summarized as follows:

  • We propose a novel FL framework, namely LotteryFL, that achieves both personalization and high communication efficiency under non-IID settings;

  • Based on classical datasets, we construct several datasets to support FL under non-IID settings. We also define a metric, the Client-Wise Non-IID Index (CNI), to quantitatively evaluate the degree of non-IID data distribution across clients.

  • We conduct experiments on our designed non-IID datasets and compare LotteryFL with two existing methods – FedAvg McMahan et al. (2017) and LG-FedAvg Liang et al. (2020). Experimental results show that LotteryFL significantly outperforms the compared methods in terms of both personalization and communication cost.

2 Related Work

FL McMahan et al. (2017) is a distributed machine learning paradigm with enhanced privacy. The primary goal of FL is to learn a global model that achieves good performance for almost all participants. FedAvg proposed by McMahan et al. McMahan et al. (2017) is one of the most widely adopted FL methods, which adopts averaging as its aggregation method over the local models of participating clients. However, FL is still facing many challenges Kairouz et al. (2019); Li et al. (2019), among which statistical heterogeneity and communication efficiency are two critical ones that hinder the development of FL.

Personalization

Due to statistical heterogeneity (i.e., non-IID data distribution across clients), it is necessary to adapt the global model to achieve personalization. Existing works achieve personalization via meta-learning Jiang et al. (2019); Khodak et al. (2019); Chen et al. (2018), multi-task learning Smith et al. (2017); Zantedeschi et al. (2019), transfer learning Wang et al. (2019); Mansour et al. (2020), etc. However, all the existing works achieve personalization in two separate steps that are associated with extra overhead: 1) a global model is learned in a federated fashion, and 2) the global model is fine-tuned for each client using the local data.

Communication Efficiency

Communication is a major bottleneck for FL since the communication links between clients and the server typically operate at low rates and can be expensive. Some studies Alistarh et al. (2017); Konečnỳ et al. (2016); Ivkin et al. (2019) aim to reduce communication costs in FL. The key idea is to reduce the size of the data communicated between the server and clients via combining FedAvg with data compression techniques, e.g., sparsification, quantization, sketching, etc.

To the best of our knowledge, Liang et al. (2020) proposed the first FL method, namely LG-FedAvg, to address the above two challenges simultaneously. However, the problem setting of LG-FedAvg cannot represent realistic FL environments. In this paper, we develop a personalized and communication-efficient FL method under a more realistic FL setting.

3 Design of LotteryFL

Figure 1: Overview of LotteryFL.

At a high level, LotteryFL combines the Lottery Ticket hypothesis with FedAvg in an end-to-end manner. An overview of LotteryFL is shown in Figure 1. Each participating client learns to find a Lottery Ticket Network (LTN) by applying the Lottery Ticket hypothesis Frankle and Carbin (2018). In particular, the LTN is learned by pruning the base model using the local data of each client. Only the parameters of the LTNs, instead of the full base model, are communicated between the clients and the server. The server then performs aggregation over the received LTNs only, and the updated parameters of the corresponding LTN are sent back to each client. The clients continue the training process after updating the parameters of their LTNs. Before presenting the details of learning local LTNs and performing the global aggregation, we first define the notations used in this paper.

Notations: We denote by $\mathcal{N}$ the set of $N$ available clients, where $k$ indexes the $k$-th client, and by $\mathcal{S}_t$ the set of randomly selected clients in round $t$. Let $\theta_g$ be the parameters of the base model on the global server, initialized with random values $\theta_0$, and let $\theta_k$ ($k \in \mathcal{N}$) represent the local model parameters on client $k$. We also use the superscript $t$, e.g., $\theta_k^t$, to represent the model parameters learned in the $t$-th round. Each client also learns a local binary mask $m_k$, which indicates the LTN identified via applying the Lottery Ticket hypothesis; therefore, $\theta_k \odot m_k$ denotes the parameters of the corresponding LTN at client $k$, where $\odot$ is the element-wise product. Given the data $D_k$ held by client $k$, we split $D_k$ into the training data $D_k^{train}$, validation data $D_k^{val}$, and test data $D_k^{test}$.

3.1 Training Algorithm

Compared to FedAvg, the key difference is that only LTNs are communicated between the clients and the server. As a result, the aggregation on the server is also performed on the LTNs only in each communication round. The details of the training algorithm of LotteryFL are elaborated in Algorithm 1. In general, the training algorithm has the following steps:

Step I: In the $t$-th communication round, the server randomly samples a set of clients $\mathcal{S}_t$.

Step II: Each client $k \in \mathcal{S}_t$ downloads its corresponding LTN $\theta_g^t \odot m_k$ from the server. Here, $m_k$ is the local mask of client $k$, which indicates the LTN inside $\theta_g^t$.

Step III: Each client $k$ starts training the local model with $\theta_g^t \odot m_k$ and evaluates it on the validation data $D_k^{val}$. If the validation accuracy exceeds a predefined threshold $acc_{threshold}$ and the current pruning rate $r_k$ has not reached the target pruning rate $r_{target}$, the client prunes the smallest weights of $\theta_g^t \odot m_k$ at a fixed pruning rate $r_p$. Once the pruning completes, the client obtains an updated mask $m_k$ indicating the weights retained in $\theta_g^t$. In fact, $\theta_g^t \odot m_k$ is the learned LTN of client $k$ for the next communication round, and the data-dependent features have already been integrated into the learned mask. Following the workflow of the Lottery Ticket hypothesis, we re-initialize the weights of the LTN (i.e., $\theta_g^t \odot m_k$) to the initial values of these weights in $\theta_0$, which is used to randomly initialize $\theta_g$ at the beginning and is stored on each client.

Step IV: Each client performs mini-batch training for $E$ epochs using $D_k^{train}$, and then the updated LTN $\theta_k^{t+1} \odot m_k$ is sent to the server.

Step V: The server performs aggregation over the received LTNs only (i.e., $\{\theta_k^{t+1} \odot m_k\}_{k \in \mathcal{S}_t}$) via FedAvg, and updates the corresponding parameters in $\theta_g^t$ to obtain $\theta_g^{t+1}$.

The above process repeats until $t$ reaches a predefined number of communication rounds. Finally, each client $k$ learns a personalized model $\theta_k \odot m_k$.

Server executes:
    initialize the global model with $\theta_g^0 = \theta_0$; $N$ available clients; random sampling rate $C$
    for each round $t = 1, 2, \dots$ do
        $\mathcal{S}_t \leftarrow$ a set of $\max(C \cdot N, 1)$ randomly selected participating clients
        for each client $k \in \mathcal{S}_t$ in parallel do
            $\theta_k^{t+1} \odot m_k \leftarrow$ ClientUpdate($k$, $\theta_g^t \odot m_k$)   ($m_k$, the mask of the $k$-th client, indicates the corresponding LTN)
        end for
        $\theta_g^{t+1} \leftarrow$ FedAvg over the LTNs $\{\theta_k^{t+1} \odot m_k\}_{k \in \mathcal{S}_t}$ (aggregate LTNs)
    end for

ClientUpdate($k$, $\theta_k^t \odot m_k$):
    $acc \leftarrow$ evaluate $\theta_k^t \odot m_k$ with local validation data $D_k^{val}$
    if $acc > acc_{threshold}$ and $r_k$ has not reached $r_{target}$ then   ($r_k$ is the current pruning rate of the $k$-th client's model, $r_{target}$ is the target pruning rate)
        $m_k \leftarrow$ prune $\theta_k^t$ at the fixed pruning rate $r_p$ (get a new mask for the LTN)
        $\theta_k^t \odot m_k \leftarrow \theta_0 \odot m_k$ (reset the masked parameters to the corresponding values in $\theta_0$)
    end if
    $\mathcal{B} \leftarrow$ split local data $D_k^{train}$ into batches of size $B$
    for each local epoch $i$ from $1$ to $E$ do
        for batch $b \in \mathcal{B}$ do
            $\theta_k^{t+1} \leftarrow \theta_k^t - \eta \nabla \ell(\theta_k^t \odot m_k; b)$   ($\eta$ is the learning rate, $\ell$ is the loss function)
        end for
    end for
    return $\theta_k^{t+1} \odot m_k$ to the server
Algorithm 1: Training Algorithm of LotteryFL.
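To complement the pseudocode, here is a runnable Python sketch of ClientUpdate and the server-side LTN aggregation, reusing the `prune_step` helper sketched in the introduction. The hyperparameter names ($r_p$, $r_{target}$, $acc_{threshold}$, $E$) follow our notation, but the code, including the per-parameter averaging in `aggregate_ltns`, is one plausible reading of Steps III–V rather than the authors' implementation; here we track the pruning rate as the fraction of weights still remaining.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, loader):
    """Validation accuracy used for the pruning check in Step III."""
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

def client_update(model, mask, init_weights, train_loader, val_loader,
                  r_current, r_target=0.5, r_p=0.2, acc_threshold=0.5,
                  epochs=10, lr=0.01):
    # Step III: prune and rewind once the local model performs well enough
    if evaluate(model, val_loader) > acc_threshold and r_current > r_target:
        mask = prune_step(model, mask, init_weights, rate=r_p)
        kept = sum(m.sum().item() for m in mask.values())
        total = sum(m.numel() for m in mask.values())
        r_current = kept / total  # fraction of weights still remaining
    # Step IV: mini-batch training; pruned weights stay frozen at zero
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            for name, p in model.named_parameters():
                if name in mask:
                    p.grad *= mask[name]  # zero out gradients of pruned weights
            opt.step()
    # only the LTN (masked parameters) is uploaded to the server
    ltn = {n: p.data * mask[n] for n, p in model.named_parameters() if n in mask}
    return ltn, mask, r_current

def aggregate_ltns(global_params, client_ltns, client_masks):
    """Step V (one plausible reading): each weight is averaged only over the
    clients whose masks retain it; uncovered weights keep their global values."""
    for name in global_params:
        stacked = torch.stack([ltn[name] for ltn in client_ltns])
        counts = torch.stack([m[name] for m in client_masks]).sum(dim=0)
        covered = counts > 0
        avg = stacked.sum(dim=0) / counts.clamp(min=1)
        global_params[name][covered] = avg[covered]
    return global_params
```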

4 Non-IID Datasets

In this section, we first construct a set of non-IID datasets designed to capture the characteristics of FL in practice. Then, we define a novel metric, the Client-Wise Non-IID Index (CNI), to quantitatively measure the degree of non-IID data distribution across clients.

4.1 Non-IID Dataset Generation

We construct non-IID datasets based on the classic datasets MNIST LeCun et al. (1995), CIFAR-10 Krizhevsky et al. (2009) and EMNIST Cohen et al. (2017). Specifically, we focus on feature distribution skew, label distribution skew and quantity skew, three major ways in which data can be non-identically distributed across clients.

Feature distribution skew: It is common that data across clients have different features that may nonetheless correspond to the same label. Handwriting recognition is a typical example, where users write the same character but with different stroke widths, slants, etc.

Label distribution skew: Given a specific label, each client may hold a vastly different amount of data corresponding to that label. For example, when clients are tied to specific locations, the label distribution can vary across clients – certain apparel is used by one demographic but not others.

Quantity skew: Given the dataset on a specific client, the amount of data between different labels can be significantly unbalanced.

In practice, real datasets for FL may contain a mixture of the above effects. To strengthen the motivation for participating in FL, we assume that no client owns sufficient data to train a local model with the desired performance. Therefore, we design two specific configurations to construct non-IID datasets based on MNIST and CIFAR-10.

n-class & balanced. Each client holds training data from $n$ classes, and these classes can differ across clients. The data volume of each class is balanced; however, the total volume of the $n$-class training data is so limited that it is infeasible to train a local model with good performance. In addition, the test data follow the same distribution as the training data. This configuration can represent the feature distribution skew case.

n-class & unbalanced. Each client holds training data from $n$ classes, and the classes can vary across clients. The data volume of each class is unbalanced, and the amount of $n$-class data is insufficient to train a model locally with good performance. The test data also follow the same distribution as the training data. This configuration stands for a mixture of feature distribution skew and quantity skew.
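As an illustration of these configurations, below is a minimal NumPy sketch of the 'n-class & balanced' partitioning; the function name and defaults (400 clients, n = 2, 20 samples per class) mirror the experimental setup in Section 5, but this is our sketch rather than the published generation script. The unbalanced variant would additionally subsample one of the two classes by the balance rate defined in Section 5.1 (e.g., keeping 5 of 20 samples at rate 0.25).

```python
import numpy as np

def n_class_balanced_split(labels, num_clients=400, n=2,
                           samples_per_class=20, seed=0):
    """Return one index array per client, covering `n` randomly chosen
    classes with `samples_per_class` examples of each (hypothetical helper
    for the 'n-class & balanced' setting)."""
    rng = np.random.default_rng(seed)
    # shuffle the available indices of every class once, then hand them out
    by_class = {c: rng.permutation(np.where(labels == c)[0]).tolist()
                for c in np.unique(labels)}
    clients = []
    for _ in range(num_clients):
        chosen = rng.choice(sorted(by_class), size=n, replace=False)
        idx = [by_class[c].pop()
               for c in chosen for _ in range(samples_per_class)]
        clients.append(np.asarray(idx))
    return clients
```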

EMNIST, an extension of MNIST, represents a more challenging classification task involving both letters and digits. EMNIST shares the same image structure and parameters as the original MNIST. Moreover, each image in EMNIST is associated with a 'By_Author' attribute, which identifies the writer of the image. Therefore, EMNIST can be naturally transformed into a non-IID dataset by grouping the handwritten character images by writer; in our experiments, each client holds the handwritten characters of a single writer.

4.2 Client-Wise Non-IID Index

To explore the impact of non-IID data distribution on the model performance, we need to quantitatively measure the degree of non-IID across clients. He et al. (2019) define a metric named Non-IID Index (NI) to quantify the distribution shift between training data and test data. However, applying NI to different datasets requires training different feature extractors and classifiers, which is not practical in FL. We aim to quantify the degree of non-IID across clients in a simple and unified way. Specifically, we propose the Client-Wise Non-IID Index (CNI), which replaces the trained feature extractors in NI with a fixed encoder $f$.

Definition 1 (Client-Wise Non-IID Index (CNI)).

Given a fixed encoder $f$ and a participating client $k$ in FL, the CNI of client $k$ is defined as:

$$\mathrm{CNI}(k) = \frac{1}{|\mathcal{C}_k|} \sum_{i \in \mathcal{C}_k} \left\lVert \frac{\mu\big(f(D_{k,i})\big) - \mu\big(f(D_{\bar{k},i})\big)}{\sigma\big(f(D_{k,i}) \cup f(D_{\bar{k},i})\big)} \right\rVert_2 \qquad (1)$$

where $D_{k,i}$ denotes the data belonging to the $i$-th class in $D_k$ and $D_{\bar{k},i}$ denotes the data of the $i$-th class held by all the other clients, $|\mathcal{C}_k|$ is the number of classes in $D_k$, $\mu(\cdot)$ represents the first-order moment, $\sigma(\cdot)$ is the standard deviation and is used to normalize the scale, and $\lVert\cdot\rVert_2$ indicates the 2-norm. The intuition behind Equation (1) is to measure, in feature space, the distance between the average data representations of each class on a given client and their counterparts over all the other clients.
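A small NumPy sketch of Equation (1) follows. It assumes the per-class feature arrays have already been produced by the fixed encoder $f$ (an ImageNet-pretrained VGG16 in our experiments, e.g., the `features` module of `torchvision.models.vgg16`); the helper name and the `eps` guard against zero variance are our own.

```python
import numpy as np

def cni(client_feats, others_feats, eps=1e-8):
    """Client-Wise Non-IID Index per Equation (1). Both arguments map a
    class label to an (n_samples, d) array of encoder features: one for the
    client at hand, one pooled over all the other clients."""
    scores = []
    for c in client_feats:
        mu_k = client_feats[c].mean(axis=0)     # first-order moment on client k
        mu_rest = others_feats[c].mean(axis=0)  # counterpart over other clients
        sigma = np.concatenate([client_feats[c], others_feats[c]]).std(axis=0)
        scores.append(np.linalg.norm((mu_k - mu_rest) / (sigma + eps)))
    return float(np.mean(scores))
```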

5 Evaluation

5.1 Experiment Setup

In our experiments, we evaluate the performance of LotteryFL on our constructed non-IID datasets in terms of personalization and communication cost.

| Dataset | Setting | 5 samples/class | 10 samples/class | 20 samples/class |
|---|---|---|---|---|
| MNIST | IID | 5.55 | 4.33 | 3.37 |
| MNIST | 2-class & balanced | 7.38 | 6.14 | 5.59 |
| MNIST | 2-class & unbalanced(0.75) | 7.69 | 6.41 | 5.74 |
| MNIST | 2-class & unbalanced(0.5) | 8.01 | 6.71 | 5.89 |
| MNIST | 2-class & unbalanced(0.25) | 9.31 | 7.56 | 6.10 |
| CIFAR-10 | IID | 13.65 | 10.89 | 7.91 |
| CIFAR-10 | 2-class & balanced | 15.20 | 11.70 | 9.16 |
| CIFAR-10 | 2-class & unbalanced(0.75) | 16.85 | 12.43 | 9.45 |
| CIFAR-10 | 2-class & unbalanced(0.5) | 18.74 | 13.51 | 10.43 |
| CIFAR-10 | 2-class & unbalanced(0.25) | 22.90 | 17.14 | 12.29 |

Table 1: The CNI values on constructed non-IID datasets under different settings.

Non-IID datasets

To construct non-IID datasets based on MNIST and CIFAR-10, we transform these two datasets by applying the two configurations proposed in Section 4.1, denoted as 2-class & balanced and 2-class & unbalanced in our experiments. In addition, to accurately quantify the degree of unbalance between the two classes on a client, we define a balance rate as the ratio between the data volume of one class and that of the other class. We use the parenthesized number to indicate the balance rate; for example, 'unbalanced(0.25)' indicates that the data volume of one class is 25% of the other class. To explore the impact of unbalance on the model performance, we vary the balance rate over {0.25, 0.5, 0.75} when constructing non-IID datasets. We also vary the number of samples per class on a client over {5, 10, 20}. We assume the MNIST and CIFAR-10 datasets are distributed across 400 clients. Moreover, we transform EMNIST into a non-IID dataset by grouping the handwritten character images by writer; the EMNIST dataset is distributed across 2424 clients.

Compared methods

We compare LotteryFL with three methods: FedAvg McMahan et al. (2017), Standalone, and LG-FedAvg Liang et al. (2020). FedAvg is a classic FL approach: in each communication round, randomly selected clients download the latest global model from the server and train it using local data; the clients then upload the updated models to the server, which aggregates them. In Standalone, each client trains a model locally using only its own data. LG-FedAvg is a two-step FL approach to personalization, where a global model is learned first and each client then fine-tunes it using local data.

Parameter setting

We set the hyperparameters $E = 10$, $B = 32$, $r_p = 0.2$, and $acc_{threshold} = 0.5$ in Algorithm 1. The same settings are also applied to the compared methods where applicable. We vary the target pruning rate $r_{target}$ over {0.1, 0.3, 0.5}. The number of participating clients in each communication round is selected from {5, 10, 20}. The details of the model architectures are presented in our supplementary material.

Evaluation metric

We adopt the classification accuracy on each client's test data to evaluate the performance of personalization, and report the accuracy averaged over all clients. We use the volume of data communicated between the clients and the server to measure communication cost.

5.2 Comparison of CNI on Non-IID Datasets

We adopt VGG16 Simonyan and Zisserman (2014) pretrained on ImageNet Deng et al. (2009) as the fixed encoder $f$. Table 1 reports the CNI averaged over all clients on MNIST and CIFAR-10. As a comparison, we also show CNI values on IID datasets, where each client holds images from all classes and each class has the same number of samples. We observe that the CNI values of non-IID datasets are much higher than those of IID datasets. Moreover, as the balance rate decreases (i.e., a higher degree of unbalance), the CNI becomes larger.

5.3 Comparison of Personalization and Communication Efficiency

Impact of the number of participating clients per round

In this experiment, we evaluate LotteryFL on non-IID datasets under the 2-class & balanced setting, where each class consists of 20 samples. We train on MNIST for 400 communication rounds, and on CIFAR-10 and EMNIST for 2000 communication rounds. Table 2 shows the results, from which we make the following observations. First, LotteryFL achieves the best personalization with the lowest communication cost on all three non-IID datasets. For example, with 5 participating clients per round, LotteryFL(0.3) achieves an accuracy of 89.70% on the CIFAR-10 non-IID dataset, which is 14.18% and 43.5% higher than that of LG-FedAvg and FedAvg, respectively. Moreover, LG-FedAvg and FedAvg consume 1.38X and 1.81X the communication cost of LotteryFL, respectively. The Standalone method performs the worst in most cases due to insufficient training data on each client. Second, when more clients participate in each communication round, the personalization performance of all FL methods improves, because more participating clients contribute a larger amount of training data to FL.

| Dataset | Method | Acc (%) (5 clients) | Cost (MB) (5 clients) | Acc (%) (10 clients) | Cost (MB) (10 clients) | Acc (%) (20 clients) | Cost (MB) (20 clients) |
|---|---|---|---|---|---|---|---|
| MNIST | Standalone | 90.17 | 0 | 90.25 | 0 | 90.72 | 0 |
| MNIST | FedAvg | 96.02 | 165.94 | 96.14 | 331.89 | 96.46 | 663.76 |
| MNIST | LG-FedAvg | 96.74 | 131.00 | 97.76 | 262.01 | 97.87 | 524.02 |
| MNIST | LotteryFL(0.1) | 99.67 | 98.18 | 99.86 | 141.18 | 99.96 | 163.57 |
| MNIST | LotteryFL(0.3) | 99.93 | 98.91 | 99.94 | 140.30 | 99.95 | 230.17 |
| MNIST | LotteryFL(0.5) | 99.86 | 103.00 | 99.93 | 165.21 | 99.96 | 304.54 |
| CIFAR-10 | Standalone | 64.89 | 0 | 65.12 | 0 | 65.44 | 0 |
| CIFAR-10 | FedAvg | 46.20 | 2356.34 | 46.43 | 4712.68 | 47.67 | 9425.35 |
| CIFAR-10 | LG-FedAvg | 75.52 | 1793.64 | 75.89 | 3587.29 | 76.77 | 7174.58 |
| CIFAR-10 | LotteryFL(0.1) | 88.81 | 1142.45 | 90.04 | 1596.57 | 90.61 | 2439.56 |
| CIFAR-10 | LotteryFL(0.3) | 89.70 | 1298.19 | 90.45 | 1977.41 | 89.91 | 3560.58 |
| CIFAR-10 | LotteryFL(0.5) | 89.68 | 1448.24 | 90.30 | 2439.93 | 90.53 | 4501.36 |
| EMNIST | Standalone | 65.26 | 0 | 65.69 | 0 | 65.75 | 0 |
| EMNIST | FedAvg | 82.08 | 20551.83 | 82.44 | 41103.67 | 83.06 | 82207.34 |
| EMNIST | LG-FedAvg | 87.35 | 16030.43 | 87.47 | 32060.86 | 87.97 | 64121.72 |
| EMNIST | LotteryFL(0.1) | 92.38 | 10864.41 | 93.43 | 15273.75 | 94.27 | 19978.86 |
| EMNIST | LotteryFL(0.3) | 92.44 | 10986.62 | 93.45 | 17180.02 | 94.18 | 28351.86 |
| EMNIST | LotteryFL(0.5) | 92.56 | 11616.41 | 93.54 | 20327.13 | 94.90 | 37579.86 |

Table 2: Comparison of personalization and communication cost with different numbers of participating clients in each communication round.
| Dataset | Method | Acc (%) (5 samples/class) | Cost (MB) (5 samples/class) | Acc (%) (10 samples/class) | Cost (MB) (10 samples/class) | Acc (%) (20 samples/class) | Cost (MB) (20 samples/class) |
|---|---|---|---|---|---|---|---|
| MNIST | Standalone | 86.84 | 0 | 89.11 | 0 | 90.72 | 0 |
| MNIST | FedAvg | 94.06 | 663.76 | 94.16 | 663.76 | 96.46 | 663.76 |
| MNIST | LG-FedAvg | 96.35 | 524.02 | 96.98 | 524.02 | 97.87 | 524.02 |
| MNIST | LotteryFL(0.1) | 98.95 | 173.02 | 99.16 | 168.65 | 99.96 | 163.57 |
| MNIST | LotteryFL(0.3) | 99.42 | 240.73 | 99.59 | 236.92 | 99.95 | 230.17 |
| MNIST | LotteryFL(0.5) | 99.38 | 310.47 | 99.54 | 308.35 | 99.96 | 304.54 |
| CIFAR-10 | Standalone | 59.55 | 0 | 64.06 | 0 | 65.44 | 0 |
| CIFAR-10 | FedAvg | 37.62 | 9425.35 | 43.20 | 9425.35 | 47.67 | 9425.35 |
| CIFAR-10 | LG-FedAvg | 70.69 | 7174.58 | 72.09 | 7174.58 | 76.77 | 7174.58 |
| CIFAR-10 | LotteryFL(0.1) | 85.97 | 3832.02 | 87.31 | 3069.95 | 90.61 | 2439.56 |
| CIFAR-10 | LotteryFL(0.3) | 84.00 | 4951.63 | 87.89 | 3906.42 | 89.91 | 3560.58 |
| CIFAR-10 | LotteryFL(0.5) | 83.03 | 5739.96 | 88.17 | 4856.26 | 90.53 | 4501.36 |

Table 3: Comparison of personalization and communication cost with different numbers of samples for each class on the clients.

Impact of the data volume on each client

In this experiment, we evaluate the performance of LotteryFL with different numbers of samples per class on the clients, keeping 20 participating clients in each communication round. Table 3 shows the results. Similar to the results in Table 2, we observe that LotteryFL obtains the best personalization performance with the lowest communication cost. For example, when each client holds 20 samples per class, LotteryFL(0.1) reaches an accuracy of 90.61% on the CIFAR-10 non-IID dataset, while the accuracies of Standalone, FedAvg, and LG-FedAvg are 65.44%, 47.67%, and 76.77%, respectively. Meanwhile, FedAvg and LG-FedAvg consume 3.86X and 2.94X the communication cost of LotteryFL, respectively. Moreover, as the number of samples per class increases, the performance of LotteryFL also improves. For example, when we increase the number of samples per class from 5 to 10, the accuracy of LotteryFL(0.5) increases from 83.03% to 88.17% on the CIFAR-10 non-IID dataset. Another interesting observation is that the communication cost decreases as the number of samples per class increases. For example, the communication cost of LotteryFL(0.1) on the CIFAR-10 non-IID dataset is 3832.02 MB with 5 samples per class, but drops to 2439.56 MB with 20 samples per class. The reason is that more samples speed up the convergence of training, so the base model is pruned more frequently on each client; a more compact model generates less data to be communicated with the server.

Impact of the balance rate

Besides, we also explore the impact of the balance rate on the performance of LotteryFL. In this experiment, we keep the same settings as in the above two experiments but cap one class at 20 samples on each client. For instance, if the balance rate is 0.25, then one class contains 20 samples and the other contains 5. The results in Table 4 show that LotteryFL significantly improves personalization and communication efficiency together even under these challenging settings. In general, LotteryFL and the compared methods perform better on the balanced non-IID dataset than under the unbalanced settings; however, under all unbalanced settings, LotteryFL still outperforms the compared methods. For example, when the balance rate is 0.25 (the worst case), LotteryFL(0.5) reaches an accuracy of 85.29% on the CIFAR-10 non-IID dataset, which is 34.96%, 45.1%, and 16.26% higher than that achieved by Standalone, FedAvg, and LG-FedAvg, respectively. Meanwhile, the communication cost of LotteryFL(0.5) is about 49% and 32% lower than that of FedAvg and LG-FedAvg, respectively.

| Dataset | Method | Acc (%) (balanced) | Cost (MB) (balanced) | Acc (%) (unbal. 0.75) | Cost (MB) (unbal. 0.75) | Acc (%) (unbal. 0.5) | Cost (MB) (unbal. 0.5) | Acc (%) (unbal. 0.25) | Cost (MB) (unbal. 0.25) |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | Standalone | 90.72 | 0 | 88.78 | 0 | 88.04 | 0 | 61.82 | 0 |
| MNIST | FedAvg | 96.46 | 663.76 | 94.52 | 663.76 | 93.71 | 663.76 | 93.13 | 663.76 |
| MNIST | LG-FedAvg | 97.87 | 524.02 | 97.26 | 524.02 | 95.87 | 524.02 | 95.16 | 524.02 |
| MNIST | LotteryFL(0.1) | 99.96 | 163.57 | 99.13 | 170.40 | 98.75 | 175.22 | 98.27 | 183.61 |
| MNIST | LotteryFL(0.3) | 99.95 | 230.17 | 99.40 | 240.02 | 99.33 | 244.58 | 98.96 | 246.58 |
| MNIST | LotteryFL(0.5) | 99.96 | 304.54 | 99.55 | 311.64 | 99.33 | 315.20 | 99.15 | 319.03 |
| CIFAR-10 | Standalone | 65.44 | 0 | 58.25 | 0 | 55.60 | 0 | 50.33 | 0 |
| CIFAR-10 | FedAvg | 47.67 | 9425.35 | 44.12 | 9425.35 | 43.04 | 9425.35 | 40.19 | 9425.35 |
| CIFAR-10 | LG-FedAvg | 76.77 | 7174.58 | 75.19 | 7174.58 | 72.81 | 7174.58 | 69.03 | 7174.58 |
| CIFAR-10 | LotteryFL(0.1) | 90.61 | 2439.56 | 88.93 | 2591.93 | 88.53 | 2612.29 | 84.49 | 2973.22 |
| CIFAR-10 | LotteryFL(0.3) | 89.91 | 3560.58 | 88.71 | 3655.73 | 88.07 | 3781.99 | 85.25 | 3931.81 |
| CIFAR-10 | LotteryFL(0.5) | 90.53 | 4501.36 | 89.40 | 4683.48 | 87.42 | 4750.81 | 85.29 | 4848.59 |

Table 4: Comparison of personalization and communication cost with different balance rates.

5.4 Behavior of Personalization

To better understand how LotteryFL realizes personalization, we investigate the ratio of parameters that are unique to each client's personalized model. To this end, we define the parameters that are retained by fewer than 10% of clients as the personalized parameters, and we visualize the distribution of the personalized parameters in each client's personalized model. The results averaged over all participating clients are shown in Figure 2.

Figure 2: The distributions of personalized parameters across different layers for (a) MNIST and (b) CIFAR-10 ('2-class & balanced', 20 samples/class, 20 participating clients in each communication round).

As Figure 2 shows, with a more aggressive target pruning rate, the percentage of personalized parameters in each layer becomes higher. This phenomenon indicates that iterative pruning based on the Lottery Ticket hypothesis removes the commonly shared parameters of each layer but retains the personalized parameters that represent the features of the local data. For example, when we set the target pruning rate to 0.1, as many as 93% and 95% of the parameters are personalized in the first convolutional layer and the first fully connected layer, respectively, of models trained on the CIFAR-10 non-IID dataset. This also explains why the aggregation on the server does not degrade the performance of LotteryFL: only a small portion of the parameters overlap between different clients' LTNs.
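The statistic behind Figure 2 can be recovered directly from the clients' masks. Below is a minimal sketch under our assumptions: a retained parameter counts as personalized if fewer than 10% of clients keep it, and the ratio is computed per layer over all retained parameters; the helper is illustrative and reuses the per-layer mask dicts from the earlier sketches.

```python
import torch

def personalized_ratio(masks, threshold=0.1):
    """masks: list of per-client dicts mapping layer name -> 0/1 tensor.
    Returns, per layer, the fraction of retained parameters that are kept
    by fewer than `threshold` of the clients."""
    ratios = {}
    for layer in masks[0]:
        # how many clients keep each individual weight
        counts = torch.stack([m[layer] for m in masks]).sum(dim=0)
        kept = counts > 0
        personalized = (counts < threshold * len(masks)) & kept
        ratios[layer] = personalized.sum().item() / max(kept.sum().item(), 1)
    return ratios
```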

6 Conclusion

We design LotteryFL – a personalized and communication-efficient FL framework for non-IID settings, inspired by the Lottery Ticket hypothesis. We also construct and publish well-designed datasets to support FL under non-IID settings, which could facilitate research on robust FL in more practical environments. In addition, we define CNI, the first metric to quantitatively evaluate the degree of non-IID data distribution across clients. The experimental results on non-IID datasets demonstrate that LotteryFL significantly outperforms the three compared methods in terms of personalization and communication cost.

Broader Impact

Any organization or institution can be a client in FL. For example, hospitals hold a huge amount of patient data for intelligent healthcare but, under strict privacy regulations or ethical constraints, may be required to keep the data local. FL is a promising solution for such applications, as it enables collaborative learning without compromising privacy. However, some works Zhu et al. (2019); Zhao et al. (2020) have shown that it is feasible to recover private training data by eavesdropping on the data communicated between the clients and the server. Since only the parameters of each client's LTN are communicated in LotteryFL, and the LTN changes dynamically during the training process, LotteryFL offers a potential way to further enhance privacy against such attacks.

References

  • D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720. Cited by: §1, §2.
  • F. Chen, Z. Dong, Z. Li, and X. He (2018) Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876. Cited by: §1, §2.
  • G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik (2017) EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: §1, §4.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §5.2.
  • J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1, §3.
  • A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage (2018) Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604. Cited by: §1.
  • Y. He, Z. Shen, and P. Cui (2019) Towards non-iid image classification: a dataset and baselines. arXiv preprint arXiv:1906.02899. Cited by: §4.2.
  • N. Ivkin, D. Rothchild, E. Ullah, I. Stoica, R. Arora, et al. (2019) Communication-efficient distributed sgd with sketching. In Advances in Neural Information Processing Systems, pp. 13144–13154. Cited by: §2.
  • Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan (2019) Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488. Cited by: §1, §2.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §2.
  • M. Khodak, M. F. Balcan, and A. S. Talwalkar (2019) Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5915–5926. Cited by: §1, §2.
  • J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §1, §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §1, §4.1.
  • Y. LeCun, L. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, et al. (1995) Learning algorithms for classification: a comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective 261, pp. 276. Cited by: §1, §4.1.
  • T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2019) Federated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873. Cited by: §1, §2.
  • P. P. Liang, T. Liu, L. Ziyin, R. Salakhutdinov, and L. Morency (2020) Think locally, act globally: federated learning with local and global representations. arXiv preprint arXiv:2001.01523. Cited by: 3rd item, §1, §2, §5.1.
  • Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh (2020) Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619. Cited by: §1, §2.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: 3rd item, §1, §1, §2, §5.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.2.
  • V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. Cited by: §1, §2.
  • K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage (2019) Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252. Cited by: §1, §2.
  • V. Zantedeschi, A. Bellet, and M. Tommasi (2019) Fully decentralized joint learning of personalized models and collaboration graphs. Cited by: §1, §2.
  • B. Zhao, K. R. Mopuri, and H. Bilen (2020) IDLG: improved deep leakage from gradients. arXiv preprint arXiv:2001.02610. Cited by: Broader Impact.
  • L. Zhu, Z. Liu, and S. Han (2019) Deep leakage from gradients. In Advances in Neural Information Processing Systems, pp. 14747–14756. Cited by: Broader Impact.