Federated Learning (FL) is a widely used machine learning framework that learns a global model across multiple decentralized clients. The clients independently train local models on their own data and send the trained models to a server, which aggregates them into a global model without the clients sharing their data. Owing to its communication efficiency and privacy preservation, FL has shown potential in real-world applications such as healthcare, biometrics, computer vision, and natural language processing.
Under the federated training framework, training methods should allow for flexible local updating and low client participation. FedAvg aistats/McMahanMRHA17 is a classic FL training algorithm that directly averages the participating clients’ model updates and sends the aggregated model back to the clients. Each client can randomly choose whether to join a given round of training. This loose aggregation framework can speed up model convergence. The key challenges in federated learning are system heterogeneity (communication delays and varied devices) and data heterogeneity (differing data structures and non-i.i.d. data).
For the problem of heterogeneous data structures, knowledge distillation is one solution in FL: the heterogeneous models on the clients can be aggregated into one model. The global model, as the student, learns knowledge from the multiple clients’ models, which act as teachers.
For the non-i.i.d. data problem, only a few efforts have been made, such as FedProx, SCAFFOLD, and FedNova. According to these studies, non-i.i.d. data settings can degrade the effectiveness of the learning process. This is because each client only finds a locally optimal model based on its local non-i.i.d. dataset. The aggregated global model cannot provide a good solution when the server does not receive updates from enough participating clients.
To overcome the above problem, we propose a novel global knowledge distillation method, named FedGKD, which learns from the knowledge of past global models to tackle the problem of biased local training. By learning from global knowledge while staying consistent with the current local models, FedGKD learns a global knowledge model in FL.
Overall, the primary contributions of this paper are summarized as follows:
To the best of our knowledge, this paper is the first work that uses the knowledge distillation technique to transfer information from historical global models to local model training, preventing over-biased local training on non-i.i.d. data distributions.
We provide a generalization method for federated knowledge distillation based on projection. In some experimental settings, projection further helps to improve the performance of federated learning.
Extensive experiments and analysis on various datasets (i.e., CIFAR-10/100) and settings (i.e., non-i.i.d. data distributions) validate the effectiveness of the proposed FedGKD in comparison with several state-of-the-art methods.
2 Notations and Preliminaries
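In general, FL aims to learn a global model weight $w$ that minimizes its risk on each of the user tasks mcmahan2017communication:
$$\min_{w} \; \mathbb{E}_{\mathcal{T}_k \in \mathcal{T}} \left[ \mathcal{L}_k(w) \right] \qquad (1)$$
where $\mathcal{T} = \{\mathcal{T}_k\}_{k=1}^{K}$ is the collection of user tasks. In most cases, FL assumes the same task (i.e., $\mathcal{T}_1 = \dots = \mathcal{T}_K$) but with a different data distribution on each local side. Eq. 1 is then empirically optimized by $\min_{w} \frac{1}{K} \sum_{k=1}^{K} \hat{\mathcal{L}}_k(w)$, where $\hat{\mathcal{L}}_k(w)$ is the empirical risk over each local dataset $\mathcal{D}_k$. In FL, the global data $\mathcal{D}$ is distributed to multiple local clients, with $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$.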
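As a minimal sketch of how the empirical objective above is commonly optimized, the server-side averaging step of a FedAvg-style round might look as follows. The flat dictionary parameter format, the function name `fedavg_aggregate`, and the weighting by local sample counts are assumptions made for illustration; with equal sample counts the weighting reduces to a plain mean.

```python
from typing import Dict, List

# Hypothetical parameter format: layer name -> flat list of floats.
Params = Dict[str, List[float]]


def fedavg_aggregate(client_params: List[Params], num_samples: List[int]) -> Params:
    """Average client parameters, weighting each client by its local sample count."""
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need one sample count per participating client")
    total = float(sum(num_samples))
    aggregated: Params = {}
    for name in client_params[0]:
        size = len(client_params[0][name])
        aggregated[name] = [
            sum(n * params[name][i] for params, n in zip(client_params, num_samples)) / total
            for i in range(size)
        ]
    return aggregated


# Two participating clients; the second holds three times as much data.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
global_params = fedavg_aggregate(clients, num_samples=[1, 3])
print(global_params)
```

Only the clients that joined the round appear in `client_params`, which mirrors the partial participation discussed in the introduction.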
KD refers to a teacher-student paradigm in which a cumbersome but powerful teacher model transfers its knowledge through distillation to a lightweight student model buci2006model. One of the most popular approaches minimizes the discrepancy between the logit outputs of the teacher model and the student model on a proxy dataset hinton2015distilling. The discrepancy can be measured by the Kullback-Leibler divergence:
$$\mathcal{L}_{KD} = D_{\mathrm{KL}}\left( z_T \,\|\, z_S \right) \qquad (2)$$
where $z_T$ and $z_S$ are the teacher and student outputs, each produced by a complex non-linear function $f$, such as logits, softmax outputs, etc.
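To make the distillation discrepancy concrete, the following is a small sketch that computes the Kullback-Leibler divergence between the softmax outputs of a teacher and a student. The example logits, the helper names, and the teacher-to-student direction of the divergence are illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical logits for a three-class problem.
teacher_probs = softmax([2.0, 1.0, 0.1])
student_probs = softmax([1.5, 1.2, 0.3])
kd_loss = kl_divergence(teacher_probs, student_probs)
print(kd_loss)
```

The divergence is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions move apart.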
The core idea of KD has been used in FL to solve the user heterogeneity problem, also known as the non-i.i.d. data distribution problem lin2020ensemble; chen2020fedbe. These approaches treat the local models as teachers and transfer their knowledge into a student (global) model to improve its generalization performance. However, these earlier works require a proxy dataset, which is not practical in most scenarios. We therefore propose a new learning approach that tackles the problem in more practical settings: no proxy dataset, no additional training on the global side, and no additional privacy leakage beyond sharing local gradients.
3 Local Training via Global Knowledge Distillation
In this section, we introduce the proposed Federated Local Training via Global Knowledge Distillation (FedGKD) and its key features.
FedGKD is proposed for transferring global knowledge to local model training when each client’s data follows a non-i.i.d. distribution. Because of the naturally non-i.i.d. data distribution, local model training is strongly biased toward the client’s own dataset, which makes global aggregation more complex and difficult. To mitigate the data and information gaps between local models caused by this data bias, we draw on knowledge distillation to transfer information from past global models to local model training, preventing over-biased local training. Mathematically, we formulate the loss of FedGKD as a linear combination of the original cross-entropy loss and the distillation losses:
$$\mathcal{L} = \mathcal{L}_{CE}(p, y) + \sum_{m=1}^{M} \gamma_m \, D_{\mathrm{KL}}\left( p^{g}_{t-m} \,\|\, p \right) \qquad (3)$$
where $p$ is the softmax probability of each local client model, $p^{g}_{t-m}$ is the softmax probability of the former global model at round $t-m$, $y$ is the one-hot label, $M$ is the number of past global models used, and $\gamma_m$ weights the corresponding distillation term. The distillation loss is defined by the Kullback-Leibler divergence, as in Eq. 2. Note that, in our experimental evaluation, we find that using only the global model from the last round already achieves good performance and makes the hyperparameters easier to set. However, we believe that adaptive global model selection approaches could improve performance after the model has converged.
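The local objective described above can be sketched for a single training example as follows. This is an illustrative approximation only: the function `fedgkd_local_loss`, the weight value 0.2, the example logits, and the teacher-to-student direction of the divergence are assumptions, and the paper's actual implementation may differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int, eps: float = 1e-12) -> float:
    """Cross-entropy against a one-hot label given as a class index."""
    return -math.log(probs[label] + eps)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def fedgkd_local_loss(
    local_logits: Sequence[float],
    past_global_logits: Sequence[Sequence[float]],
    label: int,
    kd_weights: Sequence[float],
) -> float:
    """Cross-entropy plus a weighted sum of distillation terms, one per past global model."""
    p_local = softmax(local_logits)
    loss = cross_entropy(p_local, label)
    for weight, global_logits in zip(kd_weights, past_global_logits):
        # Past global model acts as teacher, local model as student (assumed direction).
        loss += weight * kl_divergence(softmax(global_logits), p_local)
    return loss


# Using only the last-round global model, as in the simplest setting described above.
local_logits = [2.0, 0.5, -1.0]
last_global_logits = [1.0, 1.0, 0.0]
loss = fedgkd_local_loss(local_logits, [last_global_logits], label=0, kd_weights=[0.2])
print(loss)
```

Passing more than one entry in `past_global_logits` and `kd_weights` corresponds to distilling from several historical global models at once.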
4.1 Experiment Setup
We evaluate different state-of-the-art FL methods on CIFAR-10 and CIFAR-100 in different FL scenarios. ResNet-8 is used to validate the methods. Since Batch Normalization (BN) fails on heterogeneous training data, we replace BN with Group Normalization (GN) to produce more stable results and alleviate some of the quality loss caused by BN.
We follow prior work in using the Dirichlet distribution to create disjoint non-i.i.d. client training data. The value of $\alpha$ controls the degree of non-i.i.d.-ness: a smaller $\alpha$ indicates higher data heterogeneity.
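One common way to realize such a split is to draw, for each class, client proportions from a Dirichlet distribution and divide that class's samples accordingly. The sketch below follows this recipe using only the Python standard library; the function names, the per-class strategy, and the toy label list are assumptions, since the text does not spell out the exact procedure.

```python
import random
from typing import Dict, List


def sample_dirichlet(alpha: float, size: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for very small alpha
        return [1.0 / size] * size
    return [d / total for d in draws]


def dirichlet_partition(
    labels: List[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Split sample indices into disjoint client shards with label skew set by alpha."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    client_indices: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        proportions = sample_dirichlet(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for client in range(num_clients):
            cumulative += proportions[client]
            if client == num_clients - 1:
                end = len(idxs)  # last client takes whatever remains
            else:
                end = max(start, min(len(idxs), round(cumulative * len(idxs))))
            client_indices[client].extend(idxs[start:end])
            start = end
    return client_indices


# Toy example: 200 samples over 10 classes, split across 5 clients.
toy_labels = [i % 10 for i in range(200)]
shards = dirichlet_partition(toy_labels, num_clients=5, alpha=0.1, seed=0)
print([len(s) for s in shards])
```

With a small `alpha`, each class tends to concentrate on a few clients; with a large `alpha`, the shards approach an i.i.d. split.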
Dataset and Evaluation Metrics
We evaluate the learning of different SOTA FL methods on a CV task using ResNet architectures cvpr/HeZRS16. We consider federated learning on CIFAR-10/100 krizhevsky2009learning.
FedAvg: FedAvg aistats/McMahanMRHA17 is a classic federated learning algorithm that directly uses averaging as its aggregation method.
FedProx: FedProx li2018federated improves the local objective of FedAvg. It directly limits the size of local updates to address the non-i.i.d. problem.
MOON: MOON li2021model is a contrastive federated learning method.
FedGKD: The proposed method in Algorithm 1.
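For comparison with the baselines listed above, the way FedProx limits local updates can be sketched as a proximal penalty added to the local loss, of the form (mu / 2) * ||w - w_global||^2. The flat weight lists and the value of `mu` below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def fedprox_penalty(
    local_weights: Sequence[float], global_weights: Sequence[float], mu: float
) -> float:
    """Proximal term (mu / 2) * ||w - w_global||^2 that discourages large local drift."""
    squared_distance = sum((w - g) ** 2 for w, g in zip(local_weights, global_weights))
    return 0.5 * mu * squared_distance


def fedprox_local_loss(
    task_loss: float,
    local_weights: Sequence[float],
    global_weights: Sequence[float],
    mu: float,
) -> float:
    """Local objective: the ordinary task loss plus the proximal penalty."""
    return task_loss + fedprox_penalty(local_weights, global_weights, mu)


# Hypothetical values: the local model has drifted from the global one.
penalty = fedprox_penalty([1.0, 2.0], [0.0, 0.0], mu=0.1)
total_loss = fedprox_local_loss(1.5, [1.0, 2.0], [0.0, 0.0], mu=0.1)
print(penalty, total_loss)
```

Whereas this penalty acts on the distance between weights, FedGKD regularizes local training through the outputs of past global models.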
The FL algorithm randomly samples a fraction of clients per communication round for local training. The optimizer is SGD with a learning rate of 0.05, a weight decay of 1e-5, and a momentum of 0.9.
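The training configuration above can be illustrated with a short sketch of per-round client sampling and a single SGD step with momentum and weight decay. The update rule follows the convention in which weight decay is added to the gradient before the momentum update; this convention, the sampling fraction of 0.2, and the client count of 10 are assumptions, as the text does not state them.

```python
import random
from typing import List, Tuple

LEARNING_RATE = 0.05
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-5


def sample_clients(num_clients: int, fraction: float, rng: random.Random) -> List[int]:
    """Pick a random subset of clients for one communication round."""
    count = max(1, round(fraction * num_clients))
    return sorted(rng.sample(range(num_clients), count))


def sgd_momentum_step(
    weights: List[float], grads: List[float], velocity: List[float]
) -> Tuple[List[float], List[float]]:
    """One SGD step: decay is folded into the gradient, then momentum is applied."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + WEIGHT_DECAY * w
        v = MOMENTUM * v + g
        new_velocity.append(v)
        new_weights.append(w - LEARNING_RATE * v)
    return new_weights, new_velocity


rng = random.Random(0)
selected = sample_clients(num_clients=10, fraction=0.2, rng=rng)  # hypothetical fraction
weights, velocity = sgd_momentum_step([1.0, -2.0], [0.5, 0.1], [0.0, 0.0])
print(selected, weights)
```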
4.2 Evaluation on the Common Federated Learning Settings
As shown in Table 1 and Table 2, the proposed method FedGKD outperforms previous state-of-the-art methods. FedGKD shows particular superiority when the participation ratio is low. This is because, when participants are few, the locally optimal models greatly affect the global aggregation. This characteristic is helpful in real FL scenarios.
| $\alpha$=0.1 | 30.89 ± 0.91 | 31.32 ± 0.41 | 27.75 ± 0.41 | 33.63 ± 0.36 |
5 Related Work
In this section, we investigate several related works from the perspective of federated learning and knowledge distillation.
FL, which distributes the overhead of training massive machine learning models across a set of edge computing resources with limited computational power, has emerged as a promising alternative ML paradigm mcmahan2016communication; mcmahan2016federated. Meanwhile, FL allows clients to jointly train a machine learning model without sharing their private training data mcmahan2016communication; mcmahan2016federated; bonawitz2017practical, thereby preserving data privacy. One critical problem in FL is handling unbalanced and non-independent and identically distributed (non-I.I.D.) data, which are very common in the real world zhao2018federated; sattler2019robust; li2019convergence. At the application level, FL has been applied to a wide range of real applications, such as next-word prediction mcmahan2017communication; mcmahan2018learning, visual object detection for safety FedVision, entity resolution hardy2017private, and medical prediction xu2021fedmood.
Knowledge distillation bucilu2006model; hinton2015distilling was originally proposed for model compression by mimicking the predictions of a large target deep neural network. The learning process is essentially analogous to human learning, i.e., a large “teacher” model transfers its knowledge to a small “student” model. Recently, many studies have started to transfer multiple teachers’ knowledge to a single student model hamm2016learning; papernot2017semi. In addition to teacher-student learning, knowledge distillation has also been extended to other learning paradigms and tasks, such as mutual learning zhang2018deep, assistant teaching mirzadeh2020improved, lifelong learning zhai2019lifelong, and self-learning yuan2019revisit.
6 Conclusion
In this paper, we have proposed FedGKD, a novel global knowledge distillation method that utilizes the knowledge learnt from past global models to mitigate the local bias issue. In particular, FedGKD learns a global knowledge model in FL by learning from global knowledge while remaining consistent with the current local models. To validate the effectiveness of the proposed method, we have conducted extensive experiments on the CIFAR-10/100 datasets under non-i.i.d. settings. Experimental results show that FedGKD performs better than several state-of-the-art methods.