Global Knowledge Distillation in Federated Learning

Knowledge distillation has recently attracted considerable attention in Federated Learning (FL). It allows FL to train across heterogeneous clients whose data differ in size and structure. However, data samples across devices are usually not independent and identically distributed (non-i.i.d.), which poses additional challenges to the convergence and speed of federated learning. Because FL randomly samples clients to join each training round and each client learns only from its local non-i.i.d. data, the learning process becomes even slower. An intuitive remedy is to use the global model to guide local training. In this paper, we propose a novel global knowledge distillation method, named FedGKD, which learns from past global models to tackle the local bias training problem. By learning from global knowledge while staying consistent with the current local models, FedGKD learns a global knowledge model in FL. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on CV datasets (CIFAR-10/100) under non-i.i.d. data settings. The evaluation results show that FedGKD outperforms previous state-of-the-art methods.



1 Introduction

Federated Learning (FL) is a widely used machine learning framework that learns a global model across multiple decentralized clients. The clients independently train local models on their own data and send the locally trained models to a server, which aggregates them into a global model without the clients sharing their data. Owing to its communication efficiency and privacy preservation, FL has shown potential in real-world applications such as healthcare, biometrics, computer vision, and natural language processing.

Federated training frameworks favor training methods that allow flexible local updating and low client participation. FedAvg aistats/McMahanMRHA17 is one classic FL training algorithm: it directly averages the participating clients’ model updates and sends the aggregated model back to the clients. A client can randomly choose whether to join a given round of training. This loose aggregation framework can speed up model convergence. The key challenges in federated learning are system heterogeneity (communication delays and varied devices) and data heterogeneity (differing data structures and non-i.i.d. data).

For the heterogeneous data structure problem, knowledge distillation is one solution in FL: the heterogeneous models on the clients can be aggregated into one model, with the global model acting as the student that learns from the multiple client models acting as teachers.

For the non-i.i.d. data problem, only a few efforts have been made, such as FedProx, SCAFFOLD, and FedNova. According to these studies, non-i.i.d. data settings can degrade the effectiveness of the whole learning process, because each client only finds a locally optimal model based on its own non-i.i.d. dataset. The aggregated global model cannot yield a good solution when the server does not receive updates from enough participating clients.

To overcome the above problem, we propose a novel global knowledge distillation method, named FedGKD, which learns from past global models to tackle the local bias training problem. By learning from global knowledge while staying consistent with the current local models, FedGKD learns a global knowledge model in FL.

Overall, the primary contributions of this paper are summarized as follows:

  • To the best of our knowledge, this paper is the first work that uses knowledge distillation to transfer information from historical global models into local model training, in order to prevent over-biased local training on non-i.i.d. data distributions.

  • We provide a projection-based generalization of federated knowledge distillation. In some experimental settings, projection further improves the performance of federated learning.

  • Extensive experiments and analysis on various datasets (i.e., CIFAR-10/100) and settings (i.e., non-i.i.d. data distributions) validate the effectiveness of the proposed FedGKD in comparison with several state-of-the-art methods.

2 Notations and Preliminaries

Federated Learning

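In general, FL aims to learn a global model weight $w$ that minimizes its risk on each of the user tasks mcmahan2017communication:

$$\min_{w} \; \mathbb{E}_{\mathcal{T}_k \in \mathcal{T}} \left[ \mathcal{L}_k(w) \right] \tag{1}$$

where $\mathcal{T} = \{\mathcal{T}_k\}_{k=1}^{K}$ is the collection of user tasks. In most cases, FL assumes the same task (i.e., $\mathcal{T}_1 = \dots = \mathcal{T}_K$) but with a different data distribution on each local side. Eq. 1 is then empirically optimized by $\min_{w} \frac{1}{K} \sum_{k=1}^{K} \hat{\mathcal{L}}_k(w)$, where $\hat{\mathcal{L}}_k$ is the empirical risk over each local dataset $\mathcal{D}_k$. In FL, a global dataset $\mathcal{D}$ is distributed across multiple local clients, with $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$.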
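As a minimal sketch of how the empirical objective above is commonly approached in FedAvg-style training, the server can average the weight vectors returned by the clients. The function name `fedavg_aggregate`, the flattened-list representation of weights, and the optional weighting by local dataset size are assumptions made for illustration; the text only states that the server aggregates the local models.

```python
from typing import List, Optional, Sequence


def fedavg_aggregate(client_weights: Sequence[Sequence[float]],
                     client_sizes: Optional[Sequence[int]] = None) -> List[float]:
    """Average flattened client weight vectors into one global vector.

    If client_sizes is given, each client is weighted by its share of the
    total number of samples; otherwise all clients count equally.
    """
    if not client_weights:
        raise ValueError("need at least one client update")
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)
    if len(client_sizes) != len(client_weights):
        raise ValueError("one size per client is required")

    total = float(sum(client_sizes))
    dim = len(client_weights[0])
    global_w = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            global_w[i] += (n / total) * w[i]
    return global_w


# Two hypothetical clients with two-parameter models.
updates = [[1.0, 2.0], [3.0, 4.0]]
uniform = fedavg_aggregate(updates)
weighted = fedavg_aggregate(updates, client_sizes=[1, 3])
```

With equal weighting the two clients contribute equally; with sizes 1 and 3 the second client dominates the average.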

Knowledge Distillation

KD refers to a teacher-student paradigm in which a cumbersome but powerful teacher model transfers its knowledge through distillation to a lightweight student model bucilu2006model. One of the most popular approaches minimizes the discrepancy between the logit outputs of the teacher model and the student model on a proxy dataset $\mathcal{D}_P$. The discrepancy can be measured by the Kullback-Leibler divergence:

$$\min_{w_S} \; \mathbb{E}_{x \sim \mathcal{D}_P} \left[ D_{\mathrm{KL}}\big( q_T(x) \,\|\, q_S(x) \big) \right] \tag{2}$$

where $q_T(x) = h(x; w_T)$ and $q_S(x) = h(x; w_S)$ are the outputs of a non-linear complex function $h$, such as logits, softmax outputs, etc.
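The distillation discrepancy described above can be illustrated with a short, dependency-free sketch. It assumes the model outputs are logits that are first turned into probabilities with a softmax, and it measures the Kullback-Leibler divergence from the teacher distribution to the student distribution. The helper names and the small `eps` smoothing constant are illustrative choices, not part of the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits: Sequence[float],
                      student_logits: Sequence[float]) -> float:
    """KL divergence between teacher and student softmax outputs."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# Identical outputs give zero loss; disagreeing outputs give a positive loss.
same = distillation_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
different = distillation_loss([2.0, 0.0, -1.0], [-1.0, 0.0, 2.0])
```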

The core idea of KD has been applied to FL to address the user heterogeneity problem, also known as the non-i.i.d. data distribution problem lin2020ensemble; chen2020fedbe. These approaches treat the local models as teachers and transfer their knowledge into a student (global) model to improve its generalization performance. However, these earlier works require a proxy dataset, which is impractical in most scenarios. We therefore propose a new learning approach that tackles the problem under more practical settings: no proxy dataset, no additional training on the server side, and no additional privacy leakage beyond sharing local gradients.

3 Local Training via Global Knowledge Distillation

In this section, we introduce the proposed Federated Local Training via Global Knowledge Distillation (FedGKD) and its key features.

Notations. Total number of clients $K$, server $S$, total communication rounds $T$, local epochs $E$, fraction of participating clients $C$, learning rate $\eta$; $\mathcal{B}$ is a set holding a client’s data sliced into batches of size $B$.

// On the server side
procedure ServerExecution    // Server Model Aggregation
    Initialize global weight $w_0$
    for each communication round $t = 1, \dots, T$ do
        Send the global weight $w_{t-1}$ to all clients
        Randomly sample a set of clients
        for each sampled client $k$ in parallel do
            $w_t^k \leftarrow$ ClientUpdate($k$, $w_{t-1}$)
        Aggregate the returned local weights $w_t^k$ into the global weight $w_t$

// On the data owners side
procedure ClientUpdate($k$, $w$)    // Local Model Training
    for each local epoch $e = 1, \dots, E$ do
        for each batch $b \in \mathcal{B}$ do
            Update local weights $w$ via Eq. 3
    Return $w$ to the server

Algorithm 1 FedGKD: Global Knowledge Distillation in Federated Learning
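The server-side loop of Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: weights are flat lists of floats, aggregation is a plain average over the sampled clients, and `run_rounds` and `toy_client_update` are hypothetical names. The toy client simply moves halfway toward a private target value and stands in for the real local training step.

```python
import random
from typing import Callable, List, Sequence

Weights = List[float]


def run_rounds(init_w: Sequence[float],
               client_update: Callable[[int, Weights], Weights],
               num_clients: int, rounds: int, fraction: float,
               seed: int = 0) -> Weights:
    """Run a simple federated loop: sample clients, train locally, average."""
    rng = random.Random(seed)
    w = list(init_w)
    for _ in range(rounds):
        # Sample a fraction of the clients, always at least one.
        m = max(1, int(round(fraction * num_clients)))
        selected = rng.sample(range(num_clients), m)
        # Each selected client starts from a copy of the current global weight.
        local = [client_update(k, list(w)) for k in selected]
        # Plain average of the returned local weights (an assumption here).
        w = [sum(lw[i] for lw in local) / len(local) for i in range(len(w))]
    return w


# Toy clients: each one pulls the weight halfway toward its own target.
targets = [0.0, 2.0]


def toy_client_update(k: int, w: Weights) -> Weights:
    return [0.5 * (wi + targets[k]) for wi in w]


final_w = run_rounds([0.0], toy_client_update,
                     num_clients=2, rounds=20, fraction=1.0)
```

With full participation the averaged weight settles near the mean of the two client targets, which is the behaviour one would expect from an unbiased aggregation on this toy problem.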

FedGKD is proposed to transfer global knowledge to local model training when each client’s data follows a non-i.i.d. distribution. Because of this naturally non-i.i.d. distribution, local training is strongly biased toward each client’s own dataset, which increases the complexity and difficulty of global aggregation. To mitigate the data and information gaps between local models caused by this bias, we borrow an idea from knowledge distillation: transfer the information in past global models to local model training to prevent over-biased local training. Mathematically, we formulate the loss of FedGKD as a linear combination of the original cross-entropy loss and the distillation losses:



$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}(p, y) + \sum_{m=1}^{M} \lambda_m \, D_{\mathrm{KL}}\big( p_{t-m} \,\|\, p \big) \tag{3}$$

where $p$ is the softmax probability of each local client model, $p_{t-m}$ is the softmax probability of the former global model at round $t-m$, $y$ is the one-hot label, and $\lambda_m$ are the coefficients of the linear combination. The distillation loss is defined by the Kullback-Leibler divergence, as in Eq. 2. Note that in our experimental evaluation, we find that using only the global model from the last round already achieves good performance and makes the hyperparameters easier to set. However, we believe that adaptive global model selection approaches could improve performance after the model has converged.
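To make the local objective concrete, the following sketch computes a per-sample loss as cross-entropy plus weighted KL terms that pull the local prediction toward the predictions of past global models. It is an illustration under assumptions, not the authors' implementation: the function name `fedgkd_local_loss`, the coefficient list `coeffs`, the use of an integer class index in place of a one-hot label, and the direction of the KL term (global model as teacher) are all choices made here for clarity.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def fedgkd_local_loss(local_logits: Sequence[float],
                      past_global_logits: Sequence[Sequence[float]],
                      label: int,
                      coeffs: Sequence[float]) -> float:
    """Cross-entropy on the label plus weighted KL to past global outputs."""
    if len(past_global_logits) != len(coeffs):
        raise ValueError("one coefficient per past global model is required")
    p_local = softmax(local_logits)
    loss = -math.log(p_local[label] + 1e-12)  # cross-entropy term
    for c, g_logits in zip(coeffs, past_global_logits):
        # Past global model acts as teacher, local model as student.
        loss += c * kl_divergence(softmax(g_logits), p_local)
    return loss


local = [2.0, 0.5, -1.0]
last_global = [1.0, 1.0, 0.0]
ce_only = fedgkd_local_loss(local, [], label=0, coeffs=[])
with_kd = fedgkd_local_loss(local, [last_global], label=0, coeffs=[0.5])
```

Passing a single entry in `past_global_logits` corresponds to the setting noted in the text, where only the global model from the last round is used.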

4 Experiment

4.1 Experiment Setup

We evaluate different state-of-the-art FL methods on CIFAR-10 and CIFAR-100 in different FL scenarios. ResNet-8 is used to validate the methods. Because Batch Normalization (BN) fails on heterogeneous training data, we replace BN with Group Normalization (GN) to produce more stable results and alleviate some of the quality loss caused by BN.

Following prior work, we use the Dirichlet distribution to create disjoint non-i.i.d. client training data. The value of $\alpha$ controls the degree of non-i.i.d.-ness: a smaller $\alpha$ indicates higher data heterogeneity.
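One common way to realize such a Dirichlet split is to draw, for every class, a vector of client proportions from a Dirichlet distribution and to cut that class's sample indices accordingly. The sketch below follows this recipe using only the standard library (a Dirichlet draw is obtained by normalizing independent gamma samples). The function name `dirichlet_partition` and its exact cutting rule are assumptions; the paper does not spell out its partitioning code.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_partition(labels: Sequence[int], num_clients: int,
                        alpha: float, seed: int = 0) -> List[List[int]]:
    """Split sample indices into disjoint client shards, class by class."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Dirichlet(alpha) sample via normalized gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total == 0.0:  # guard against numerical underflow
            gammas, total = [1.0] * num_clients, float(num_clients)
        start, cum = 0, 0.0
        for k in range(num_clients):
            cum += gammas[k] / total
            if k == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = min(len(idxs), int(round(cum * len(idxs))))
            end = max(end, start)
            clients[k].extend(idxs[start:end])
            start = end
    return clients


# Toy labels: 1000 samples spread evenly over 10 classes.
labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1, seed=1)
sizes = [len(p) for p in parts]
```

Smaller values of `alpha` concentrate each class on fewer clients, which matches the statement that a smaller value means higher heterogeneity.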

Figure 1: Illustration of # of samples per class allocated to each client
Figure 2: Illustration of convergence rate

Dataset and Evaluation Metrics

We evaluate different SOTA FL methods on a CV task using the ResNet architecture cvpr/HeZRS16. We consider federated learning on CIFAR-10/100 krizhevsky2009learning. We compare the following methods:


  • FedAvg: FedAvg aistats/McMahanMRHA17 is a classic federated learning algorithm that directly uses averaging as its aggregation method.

  • FedProx: FedProx li2018federated improves the local objective of FedAvg. It directly limits the size of local updates to address the non-i.i.d. problem.

  • MOON: MOON li2021model is a contrastive federated learning method.

  • FedGKD: The proposed method in Algorithm 1.

Implementation Details

The FL algorithm randomly samples a fraction of clients per communication round for local training. The optimizer is SGD with a learning rate of 0.05, a weight decay of 1e-5, and a momentum of 0.9.

4.2 Evaluation on the Common Federated Learning Settings


As shown in Table 1 and Table 2, the proposed method FedGKD outperforms previous state-of-the-art methods. FedGKD shows its clearest advantage when the participation ratio is low: at C=0.2 in Table 1, its final accuracy is 78.17 compared with 74.70 for FedAvg. This is because, when participants are few, the locally optimal models greatly affect the global aggregation. This characteristic helps in real FL scenarios.

| Method | C=0.2 Best | C=0.2 Final | C=0.4 Best | C=0.4 Final | C=0.6 Best | C=0.6 Final |
| FedAvg | 77.30 ± 0.72 | 74.70 ± 1.09 | 78.85 ± 0.32 | 77.28 ± 0.42 | 79.51 ± 0.34 | 79.24 ± 0.34 |
| FedProx | 76.39 ± 0.59 | 75.45 ± 0.71 | 77.45 | 76.35 ± 0.68 | 77.68 ± 0.15 | 77.46 ± 0.28 |
| MOON | 78.60 ± 0.77 | 76.27 ± 1.61 | 79.03 ± 0.14 | 77.10 ± 0.63 | 79.49 ± 0.53 | 79.45 ± 0.45 |
| FedGKD | 79.45 | 78.17 ± 0.38 | 79.36 ± 0.72 | 78.06 ± 0.45 | 79.6 ± 0.10 | 79.39 ± 0.18 |
| FedGKD-PJG | 78.58 ± 0.68 | 77.95 ± 0.66 | 79.38 ± 0.33 | 78.73 ± 0.40 | 79.8 ± 0.26 | 79.72 ± 0.29 |

Table 1: Comparison of different FL methods under different sampling ratios, evaluated on CIFAR-10 with ResNet-8

| Dataset | Setting | FedAvg | FedProx | MOON | FedGKD |
| CIFAR-10 | α=1 | 74.70 ± 1.09 | 75.45 ± 0.71 | 76.27 ± 1.61 | 77.95 ± 0.66 |
| CIFAR-10 | α=0.1 | 41.94 ± 4.59 | 63.43 ± 0.36 | 45.67 ± 5.02 | 64.00 ± 2.83 |
| CIFAR-100 | α=1 | 41.70 ± 0.58 | 41.23 ± 0.66 | 39.83 ± 0.96 | 42.33 ± 0.64 |
| CIFAR-100 | α=0.1 | 30.89 ± 0.91 | 31.32 ± 0.41 | 27.75 ± 0.41 | 33.63 ± 0.36 |

Table 2: Performance overview given different data settings for ResNet-8 on CIFAR (20 clients with C=0.2, 100 communication rounds, and 20 local epochs per round)

5 Related Work

In this section, we review related work from the perspectives of federated learning and knowledge distillation.

Federated Learning

FL distributes the overhead of training massive machine learning models across a set of low-computation edge resources and has emerged as a promising alternative ML paradigm mcmahan2016communication; mcmahan2016federated. FL can also jointly train a machine learning model without clients sharing their private training data mcmahan2016communication; mcmahan2016federated; bonawitz2017practical, so data privacy is preserved. One critical problem in FL is handling unbalanced and non-independent and identically distributed (non-i.i.d.) data, which are very common in the real world zhao2018federated; sattler2019robust; li2019convergence. At the application level, FL has been applied to a wide range of real applications such as next-word prediction mcmahan2017communication; mcmahan2018learning, visual object detection for safety FedVision, entity resolution hardy2017private, and medical prediction xu2021fedmood.

Knowledge Distillation

Knowledge distillation bucilu2006model; hinton2015distilling was originally proposed for model compression by mimicking the predictions of a large target deep neural network. The learning process is essentially analogous to human learning, i.e., a large “teacher” model transfers its knowledge to a small “student” model. Recently, many studies have worked on transferring multiple teachers’ knowledge to a single student model hamm2016learning; papernot2017semi. Beyond teacher-student learning, knowledge distillation has also been extended to other learning settings and tasks, such as mutual learning zhang2018deep, assistant teaching mirzadeh2020improved, lifelong learning zhai2019lifelong, and self-learning yuan2019revisit.

6 Conclusion

In this paper, we have proposed FedGKD, a novel global knowledge distillation method that uses the knowledge learned from past global models to mitigate the local bias issue. In particular, FedGKD learns a global knowledge model in FL by learning from global knowledge while staying consistent with the current local models. To validate the effectiveness of the proposed method, we have conducted extensive experiments on the CIFAR-10/100 datasets under non-i.i.d. settings. Experimental results show that FedGKD performs better than several state-of-the-art methods.

