1 Introduction
Federated learning (FL) is a promising solution for large-scale collaborative learning, in which clients jointly train a machine learning model while maintaining local data privacy [26, 37, 20]. However, in real-world FL deployments over heterogeneous networks, clients may differ in their characteristics, owing to diverse annotator skill, bias, and hardware reliability [4, 38]. Client data is rarely IID and is frequently imbalanced. Moreover, some clients may have clean data, while others may have label noise at varying noise levels. Hence, practical FL systems face challenges brought by discrepancies in two aspects: (i) local data statistics [26, 21, 13, 5], and (ii) local label quality [4, 38]. Although recent works have explored the discrepancy in local data statistics in FL, as well as learning with label noise in centralized learning (CL), there is at present no unified approach for tackling both challenges simultaneously in FL.
The first challenge has been explored in recent FL works, with a focus on performance with convergence guarantees [22, 27]. However, these works share the implicit assumption that the given labels of local data are completely correct, which is rarely the case in real-world datasets.
The second challenge can be addressed by re-weighting [7, 4, 31] or discarding [36] those client updates that are most dissimilar. In these methods, the corresponding clients are primarily treated as malicious agents. However, dissimilar clients are not necessarily malicious; they could instead have label noise in their local data, which would still be useful after label correction. For FL systems, the requirement of data privacy poses an inherent challenge for any label correction scheme: how can clients identify their noisy labels to be corrected, without requiring other clients to reveal sensitive information? For example, [38] proposes label correction for identified noisy clients with the guidance of extra data feature information exchanged between clients and server, which may lead to privacy concerns.
† This work is supported by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-RP-2019-015), and under its NRFF Program (NRFFAI1-2019-0005). This work is also supported in part by the SUTD Growth Plan Grant for AI.
Label correction and, more generally, methods for dealing with label noise are well-studied in CL. Yet even state-of-the-art CL methods for tackling label noise [18, 3, 30, 8, 9, 40, 33, 35], when applied to local clients, are inadequate for mitigating the performance degradation in the FL setting, due to the limited sizes of local datasets. These CL methods cannot be applied on the global server or across multiple clients due to FL privacy requirements. It is therefore necessary and natural to adopt a more general framework that jointly considers the two discrepancies, for a better emulation of real-world data heterogeneity. Most importantly, privacy-preserving label correction should be incorporated in training to improve robustness to data heterogeneity in FL.
In this paper, we propose a multi-stage FL framework to simultaneously deal with both discrepancy challenges; see Fig. 1 for an overview. To ensure privacy, we introduce a dimensionality-based filter to identify noisy clients, by measuring the local intrinsic dimensionality (LID) [11] of local model prediction subspaces. Extensive experiments have shown that clean datasets can be distinguished from noisy datasets by the behavior of LID scores during training [25, 24]. Hence, in addition to the usual local weight updates, we propose that each client also sends an LID score to the server, which is a single scalar representing the discriminability of the predictions of the local model. We then filter noisy samples based on per-sample training losses independently for each identified noisy client, and relabel the large-loss samples with the predicted labels of the global model. To improve training stability and alleviate the negative impact caused by noisy clients, we introduce a weighted proximal regularization term, where the weights are based on the estimated local noise levels. Furthermore, we fine-tune the global model on the identified clean clients and relabel the local data of the remaining noisy clients.
Our main contributions are as follows:

We propose a general multi-stage FL framework FedCorr to tackle data heterogeneity, with respect to both local label quality and local data statistics.

We propose a general framework for the easy generation of federated synthetic label noise and diverse (e.g. non-IID) client data partitions.

We identify noisy clients via LID scores, and identify noisy labels via per-sample losses. We also propose an adaptive local proximal regularization term based on estimated local noise levels.

We demonstrate that FedCorr outperforms state-of-the-art FL methods on multiple datasets with different noise levels, for both IID and non-IID data partitions.
2 Related work
2.1 Federated methods
In this paper, we focus on three closely related aspects of FL: the generation of non-IID federated datasets, methods for dealing with non-IID local data, and methods for robust FL.
The generation of non-IID local data partitions for FL was first explored in [26], based on dividing a given dataset into shards. More recent non-IID data partitions are generated via Dirichlet distributions [13, 1, 31].
Recent federated optimization works mostly focus on dealing with the discrepancy in the data statistics of local clients and related inconsistency issues [21, 32, 1]. For instance, FedProx deals with non-IID local data by including a proximal term in the local loss functions [21], while FedDyn uses a dynamic proximal term based on selected clients [1]. SCAFFOLD [15] is another method suitable for non-IID local data; it uses control variates to reduce client drift. In [13] and [27], adaptive FL optimization methods for the global server are introduced, which are compatible with non-IID data distributions. Moreover, the Power-of-Choice (PoC) strategy [6], a biased client selection scheme that selects clients with higher local losses, can be used to increase the rate of convergence.
There are numerous works on improving the robustness of FL; these include robust aggregation methods [31, 19, 7], contribution examination based on reputation mechanisms [36], credibility-based re-weighting [4], distillation-based semi-supervised learning [14], and personalized multi-task learning [19]. However, these methods are not designed for identifying noisy labels. Even when these methods are used to detect noisy clients, either there is no mechanism for further label correction at the noisy clients [31, 19, 36, 7], or the effect of noisy labels is mitigated with the aid of an auxiliary dataset, without any direct label correction [4, 14]. One notable exception is [38], which carries out label correction during training by exchanging feature centroids between clients and server. This exchange of centroids may lead to privacy concerns, since centroids could potentially be used in reverse engineering to reveal non-trivial information about raw local data. In contrast to these methods, FedCorr incorporates the generation of diverse local data distributions with synthetic label noise, together with noisy label identification and correction, without privacy leakage.
2.2 Local intrinsic dimension (LID)
Informally, LID [11] is a measure of the intrinsic dimensionality of the data manifold. In comparison to other measures, LID has the potential for wider applications, as it makes no assumptions on the data distribution beyond continuity. The key underlying idea is that at each datapoint, the number of neighboring datapoints grows with the radius of the neighborhood, and the corresponding growth rate serves as a proxy for the "local" dimension.
LID builds upon this idea [12] via the geometric intuition that the volume of an $m$-dimensional Euclidean ball grows proportionally to $c^m$ when its radius is scaled by a factor of $c$. Specifically, given two $m$-dimensional Euclidean balls with volumes $V_1$, $V_2$ and radii $r_1$, $r_2$, we can compute $m$ as follows:

$$m = \frac{\ln(V_2/V_1)}{\ln(r_2/r_1)}. \tag{1}$$
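As a quick sanity check of this volume-ratio identity, the following stdlib-only sketch (the helper name is ours) recovers the dimension of a ball family from two (volume, radius) pairs:

```python
import math

def dimension_from_balls(V1, r1, V2, r2):
    """Recover the dimension m of a Euclidean ball family from two
    (volume, radius) pairs, via Eq. (1): m = ln(V2/V1) / ln(r2/r1)."""
    return math.log(V2 / V1) / math.log(r2 / r1)

# Volumes of 3-dimensional balls scale as r**3:
V = lambda r: (4.0 / 3.0) * math.pi * r ** 3
print(dimension_from_balls(V(1.0), 1.0, V(2.0), 2.0))  # -> 3.0 (up to rounding)
```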
We shall now formally define LID. Suppose we have a dataset consisting of vectors in $\mathbb{R}^n$. We shall treat this dataset as samples drawn from an $n$-variate distribution $\mathcal{F}$. For any $x \in \mathbb{R}^n$, let $R_x$ be the random variable representing the (non-negative) distance from $x$ to a randomly selected point drawn from $\mathcal{F}$, and let $F_{R_x}$ be the cumulative distribution function of $R_x$. Given $r > 0$ and a sample point $x$ drawn from $\mathcal{F}$, define the LID of $x$ at distance $r$ to be

$$\mathrm{LID}_x(r) := \frac{r\,F'_{R_x}(r)}{F_{R_x}(r)},$$

provided that it exists, i.e. provided that $F_{R_x}$ is positive and continuously differentiable at $r$. The LID at $x$ is defined to be the limit $\mathrm{LID}_x := \lim_{r \to 0^+} \mathrm{LID}_x(r)$. Intuitively, the LID at $x$ is an approximation of the dimension of a smooth manifold containing $x$ that would "best" fit the distribution $\mathcal{F}$ in the vicinity of $x$.
Estimation of LID: By treating the smallest neighbor distances as "extreme events" associated with the lower tail of the underlying distance distribution, [2] proposes several estimators of LID based on extreme value theory. In particular, given a set of points $X$, a reference point $x$, and its $k$ nearest neighbors in $X$, the maximum-likelihood estimate (MLE) of $\mathrm{LID}_x$ is:

$$\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^{k} \ln \frac{r_i(x)}{r_{\max}(x)}\right)^{-1}, \tag{2}$$

where $r_i(x)$ denotes the distance between $x$ and its $i$-th nearest neighbor, and $r_{\max}(x)$ is the maximum distance from $x$ among the $k$ nearest neighbors.
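The MLE estimator in (2) can be sketched in a few lines; this is a minimal stdlib-only illustration (function name ours), fed with synthetic neighbor distances whose lower-tail distribution behaves like that of a 2-dimensional manifold:

```python
import math

def lid_mle(dists):
    """MLE estimate of LID from the distances r_1 <= ... <= r_k between a
    reference point and its k nearest neighbours, as in Eq. (2):
    -(1/k * sum_i ln(r_i / r_max))**(-1)."""
    k = len(dists)
    r_max = max(dists)
    return -k / sum(math.log(r / r_max) for r in dists)

# Distances drawn as if points were uniform on a d-dimensional manifold
# (cdf F(r) ~ r**d near 0), so the estimate should be close to d = 2:
k, d = 100, 2
dists = [((i + 1) / k) ** (1.0 / d) for i in range(k)]
print(lid_mle(dists))  # roughly 2
```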
3 Proposed Method
In this section, we introduce FedCorr, our proposed multi-stage training method for tackling heterogeneous label noise in FL systems (see Algorithm 1). Our method comprises three stages: preprocessing, fine-tuning and usual training. In the first stage, we sample clients without replacement using a small fraction, identify noisy clients via LID scores and noisy samples via per-sample losses, and then relabel the identified noisy samples with the predicted labels of the global model. The noise level of each client is also estimated in this stage. In the second stage, we fine-tune the model with a typical fraction on relatively clean clients, and use the fine-tuned model to further correct the labels of the remaining clients. Finally, in the last stage, we train the model via the usual FL method (FedAvg [26]) using the labels corrected at the end of the second stage.
3.1 Preliminaries
Consider an FL system with $N$ clients and an $M$-class dataset $\mathcal{D} = \bigcup_{k=1}^{N}\mathcal{D}_k$, where each $\mathcal{D}_k$ denotes the local dataset of client $k$. Let $[N]$ denote the set of all clients, and let $w_k^t$ (resp. $w^t$) denote the local model weights of client $k$ (resp. the global model weights obtained by aggregation) at the end of communication round $t$. At the end of round $t$, the global model has its weights updated as follows:

$$w^t = \sum_{k \in S^t} \frac{|\mathcal{D}_k|}{\sum_{j \in S^t}|\mathcal{D}_j|}\, w_k^t, \tag{3}$$

where $S^t$ is the subset of selected clients in round $t$.
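The size-weighted aggregation in (3) is simple to sketch; a minimal illustration on flattened parameter lists (function name ours, plain Python rather than any FL framework):

```python
def fedavg_aggregate(local_weights, local_sizes):
    """Size-weighted average of local model weights, as in Eq. (3).
    local_weights: one flattened parameter list per selected client;
    local_sizes: the corresponding local dataset sizes |D_k|."""
    total = sum(local_sizes)
    dim = len(local_weights[0])
    return [sum(w[j] * n for w, n in zip(local_weights, local_sizes)) / total
            for j in range(dim)]

# The client with 30 samples pulls the average three times as hard
# as the client with 10 samples:
w_new = fedavg_aggregate([[1.0, 0.0], [3.0, 2.0]], [10, 30])
print(w_new)  # -> [2.5, 1.5]
```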
For the rest of this subsection, we shall give details on client data partition, noise model simulation, and LID score computation. These are three major aspects of our proposed approach to emulate data heterogeneity, and to deal with the discrepancies in both local data statistics and label quality.
Data partition. We consider both IID and non-IID heterogeneous data partitions in this work. For IID partitions, the whole dataset $\mathcal{D}$ is uniformly distributed at random among the $N$ clients. For non-IID partitions, we first generate an indicator matrix $\Phi \in \{0,1\}^{N \times M}$, where each entry $\Phi_{k,i}$ indicates whether the local dataset of client $k$ contains class $i$. Each $\Phi_{k,i}$ is sampled from the Bernoulli distribution with a fixed probability $p$. For each class $i$, let $q_i$ be the sum of entries in the $i$-th column of $\Phi$; this equals the number of clients whose local datasets contain class $i$. Let $v_i$ be a vector of length $q_i$, sampled from the symmetric Dirichlet distribution with common parameter $\alpha_{\mathrm{Dir}}$. Using $v_i$ as a probability vector, we then randomly allocate the samples within class $i$ to these $q_i$ clients. Note that our non-IID data partition method provides a general framework to control the variability in both the class distribution and the sizes of local datasets (see Fig. 2).

Noise model. To emulate label noise in real-world data, we introduce a general federated noise model framework. For simplicity, this work only considers instance-independent label noise. This framework has two parameters $\rho$ and $\tau$, where $\rho$ denotes the system noise level (the ratio of noisy clients) and $\tau$ denotes the lower bound for the noise level of a noisy client. Every client has a probability $\rho$ of being a noisy client, in which case the local noise level for this noisy client is determined randomly, by sampling from the uniform distribution $U(\tau, 1)$. Succinctly, the noise level $u_k$ of client $k$ (for $k \in [N]$) is

$$u_k = \begin{cases} u \sim U(\tau, 1), & \text{with probability } \rho;\\ 0, & \text{with probability } 1-\rho. \end{cases} \tag{4}$$

When $u_k > 0$, the noisy samples (a fraction $u_k$ of $\mathcal{D}_k$) are chosen uniformly at random, and are assigned random labels, selected uniformly from the $M$ classes.
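This noise model is straightforward to simulate; a stdlib-only sketch under the description above (function names ours; note that a uniformly drawn replacement label may coincide with the original by chance):

```python
import random

def sample_noise_levels(num_clients, rho, tau, seed=0):
    """Eq. (4): with probability rho a client is noisy and its noise
    level is drawn from U(tau, 1); otherwise its level is 0."""
    rng = random.Random(seed)
    return [rng.uniform(tau, 1.0) if rng.random() < rho else 0.0
            for _ in range(num_clients)]

def corrupt_labels(labels, u, num_classes, rng):
    """Replace a fraction u of labels, chosen uniformly at random,
    with labels drawn uniformly from the num_classes classes."""
    labels = list(labels)
    for i in rng.sample(range(len(labels)), round(u * len(labels))):
        labels[i] = rng.randrange(num_classes)
    return labels

levels = sample_noise_levels(100, rho=0.6, tau=0.5)
noisy_labels = corrupt_labels([0] * 20, u=0.5, num_classes=10,
                              rng=random.Random(1))
```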
LID scores for local models. In this paper, we associate LID scores to local models. Consider an arbitrary client $k$ with local dataset $\mathcal{D}_k$ and current local model $w_k$. Let $F(\mathcal{D}_k; w_k)$ be the set of prediction vectors, and for each prediction vector, compute its LID estimate w.r.t. its nearest neighbors in $F(\mathcal{D}_k; w_k)$, as given in (2). We define the LID score of $(\mathcal{D}_k, w_k)$ to be the average of these estimates over all prediction vectors. Note that as the local model gets updated with each round, the corresponding LID score changes accordingly.
Experiments have shown that given the same training process, models trained on a dataset with label noise tend to have larger LID scores than models trained on the same dataset with clean labels [25, 24]. Intuitively, the prediction vectors of a well-trained model, trained on a clean dataset, would cluster around the possible one-hot vectors, corresponding to the classes. However, as more label noise is added to the clean dataset, the prediction vector of a noisy sample would tend to be shifted towards the other clusters, with different noisy samples shifted in different directions. Hence, the prediction vectors near each one-hot vector would become "more diffuse" and would on average span a higher-dimensional space.
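Averaging per-point MLE estimates into a single score can be sketched as follows; a stdlib-only, brute-force illustration (function names ours), contrasting vectors concentrated near a low-dimensional structure with diffuse ones:

```python
import math, random

def lid_mle(dists):
    # MLE estimator of Eq. (2), given the k nearest-neighbour distances.
    r_max = max(dists)
    return -len(dists) / sum(math.log(r / r_max) for r in dists)

def lid_score(points, k=10):
    """Average LID estimate over a set of vectors, using brute-force
    nearest-neighbour search (adequate for small sets)."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    total = 0.0
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        total += lid_mle(d[:k])
    return total / len(points)

rng = random.Random(0)
# Vectors concentrated on a 1-D curve vs. spread over all 10 dimensions:
line = [[t] * 10 for t in (rng.random() for _ in range(100))]
cube = [[rng.random() for _ in range(10)] for _ in range(100)]
print(lid_score(line), lid_score(cube))  # the line scores much lower
```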
3.2 Federated preprocessing stage
FedCorr begins with the preprocessing stage, which iteratively evaluates the quality of the dataset of each client, and relabels identified noisy samples. This preprocessing stage differs from traditional FL in the following aspects:

All clients will participate in each iteration. Clients are selected without replacement, using a small fraction.

An adaptive local proximal term is added to the loss function, and mixup data augmentation is used.

Each client computes its LID score and per-sample cross-entropy losses after local training, and sends its LID score together with its local model updates to the server.
Client iteration and fraction scheduling.
The preprocessing stage is divided into iterations; in each iteration, every client participates exactly once. Each iteration is organized into communication rounds, similar to usual FL, but with two key differences: a small fraction is used, and clients are selected without replacement. An iteration ends when all clients have participated.
It is known that large fractions can help improve the convergence rate [26], and a linear speedup can even be achieved in the case of convex loss functions [29]. However, large fractions have a weak effect in non-IID settings, while intuitively, small fractions yield aggregated models that deviate less from local models; cf. [23]. These observations inspire us to propose a fraction scheduling scheme that combines the advantages of both small and large fractions. Specifically, we sample clients using a small fraction without replacement in the preprocessing stage, and use a typical larger fraction with replacement in the latter two stages. By sampling without replacement during preprocessing, we ensure that all clients participate equally in the evaluation of the overall quality of labels in local datasets.
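The without-replacement schedule of the preprocessing stage can be sketched as follows (a stdlib-only illustration; function name and the chunking-by-shuffle strategy are ours):

```python
import random

def preprocessing_schedule(num_clients, frac, seed=0):
    """One preprocessing iteration: shuffle all clients, then select them
    round by round in groups of size max(1, frac * N) without replacement,
    so that every client participates exactly once per iteration."""
    rng = random.Random(seed)
    order = list(range(num_clients))
    rng.shuffle(order)
    group = max(1, int(frac * num_clients))
    return [order[i:i + group] for i in range(0, num_clients, group)]

rounds = preprocessing_schedule(100, frac=0.01)
print(len(rounds))  # -> 100 rounds of one client each
```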
Mixup and local proximal regularization.
Throughout the preprocessing stage, for client $k$ with batch $\mathcal{D}_k^b$ (where $b$ denotes the batch size), we use the following loss function:

$$\ell = \ell_{\mathrm{CE}}\big(\widetilde{\mathcal{D}}_k^b;\, w\big) + \beta\,\widehat{u}_k\,\big\|w - w^{t-1}\big\|^2, \tag{5}$$

Here, $w$ denotes the local model of client $k$ in round $t$, and $w^{t-1}$ denotes the weights of the global model obtained in the previous round $t-1$. The first term in (5) represents the cross-entropy loss on the mixup augmentation $\widetilde{\mathcal{D}}_k^b$ of $\mathcal{D}_k^b$, while the second term in (5) is an adaptive local proximal regularization term, where $\widehat{u}_k$ is the estimated noise level of client $k$, to be defined later. It should be noted that our local proximal regularization term is only applied in the preprocessing stage.
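The two ingredients of this loss can be sketched in plain Python; a minimal illustration (function names ours; `alpha=1.0` is an arbitrary default for the sketch, not a value prescribed by the paper, and the squared norm is one common choice of proximal penalty):

```python
import random

def mixup_batch(xs, ys, alpha=1.0, rng=None):
    """Mixup: each sample is combined with a randomly chosen partner.
    xs: feature vectors; ys: one-hot label vectors; lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random(0)
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed_x, mixed_y = [], []
    for i, j in zip(range(len(xs)), partners):
        lam = rng.betavariate(alpha, alpha)
        mixed_x.append([lam * a + (1 - lam) * b for a, b in zip(xs[i], xs[j])])
        mixed_y.append([lam * a + (1 - lam) * b for a, b in zip(ys[i], ys[j])])
    return mixed_x, mixed_y

def proximal_term(w, w_prev, beta, u_hat):
    """Adaptive proximal penalty beta * u_hat * ||w - w_prev||^2;
    it vanishes for clients with estimated noise level u_hat = 0."""
    return beta * u_hat * sum((a - b) ** 2 for a, b in zip(w, w_prev))

xs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
mx, my = mixup_batch(xs, ys, rng=random.Random(0))
```
Mixed labels remain probability vectors, so they can be fed directly to a cross-entropy loss.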
Recall that mixup [41] is a data augmentation technique that favors linear relations between samples, and that has been shown to exhibit strong robustness to label noise [3, 18]. Mixup generates new samples as convex combinations of randomly selected pairs of samples $(x_i, y_i)$ and $(x_j, y_j)$, given by $\tilde{x} = \lambda x_i + (1-\lambda)x_j$ and $\tilde{y} = \lambda y_i + (1-\lambda)y_j$, where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha > 0$. (We use a fixed value of $\alpha$ in our experiments.) Intuitively, mixup achieves robustness to label noise through random interpolation. For example, if $(x_i, y_i)$ is a noisy sample whose true label is $y$, then the negative impact caused by the incorrect label $y_i$ is alleviated when $x_i$ is paired with a sample whose label is $y$.
Our adaptive local proximal regularization term is scaled by $\widehat{u}_k$, the estimated noise level of client $k$ computed at the end of the previous round. (In particular, this term vanishes for clean clients.) The hyperparameter $\beta$ is also incorporated to control the overall effect of this term. Intuitively, if a client's dataset has a larger discrepancy from other local datasets, then the corresponding local model deviates more from the global model, thereby contributing a larger loss value to the local proximal term.
Identification of noisy clients and noisy samples.
To address the challenge of heterogeneous label noise, we shall iteratively identify and relabel the noisy samples. In each iteration of this preprocessing stage, in which all clients participate, every client computes the LID score and per-sample losses for its current local model (see Algorithm 1, lines 3–9). Specifically, when client $k$ is selected in round $t$, we train the model on the local dataset $\mathcal{D}_k$ and then compute the LID score of $(\mathcal{D}_k, w_k^t)$ via (2). Note that our proposed framework preserves the privacy of client data, since in comparison to the usual FL, there is only an additional LID score sent to the server, which is a single scalar that reflects only the predictive discriminability of the local model. Since the LID score is computed from the predictions of the output layer (of the local model), knowing this LID score does not reveal information about the raw input data. This additional LID score is a single scalar, hence it has a negligible effect on communication cost.
At the end of each iteration, we shall perform the following three steps:

Step 1. The server first fits a Gaussian Mixture Model (GMM) to the cumulative LID scores of all $N$ clients. Using this GMM, the set of clients is partitioned into two subsets: noisy clients and clean clients.

Step 2. Each noisy client $k$ locally fits a new GMM to the per-sample loss values of all samples in the local dataset $\mathcal{D}_k$. Using this GMM, $\mathcal{D}_k$ is partitioned into two subsets: a clean subset $\mathcal{D}_k^c$ and a noisy subset $\mathcal{D}_k^n$. We observe that large-loss samples are more likely to have noisy labels. The local noise level of client $k$ can then be estimated by $\widehat{u}_k = |\mathcal{D}_k^n| / |\mathcal{D}_k|$ if $k$ is identified as noisy, and $\widehat{u}_k = 0$ otherwise.

Step 3. Each noisy client performs relabeling of the noisy samples, using the predicted labels of the global model as the new labels. In order to avoid over-correction, we only relabel those samples that are identified to be noisy with high confidence. This partial relabeling is controlled by a relabel ratio $\pi$ and a confidence threshold $\theta$. Take noisy client $k$ for example: We first choose the samples from $\mathcal{D}_k^n$ that correspond to the top $\pi$-fraction of largest per-sample cross-entropy losses; denote this subset by $\widetilde{\mathcal{D}}_k^n$. Next, we obtain the prediction vectors of the global model, and relabel a sample only when the maximum entry of its prediction vector exceeds $\theta$. Thus, the subset of samples to be relabeled is given by

$$\mathcal{L}_k = \Big\{(x, y) \in \widetilde{\mathcal{D}}_k^n : \max_{1 \le i \le M} \big[F(x; w)\big]_i > \theta \Big\}, \tag{6–7}$$

where $w$ is the global model at the end of the iteration, $F(x; w)$ is its prediction vector for $x$, and the new label of each relabeled sample $x$ is $\arg\max_i [F(x; w)]_i$.
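The two GMM-based partitions in Steps 1 and 2 can be sketched with a tiny hand-rolled EM for a two-component 1-D mixture (a stand-in for an off-the-shelf GMM library; all names and the toy scores are ours):

```python
import math

def gmm2_fit(xs, iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM; return, for each
    point, the posterior probability of the higher-mean component."""
    mean_all = sum(xs) / len(xs)
    v0 = sum((x - mean_all) ** 2 for x in xs) / len(xs) + 1e-6
    mu, var, pi = [min(xs), max(xs)], [v0, v0], [0.5, 0.5]

    def pdf(x, m, s2):
        return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

    r = [0.5] * len(xs)
    for _ in range(iters):
        # E-step: responsibility of component 1 (the higher-mean one).
        r = []
        for x in xs:
            p0 = pi[0] * pdf(x, mu[0], var[0])
            p1 = pi[1] * pdf(x, mu[1], var[1])
            r.append(p1 / (p0 + p1))
        # M-step: re-estimate means, variances and mixing weights.
        n1 = sum(r)
        n0 = len(xs) - n1
        mu = [sum((1 - ri) * x for ri, x in zip(r, xs)) / n0,
              sum(ri * x for ri, x in zip(r, xs)) / n1]
        var = [sum((1 - ri) * (x - mu[0]) ** 2 for ri, x in zip(r, xs)) / n0 + 1e-6,
               sum(ri * (x - mu[1]) ** 2 for ri, x in zip(r, xs)) / n1 + 1e-6]
        pi = [n0 / len(xs), n1 / len(xs)]
    return r

# Cumulative LID scores of 7 hypothetical clients; the last three are
# flagged as noisy (posterior of the high-mean component > 0.5):
scores = [1.0, 1.1, 0.9, 1.05, 3.0, 3.2, 2.9]
noisy_clients = [i for i, p in enumerate(gmm2_fit(scores)) if p > 0.5]
print(noisy_clients)  # -> [4, 5, 6]
```
The same routine, applied to per-sample losses instead of LID scores, yields the clean/noisy split within a client, and the noisy fraction gives the noise-level estimate.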
Why do we use cumulative LID scores in step 1?
In deep learning, it has been empirically shown that when training on a dataset with label noise, the evolution of the representation space of the model exhibits two distinct phases: (1) an early phase of dimensionality compression, where the model tends to learn the underlying true data distribution, and (2) a later phase of dimensionality expansion, where the model overfits to noisy labels [25].
We observed that clients with larger noise levels tend to have larger LID scores. Also, the overlap of LID scores between clean and noisy clients increases during training. This increase could be due to two reasons: (1) the model may gradually overfit to noisy labels, and (2) we correct the identified noisy samples after each iteration, thereby making the clients with low noise levels less distinguishable from clean clients. Hence, the cumulative LID score (i.e., the sum of LID scores over all past iterations) is a better metric for distinguishing noisy clients from clean clients; see the top two plots in Fig. 3 for a comparison of the LID score versus the cumulative LID score. Furthermore, the bottom two plots in Fig. 3 show that the cumulative LID score has a stronger linear relation with the local noise level.
3.3 Federated fine-tuning stage
We aim to fine-tune the global model on relatively clean clients and to further relabel the remaining noisy clients. The aggregation at the end of each round is given by the same equation (3), with one key difference: $S^t$ is now a subset of the relatively clean clients, selected by thresholding the estimated local noise levels $\widehat{u}_k$.
At the end of the fine-tuning stage, we relabel the remaining noisy clients with the predicted labels of the fine-tuned global model. As in the correction process of the preprocessing stage, we use the same confidence threshold $\theta$ to control the subset of samples to be relabeled; see (7).
3.4 Federated usual training stage
In this final stage, we train the global model via usual FL (FedAvg) on all clients, using the labels corrected in the previous two training stages. We also combine this usual training stage with three other FL methods, to show that methods based on different techniques can be well incorporated into FedCorr, even if they are not designed specifically for robust FL; see Sec. 4.2.
4 Experiments
In this section, we conduct experiments in both IID (CIFAR-10/100 [16]) and non-IID (CIFAR-10, Clothing1M [34]) data settings, at multiple noise levels, to show that FedCorr is simultaneously robust to discrepancies in both local label quality and data statistics. To demonstrate the versatility of FedCorr, we also show that various FL methods can have their performance further improved by incorporating the first two stages of FedCorr. We also conduct an ablation study to show the effects of the different components of FedCorr. Details on the data partition and the noise model used have already been given in Sec. 3.1.
4.1 Experimental Setup
Baselines. There are two groups of experiments.
In the first group, we demonstrate that FedCorr is robust to discrepancies in both data statistics and label quality. We compare FedCorr with the following state-of-the-art methods from three categories: (1) methods that tackle label noise in CL (JointOpt [30] and DivideMix [18]), applied to local clients; (2) classic FL methods (FedAvg [26] and FedProx [21]); and (3) FL methods designed to be robust to label noise (RoFL [38] and ARFL [7]). For reference, we also report experimental results for JointOpt and DivideMix in CL, so as to show the performance reduction of these two methods when they are used in FL.
In the second group, we demonstrate the versatility of FedCorr. We examine the performance improvements of three state-of-the-art methods when the first two stages of FedCorr are incorporated. These methods are chosen from three different aspects of improving FL: local optimization (FedDyn [1]), aggregation (Median [39]) and client selection (PoC [6]).
Implementation details. We choose different models and numbers of clients for each dataset; see Tab. 1. For data preprocessing, we perform normalization and image augmentation using random horizontal flipping and random cropping with padding = 4. We use an SGD local optimizer with a momentum of 0.5, with a batch size of 10 for CIFAR-10/100 and 16 for Clothing1M. With the exception of JointOpt and DivideMix used in FL settings, we always use 5 local epochs across all experiments. For FedCorr, we always use the same hyperparameters on the same dataset, with fixed dataset-specific values for CIFAR-10, CIFAR-100 and Clothing1M, respectively. For fraction scheduling, we use a small fraction in the preprocessing stage, and the fractions specified in Tab. 1 for the latter two stages. Further implementation details can be found in the supplementary material; see Appendix B.
4.2 Comparison with state-of-the-art methods
IID settings. We compare FedCorr with multiple baselines at different noise levels, using the same configuration. Tab. 2 and Tab. 3 show the results on CIFAR-10 and CIFAR-100, respectively. In summary, FedCorr achieves the best test accuracies across all noise settings tested on both datasets, with particularly significant outperformance at high noise levels. Note that we have implemented JointOpt and DivideMix in both centralized and federated settings, to show the performance reduction incurred when these CL methods are applied to local clients in FL. Furthermore, the accuracies in CL can also be regarded as upper bounds for the accuracies in FL. Remarkably, the accuracy gap between DivideMix in CL and FedCorr in FL remains small even in the extreme noise setting. In the centralized setting, we use the dataset corrupted with exactly the same scheme as in the federated setting. For the federated setting, we warm up the global model for 20 rounds with FedAvg, to avoid introducing additional label noise during the correction process in the early stage of training, and we then apply JointOpt or DivideMix locally on each selected client, using 20 local training epochs.
Non-IID settings. To evaluate FedCorr in more realistic heterogeneous data settings, we conduct experiments using the non-IID settings described in Sec. 3.1, over different values of the partition parameters. Tab. 4 and Tab. 5 show the results on CIFAR-10 and Clothing1M, respectively. Note that we do not add synthetic label noise to Clothing1M, since it already contains real-world label noise. For CIFAR-10, FedCorr consistently outperforms all baselines. For Clothing1M, FedCorr also achieves the highest accuracy in FL, and this accuracy is even higher than the reported accuracy of JointOpt in CL.
Combination with other FL methods. We also investigate the performance of three state-of-the-art methods when the first two stages of FedCorr are incorporated. As shown in Fig. 4, we consistently obtain significant accuracy improvements on CIFAR-10/100 for various ratios of noisy clients.
4.3 Ablation study
Tab. 6 gives an overview of the effects of the components in FedCorr. Below, we consolidate some insights into what makes FedCorr successful:

All components help to improve accuracy.

Fraction scheduling has the largest effect. The small fraction used in the preprocessing stage helps to capture local data characteristics, as it avoids the information loss brought about by aggregation over multiple models.

The highest accuracy among different noise levels is primarily achieved at a low noise level rather than at the zero noise level, since additional label noise could be introduced during label correction.
5 Conclusion
We present FedCorr, a general FL framework that jointly tackles the discrepancies in both local label quality and data statistics, and that performs privacypreserving label correction for identified noisy clients. Our experiments demonstrate the robustness and outperformance of FedCorr at multiple noise levels and diverse data settings.
In its current formulation, FedCorr does not consider dynamic participation in FL, whereby clients can join or leave training at any time. New clients joining much later would always have relatively lower cumulative LID scores, which means new noisy clients could be categorized incorrectly as clean clients. Thus, further work is required to handle dynamic participation.
References
 [1] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas, Matthew Mattina, Paul Whatmough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations, 2020.
 [2] Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E Houle, Kenichi Kawarabayashi, and Michael Nett. Estimating local intrinsic dimensionality. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38, 2015.
 [3] Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning, pages 312–321. PMLR, 2019.
 [4] Yiqiang Chen, Xiaodong Yang, Xin Qin, Han Yu, Biao Chen, and Zhiqi Shen. Focus: Dealing with label quality disparity in federated learning. In International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with IJCAI (FL-IJCAI'20), 2020.

 [5] Zihan Chen, Kai Fong Ernest Chong, and Tony QS Quek. Dynamic attention-based communication-efficient federated learning. In International Workshop on Federated and Transfer Learning for Data Sparsity and Confidentiality in Conjunction with IJCAI (FTL-IJCAI'2021), 2021.
 [6] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Client selection in federated learning: Convergence analysis and power-of-choice selection strategies. arXiv preprint arXiv:2010.01243, 2020.
 [7] Shuhao Fu, Chulin Xie, Bo Li, and Qifeng Chen. Attack-resistant federated learning with residual-based reweighting. In AAAI Workshop Towards Robust, Secure and Efficient Machine Learning, 2021.
 [8] Bo Han, Quanming Yao, Tongliang Liu, Gang Niu, Ivor W Tsang, James T Kwok, and Masashi Sugiyama. A survey of label-noise representation learning: Past, present and future. arXiv preprint arXiv:2011.04406, 2020.

 [9] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.
 [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [11] Michael E Houle. Dimensionality, discriminability, density and distance distributions. In 2013 IEEE 13th International Conference on Data Mining Workshops, pages 468–473. IEEE, 2013.
 [12] Michael E Houle. Local intrinsic dimensionality I: An extreme-value-theoretic foundation for similarity applications. In International Conference on Similarity Search and Applications, pages 64–79. Springer, 2017.
 [13] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
 [14] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. IEEE Transactions on Mobile Computing, 2021.
 [15] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
 [16] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 [17] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.

 [18] Junnan Li, Richard Socher, and Steven C.H. Hoi. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations, 2020.
 [19] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
 [20] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
 [21] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, volume 2, pages 429–450, 2020.
[22] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019.
[23] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-IID data. arXiv preprint arXiv:1907.02189, 2019.
 [24] Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018.
[25] Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pages 3355–3364. PMLR, 2018.
 [26] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communicationefficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
 [27] Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.
[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015.
[29] Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 [30] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.
[31] Ching Pui Wan and Qifeng Chen. Robust federated learning with attack-adaptive aggregation. In International Workshop on Federated and Transfer Learning for Data Sparsity and Confidentiality in Conjunction with IJCAI (FTL-IJCAI'2021), 2021.
 [32] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 2020.
[33] Xiaobo Xia, Tongliang Liu, Bo Han, Nannan Wang, Mingming Gong, Haifeng Liu, Gang Niu, Dacheng Tao, and Masashi Sugiyama. Part-dependent label noise: Towards instance-dependent label noise. NeurIPS, 33, 2020.
 [34] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2691–2699, 2015.

[35] Jingyi Xu, Tony Q.S. Quek, and Kai Fong Ernest Chong. Training classifiers that are universally robust to all label noise levels. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
[36] Xinyi Xu and Lingjuan Lyu. A reputation mechanism is all you need: Collaborative fairness and adversarial robustness in federated learning. In International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML (FL-ICML'21), 2021.
 [37] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
 [38] Seunghan Yang, Hyoungseob Park, Junyoung Byun, and Changick Kim. Robust federated learning with noisy labels. IEEE Intelligent Systems, 2022.
[39] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5650–5659. PMLR, 10–15 Jul 2018.
 [40] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In ICML, pages 7164–7173. PMLR, 2019.
 [41] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David LopezPaz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
Appendix A Outline
As part of the supplementary material for our paper titled "FedCorr: Multi-Stage Federated Learning for Label Noise Correction", we provide further details, organized into the following sections:

Appendix B introduces the implementation details for our method and baselines.

Appendix C provides further details on our experiments.

Sec. C.1 gives additional experiment results on CIFAR-100 with a non-IID data partition.

Sec. C.2 shows that FedCorr is model-agnostic, via a comparison of the test accuracies and the distributions of cumulative LID scores, using different model architectures.

Sec. C.3 gives a comparison of the communication efficiency of different methods.

Sec. C.4 explains why cumulative LID scores are preferred over LID scores for identifying noisy clients.

Sec. C.5 demonstrates the effectiveness of both the label noise identification and the label correction process in FedCorr.

Sec. C.6 gives further details on the ablation study results for FedCorr.

Sec. C.7 provides further intuition on the non-IID data settings used in our experiments, via explicit illustrations of the corresponding non-IID data partitions on CIFAR-10, over 100 clients.


Appendix D discusses the potential negative societal impact of FedCorr.
Appendix B Implementation details
All experiments were implemented in PyTorch. Among the baselines, we reimplemented FedAvg [26], FedProx [21], JointOpt [30], DivideMix [18] and PoC [6]; we used the official implementations of FedDyn [1] and ARFL [7]; and for RoFL (https://github.com/jangsoohyuk/RobustFederatedLearningwithNoisyLabels) and Median (https://github.com/fushuhao6/AttackResistantFederatedLearning), we used their unofficial implementations. For all methods, we used an SGD local optimizer with a momentum of 0.5 and no weight decay, with a batch size of 10 for CIFAR-10/100 and 16 for Clothing1M. Note that at each noise level, we used the same training hyperparameters for both IID and non-IID data partitions.

For the implementation of each federated learning (FL) method, we define its total communication cost to be the cumulative number of clients that participate in training. For example, if a client participates in 10 communication rounds, then that client contributes 10 to the total communication cost. For every method except JointOpt and DivideMix, we reimplemented the method using 5 local epochs per communication round and the same total communication cost for each dataset, which corresponds to 1000 rounds of FedAvg for CIFAR-10/100 with fraction 0.1, and to 200 rounds of FedAvg for Clothing1M with fraction 0.02. Settings for JointOpt and DivideMix are discussed below.
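As an illustration only (the function name and log format below are hypothetical, not taken from our implementation), this cost accounting can be sketched as:

```python
def total_communication_cost(participation_log):
    """Cumulative number of client participations across all rounds.

    participation_log[t] holds the ids of the clients that took part in
    communication round t; each appearance contributes 1 to the cost.
    """
    return sum(len(clients) for clients in participation_log)

# Client 0 participates in all 3 rounds, so it contributes 3 to the total.
log = [[0, 1], [0, 2], [0, 3]]
assert total_communication_cost(log) == 6
```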
In the rest of this section, we give full details on all remaining hyperparameters used for each method. For baseline methods, we also provide brief descriptions of their main underlying ideas.

FedCorr. We fixed the hyperparameters for LID estimation, for mixup, and for the proximal regularization term across all reported experiments; all remaining hyperparameters can be found in Tab. 7. Note that the total communication cost for FedCorr is the same as for the baselines. Take CIFAR-10 as an example: in each iteration of the preprocessing stage of FedCorr, every client participates exactly once, whereas in each communication round of the other baselines, only a fraction of 0.1 of the clients participate. Hence, one iteration of the preprocessing stage of FedCorr has 10 times the total communication cost of one communication round of the other baselines.
For the latter two stages of FedCorr, we used the usual fraction of 0.1. Hence the total communication cost of the entire implementation of FedCorr equals that of implementing FedAvg with fraction 0.1 over the corresponding number of communication rounds.
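The cost equivalence above can be made concrete with a small sketch (the function and its arguments are hypothetical; the point is only the arithmetic):

```python
def fedcorr_total_cost(num_clients, preproc_iters, later_rounds, frac=0.1):
    """Total communication cost of a FedCorr-style schedule: every client
    participates once per preprocessing iteration, while the later stages
    sample a fraction `frac` of the clients per communication round."""
    return num_clients * preproc_iters + round(frac * num_clients) * later_rounds

# With 100 clients and fraction 0.1, one preprocessing iteration (cost 100)
# costs as much as 10 FedAvg-style rounds of 10 participants each.
assert fedcorr_total_cost(100, preproc_iters=1, later_rounds=0) == 100
assert fedcorr_total_cost(100, preproc_iters=0, later_rounds=10) == 100
```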

JointOpt [30] is one of the state-of-the-art centralized methods for tackling label noise; it alternately updates network parameters and corrects labels using the model prediction vectors, and it introduced two additional hyperparameters. In the centralized setting, we used the hyperparameters given in Tab. 8. In particular, we considered a total of seven noise settings, which we divided into two groups: low noise levels (first four settings) and high noise levels (last three settings); within each group, we used the same hyperparameters. Note that these hyperparameters are not exactly the same as those given in [30], as we used different architectures and a different framework to generate synthetic label noise. In the federated setting, we tuned the two additional hyperparameters and the learning rate for CIFAR-10/100. To boost performance, we used a warm-up process for CIFAR-10/100: we first trained using FedAvg over 20 communication rounds with 5 local epochs per communication round, after which we switched to JointOpt for local training over 80 communication rounds with 20 local epochs per communication round. For Clothing1M, we tuned these hyperparameters and the learning rate separately; as we used a ResNet-50 already pretrained on ImageNet, no warm-up process was used for our Clothing1M experiments, and we trained using JointOpt over 40 communication rounds with 10 local epochs per round.
DivideMix [18] is another state-of-the-art centralized method, which dynamically divides the training data into labeled (clean) and unlabeled (noisy) subsets, and trains the model in a semi-supervised manner. For CIFAR-10/100, we used the same two groups of noise settings as described above for JointOpt. The only hyperparameter we tuned is one specific to DivideMix, for which we used different values for CIFAR-10 and CIFAR-100 at low and high noise levels; for all other hyperparameters for CIFAR-10/100, we used the values given in [18]. For Clothing1M, we tuned the same hyperparameter and the learning rate, and used the values given in [18] for all others. We used the same warm-up process for CIFAR-10/100, and the same number of communication rounds and local epochs for all datasets, as described above in our configuration for JointOpt.

FedAvg [26] is the first algorithm that introduced the idea of federated learning. We used learning rates of 0.03, 0.01 and 0.003 for CIFAR-10, CIFAR-100 and Clothing1M, respectively.

FedProx [21] was proposed to tackle data heterogeneity among clients by adding a fixed proximal term, with a tunable coefficient, to every local loss function. We used the same coefficient for all experiments, with learning rates of 0.01 and 0.003 for CIFAR-10/100 and Clothing1M, respectively.
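A minimal sketch of the FedProx local objective, written with plain NumPy arrays for illustration (the function name and `mu` for the proximal coefficient are our notation, not from the official implementation):

```python
import numpy as np

def fedprox_local_loss(base_loss, local_weights, global_weights, mu):
    """Local objective of FedProx [21]: task loss plus (mu/2)*||w - w_global||^2.

    local_weights / global_weights: lists of NumPy arrays (one per layer).
    """
    prox = sum(np.sum((w - wg) ** 2) for w, wg in zip(local_weights, global_weights))
    return base_loss + 0.5 * mu * prox

# Squared distance between local and global weights is 1^2 + 2^2 = 5.
w = [np.array([1.0, 2.0])]
wg = [np.array([0.0, 0.0])]
assert fedprox_local_loss(0.4, w, wg, mu=0.01) == 0.4 + 0.5 * 0.01 * 5.0
```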

RoFL [38] is, to the best of our knowledge, the only prior method designed for label correction in FL. It is based on exchanging feature centroids between the server and clients, and it introduced an additional hyperparameter to control label correction, which we set to 100, 400 and 10 for CIFAR-10, CIFAR-100 and Clothing1M, respectively. All other hyperparameters are set to the values given in [38].

ARFL [7] is a robust aggregation algorithm that resists abnormal attacks via residual-based reweighting, using two hyperparameters: a coefficient and a threshold. We fixed both for all experiments, and used separate learning rates for CIFAR-10/100 and Clothing1M.

FedDyn [1] proposed a dynamic regularizer, with a tunable coefficient, for local optimization in each communication round, so as to resolve the inconsistency between the local and global empirical losses. We used a learning rate of 0.1 with a decay of 0.998 for all experiments.

Median [39] is an aggregation method for robust distributed learning, whereby the aggregation in FedAvg is changed from the coordinate-wise mean to the coordinate-wise median. For all experiments, we used a learning rate of 0.01; all other hyperparameters are the same as for FedAvg.
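A minimal sketch of coordinate-wise median aggregation (function name is ours; client updates are represented as flat NumPy vectors):

```python
import numpy as np

def median_aggregate(client_updates):
    """Coordinate-wise median of client parameter vectors [39],
    replacing the mean used by FedAvg."""
    return np.median(np.stack(client_updates), axis=0)

updates = [np.array([1.0, 10.0]), np.array([2.0, 20.0]), np.array([100.0, 30.0])]
# A single outlier client (100.0 in the first coordinate) does not drag the aggregate.
assert np.allclose(median_aggregate(updates), [2.0, 20.0])
```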

PoC [6] is a client selection algorithm that is biased towards clients with higher local losses within a given candidate pool. We used a learning rate of 0.01 and the same client pool size for all experiments.
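The selection rule can be sketched as follows (a simplified illustration of the idea in [6], not the authors' implementation; the function name and arguments are ours):

```python
import numpy as np

def power_of_choice(local_losses, pool_size, num_selected, rng=None):
    """Sample a candidate pool of clients, then keep the clients in the
    pool with the highest local losses (sketch of PoC-style selection [6])."""
    rng = rng or np.random.default_rng(0)
    pool = rng.choice(len(local_losses), size=pool_size, replace=False)
    # Rank the pool by descending local loss and keep the top num_selected.
    ranked = sorted(pool, key=lambda c: local_losses[c], reverse=True)
    return [int(c) for c in ranked[:num_selected]]

losses = [0.1, 0.9, 0.5, 0.8]
chosen = power_of_choice(losses, pool_size=4, num_selected=2)
assert sorted(chosen) == [1, 3]  # the two highest-loss clients
```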
Appendix C Details on experiment results
Tab. 9: Best test accuracies (%) on CIFAR-100 with non-IID data partition.

Method  Best Test Accuracy (%) ± Standard Deviation (%)
FedAvg  64.75 ± 1.75
FedProx  65.72 ± 1.30
RoFL  59.31 ± 4.14
ARFL  48.03 ± 4.39
JointOpt  59.84 ± 1.99
DivideMix  39.76 ± 1.18
Ours  72.73 ± 1.02
Tab. 10: Best test accuracy (%) ± standard deviation (%) for each architecture across the seven noise settings.

Method  Best Test Accuracy (%) ± Standard Deviation (%)
ResNet-18  93.82±0.41  94.01±0.22  94.15±0.18  92.93±0.25  92.50±0.28  91.52±0.50  90.59±0.70
VGG-11  88.96±0.84  87.93±0.41  87.53±0.40  84.78±1.68  84.82±0.79  83.34±0.42  80.82±2.62
LeNet-5  72.03±0.35  70.47±0.86  70.02±1.39  69.09±0.16  67.48±0.54  67.49±0.74  65.16±0.53
C.1 CIFAR-100 with non-IID data partition
Method  Targeted communication cost across the seven noise settings (ratio relative to Ours in parentheses; "–" indicates the target accuracy was not reached)
Ours  150  210  230  230  330  360  510
FedAvg  370(2.6)  450(2.1)  470(2.0)  550(2.4)  930(2.8)  810(2.3)  –
FedProx  690(4.9)  1050(5.0)  1190(5.2)  1230(5.3)  1600(4.8)  1730(4.8)  4640(9.1)
RoFL  990(7.1)  1390(6.6)  1580(6.9)  1900(8.3)  4200(12.7)  2080(5.8)  –
ARFL  290(2.1)  740(3.5)  1180(5.1)  –  –  –  –
JointOpt  330(2.4)  420(2.0)  760(3.3)  550(2.4)  –  –  –
DivideMix  –  –  –  –  –  –  –
In terms of robustness to discrepancies in both local label quality and local data statistics, FedCorr significantly outperforms the baselines. In the main paper, we reported the outperformance of FedCorr on CIFAR-100 with IID data partition. To further show the outperformance on non-IID data partitions, we also conducted experiments on CIFAR-100 with synthetic label noise and a non-IID data partition; we report our results in Tab. 9. We observe that FedCorr achieves an improvement in best test accuracy of at least 7% over our baselines.
Method  Targeted communication cost across the seven noise settings (ratio relative to Ours in parentheses; "–" indicates the target accuracy was not reached)
Ours  50  60  90  70  110  90  190
FedAvg  160(3.2)  200(3.3)  210(2.3)  230(3.3)  300(2.7)  270(3.0)  470(2.5)
FedProx  300(6.0)  430(7.2)  500(5.6)  480(6.9)  690(6.3)  670(7.4)  1840(9.7)
RoFL  350(7.0)  420(7.0)  470(5.2)  440(6.3)  670(6.1)  490(5.4)  1710(9.0)
ARFL  120(2.4)  230(3.8)  170(1.9)  240(3.4)  390(3.5)  270(3.0)  –
JointOpt  160(3.2)  200(3.3)  220(2.4)  220(3.1)  250(2.3)  250(2.8)  860(4.5)
DivideMix  480(9.6)  560(9.3)  580(6.4)  590(8.4)  690(6.3)  930(10.3)  970(5.1)
C.2 Comparison of different architectures
To demonstrate that our proposed FedCorr is model-agnostic, especially with respect to the noisy-client identification scheme via cumulative LID scores, we conducted experiments on CIFAR-10 with IID data partition using different architectures: ResNet-18 [10], VGG-11 [28] and LeNet-5 [17]. Tab. 10 shows the best test accuracies of each model trained on CIFAR-10 with various levels of synthetic noise. For the experiments on VGG-11, we used the same hyperparameter values as in the experiments on ResNet-18. For LeNet-5, we only tuned the learning rate and fixed it at 0.003 in all experiments. Fig. 5 shows a further comparison between the architectures in terms of the distribution of the cumulative LID scores and the corresponding separations of the clients via Gaussian mixture models.
C.3 Comparison of communication efficiency
In this subsection, we discuss the communication efficiency of different methods. Given any implementation of an FL method and any desired target test accuracy, we define its targeted communication cost to be the lowest total communication cost required (in the experiments) to reach that target accuracy. Informally, the lower the targeted communication cost, the higher the communication efficiency.
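This definition can be sketched as follows (a hypothetical helper; the log pairs each cumulative communication cost with the test accuracy recorded at that point):

```python
def targeted_communication_cost(log, target_acc):
    """Lowest cumulative communication cost at which the recorded test
    accuracy first reaches target_acc; None if the target is never reached.

    log: list of (cumulative_cost, test_accuracy) pairs in training order,
    with cumulative_cost non-decreasing.
    """
    for cost, acc in log:
        if acc >= target_acc:
            return cost
    return None

log = [(100, 0.62), (200, 0.71), (300, 0.68), (400, 0.83)]
assert targeted_communication_cost(log, 0.70) == 200
assert targeted_communication_cost(log, 0.90) is None  # target never reached
```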
Tab. 11 and Tab. 12 compare the communication efficiency on CIFAR-10, in terms of the targeted communication cost at two target test accuracies, while Tab. 13 gives the corresponding comparison on CIFAR-100 at a single target accuracy. As our results show, FedCorr improves communication efficiency by a factor of at least 1.9 on CIFAR-10, and at least 1.3 on CIFAR-100.
C.4 Distribution of cumulative LID scores
Fig. 6 compares the distribution of the LID scores with the distribution of the cumulative LID scores after each iteration of the preprocessing stage. The LID scores of clean clients and noisy clients are well-separated after the second and third iterations, and the same holds for the cumulative LID scores. However, after the fourth iteration, the LID scores of noisy and clean clients start overlapping, while the cumulative LID scores remain well-separated. As discussed in the main paper, cumulative LID scores also have a stronger linear relation with local noise levels than LID scores. Hence, the cumulative LID score is the more robust metric for identifying noisy clients.
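The client separation used here can be sketched with scikit-learn (an illustration under our own naming, assuming a simple two-component mixture on the one-dimensional scores):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def identify_noisy_clients(cumulative_lid_scores, seed=0):
    """Fit a two-component 1-D Gaussian mixture to the clients' cumulative
    LID scores and flag the component with the larger mean as noisy."""
    scores = np.asarray(cumulative_lid_scores, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(scores)
    noisy_component = int(np.argmax(gmm.means_.ravel()))
    return [bool(label == noisy_component) for label in gmm.predict(scores)]

# Clients with clearly higher cumulative LID scores are flagged as noisy.
scores = [1.0, 1.1, 0.9, 1.2, 5.0, 5.2, 4.8]
assert identify_noisy_clients(scores) == [False] * 4 + [True] * 3
```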
C.5 Evaluation of label noise identification and label correction
Fig. 7 demonstrates the effectiveness of the label noise identification and correction process in the preprocessing stage on CIFAR-10. In the noise setting used for Fig. 7, a fixed proportion of the clients, on average, are randomly selected to have synthetic noise added to their local datasets before training, with the local noise level of each selected client bounded below. The top plot in Fig. 7 shows the estimated noise levels, in comparison with the ground-truth noise levels (before training and after stage 1), across all 100 clients. In particular, the large gap between the ground-truth noise levels before training (blue dotted line) and after stage 1 (orange line) demonstrates the effectiveness of our label correction process, while the small gap between the estimated noise levels (green line) and the ground-truth noise levels after stage 1 (orange line) reflects the accuracy of our local noise level estimation. Note that for clean clients (with zero ground-truth noise levels before training), FedCorr estimates their noise levels to be exactly zero in most cases; consequently, no additional label noise is introduced to these identified clean clients during label correction.
The bottom plot in Fig. 7 shows the separation between noisy and clean samples (via a Gaussian mixture model) for each identified noisy client, in terms of true/false positives/negatives. In particular, the small number of false positives across all identified noisy clients demonstrates the effectiveness of FedCorr in identifying noisy samples.
To further illustrate the effectiveness of the label correction process, we compared the confusion matrices of the given labels before training, the corrected labels after the preprocessing stage, and the corrected labels after the finetuning stage. Fig. 8 depicts the confusion matrices for the first 5 clients, in the experiments conducted on CIFAR-10 with IID data partition. For all five selected clients, the ground-truth noise levels after label correction are close to zero. Notice also that for client 2, whose dataset initially has no noisy labels, only a minimal amount of label noise is introduced during the label correction process.
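Such a label confusion matrix can be computed as follows (a generic sketch; function name and layout are ours):

```python
import numpy as np

def label_confusion_matrix(given_labels, true_labels, num_classes):
    """Rows index the ground-truth class; columns index the given
    (possibly noisy or corrected) label."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, g in zip(true_labels, given_labels):
        cm[t, g] += 1
    return cm

cm = label_confusion_matrix([0, 1, 1, 0], [0, 1, 0, 0], num_classes=2)
# One sample of true class 0 carries the (wrong) given label 1;
# the other three samples sit on the diagonal (correct labels).
assert cm[0, 1] == 1 and cm.trace() == 3
```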
C.6 Additional ablation study results
In Fig. 9, we show the effects of the components of FedCorr on test accuracies during training. In particular, without the finetuning stage, the total communication cost would be 5000; hence in Fig. 9, the curve plotted for FedCorr without finetuning ends at the 5000 mark, to the left of the second red dotted line (at communication cost 5500). As mentioned in the main paper, fraction scheduling plays the most significant role in FedCorr. In addition, the label correction process significantly improves training stability, especially in the usual training stage.
C.7 Illustration of non-IID data partitions on CIFAR-10
As reported in the main paper, we used 3 different non-IID local data settings for our experiments involving non-IID data partitions. In Fig. 10, we illustrate the detailed local class distributions and local dataset sizes for these three non-IID data settings on CIFAR-10, over 100 clients.
Appendix D Potential negative impact: the issue of freeloaders
In real-world FL implementations, there is an implicit assumption that clients collaborate to jointly train a global model. Although FedCorr allows for robust training of a global model even when some clients have label noise, this also covers the case of a "freeloader": a client whose local dataset has completely random labels (e.g., labels randomly assigned to an unlabeled dataset, without any actual annotation effort). By participating in the FedCorr framework, such a freeloader would effectively use FedCorr as its annotation process, whereby identified noisy labels are corrected. This would be unfair to clients that annotated their local datasets prior to participating in FedCorr.