With the rapid development of edge devices, e.g., mobile phones and wearable devices, unprecedented amounts of data are being generated. With these data, machine learning has made breakthrough progress and enabled many successful applications, e.g., next word prediction, cardiac health monitoring, load forecasting [3, 4], pose estimation [5, 6], and autonomous driving [7, 8, 9]. In these applications, traditional machine learning methods are centralized: the corresponding data is aggregated and processed on a server. However, due to privacy concerns and data protection regulations, it is difficult to implement these centralized methods in practice.
To solve the above problems, federated learning (FL) has emerged as a new paradigm, which constructs a shared model on a server by aggregating local models learned from the training data of each client. Since model training is performed locally on clients and no raw data is transmitted, FL can protect the privacy of clients and make full use of the computational resources of edge devices. However, compared to distributed learning in data center settings and traditional private data analysis, FL is challenged by statistical heterogeneity, where the data on clients is not independent and identically distributed (Non-IID). Ref.  points out that, on Non-IID data, the accuracy and the convergence rate of Federated Averaging (FedAvg), the most basic method in federated learning, degrade significantly. In FedAvg with Non-IID data, although the number of communication rounds can be reduced by increasing the number of local training iterations on clients, a large number of epochs may drive each client toward its local optimum rather than the global optimum, so that the algorithm diverges. Therefore, many studies are devoted to enhancing the performance of FedAvg on Non-IID data by improving local training.
To improve local training, FedProx introduces a proximal term that penalizes the local objective function when the distance between the local and global models becomes large. However, FedProx applies the same penalty coefficient to every parameter in the proximal term, although the importance of the parameters often varies across tasks. Inspired by lifelong learning, FedCurv penalizes deviations in parameters that carry a large amount of information for other clients. However, it requires clients to transmit their Fisher information to the server, which increases communication costs. To tackle this problem, FedCL introduces a small auxiliary dataset on the server, with which the Fisher information of the global model is calculated to obtain the penalty coefficient matrix. However, keeping a dataset on the server side risks leaking private data. Instead of adding constraints to the local objective, SCAFFOLD introduces global and local control variables to correct local updates, which doubles communication costs. Moreover, as shown in Ref. , these approaches fail to achieve good performance on image datasets, where they can perform as poorly as FedAvg.
In this paper, instead of limiting the distance of local model updates as most methods do, we propose global-update-guided federated learning (FedGG), which guides local updates by maximizing the cosine similarity between global and local updates. In this way, we can reduce model divergence between clients and improve the generalization ability of federated models. Unlike FedCurv, FedCL, and SCAFFOLD, our method does not introduce additional communication costs. Furthermore, considering that the update direction of the global model contains different amounts of information in different periods of training, we propose adaptive weights based on the distances of updates for local models, which makes our algorithm more flexible and stable. Our numerical experiments show that, compared with other advanced algorithms, FedGG has an accuracy improvement of , and times at most.
2.1 Problem Statement of Federated Learning
Federated learning aims to learn a consensus model in a decentralized manner. Concretely, considering clients where each client possesses a local dataset , federated learning learns a global model over the dataset with the coordination of a central server, while the raw data remains on the clients. The optimization problem for federated learning is
where denotes the cardinality of a set and
is the expectation of the cross-entropy loss function of the client in classification problems. Here, denotes the input data and its corresponding label .
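For reference, the weighted objective described above can be written in the standard FedAvg form. The notation here is assumed for illustration and may differ from the original symbols: $K$ clients, local datasets $D_k$ with union $D$, model parameters $w$, and a per-client loss $\ell_k$.

```latex
\min_{w} \; F(w) = \sum_{k=1}^{K} \frac{|D_k|}{|D|}\, F_k(w),
\qquad
F_k(w) = \mathbb{E}_{(x,y)\sim D_k}\big[\ell_k\big(w;(x,y)\big)\big],
\qquad
D = \bigcup_{k=1}^{K} D_k .
```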
2.2 Federated Averaging Algorithm
As the first algorithm in federated learning, the Federated Averaging (FedAvg) algorithm requires all clients to share the same training configuration, from optimizers to learning rates. In each communication round , a subset of clients with is selected and the central server sends a global model to them. Each selected client locally executes the stochastic gradient descent (SGD) algorithm for times, where is taken as the initial model. The local model is updated as
by optimizing a local objective , where , , is the learning rate, and is the gradient at the client . Since the batch stochastic gradient descent algorithm is usually used in practice, the number of updates can be computed as , where is the mini-batch size and is the number of epochs. In the above FedAvg algorithm, we require . Note that when , this algorithm reduces to FedSGD .
After completing the local updates, each client sends its local model to the server, which updates the global model as
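The local update and the size-weighted aggregation described in this subsection can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: models are plain lists of floats, full-batch gradient steps stand in for mini-batch SGD, and the names `local_sgd`, `fedavg_aggregate`, `centers`, and `sizes` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def local_sgd(w_global: Sequence[float],
              grad_fn: Callable[[Sequence[float]], Sequence[float]],
              lr: float,
              num_steps: int) -> Vector:
    """Run `num_steps` gradient steps, starting from the global model."""
    w = list(w_global)
    for _ in range(num_steps):
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def fedavg_aggregate(client_models: Sequence[Sequence[float]],
                     client_sizes: Sequence[int]) -> Vector:
    """Average client models, weighting each by its local dataset size."""
    total = sum(client_sizes)
    dim = len(client_models[0])
    return [
        sum(n * m[i] for m, n in zip(client_models, client_sizes)) / total
        for i in range(dim)
    ]


# Toy problem (illustrative values): client k minimises 0.5 * (w - c_k)^2,
# so its gradient is simply w - c_k.
centers = [1.0, 3.0]
sizes = [100, 300]
w_global = [0.0]
for _ in range(50):  # communication rounds
    local_models = [
        local_sgd(w_global, lambda w, c=c: [w[0] - c], lr=0.1, num_steps=5)
        for c in centers
    ]
    w_global = fedavg_aggregate(local_models, sizes)
```

In this toy setting the global model approaches the size-weighted mean of the client optima (2.5). With heterogeneous, non-quadratic objectives, the same loop is where client drift appears.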
3 Global-Update-Guided Federated Learning Algorithm
Unlike previous methods that improve FedAvg by limiting the distance of local updates, we address data heterogeneity by correcting the direction of updates for local models. Furthermore, given that the importance of the global model's update information varies across stages of training, we introduce an adaptive loss weight mechanism based on the distance of updates for the local model, which enlarges the adjustable range of the algorithm parameters and improves the stability of our algorithm. With the above considerations, we propose global-update-guided federated learning (FedGG).
3.1 Measurement of the Similarity of Updates
The main idea of our method is to guide the update of local models along the update direction of the global model. That is, we want the local update direction to be highly similar to the global one. In this paper, we use the cosine function, which is often used in text classification , to evaluate the similarity between the variations of two models.
First, we need to determine how to represent the update directions of the global model and the local models, respectively. For the global model, we use the difference between the last two global models received by clients, i.e.,
at communication round instead of the more intuitive choice of global gradients. This is because the global gradients are difficult to obtain and data privacy must also be protected. To show this, we can rewrite Eq. (3) as
Only when can the update process of the server be regarded as stochastic gradient descent with a learning rate of 1, in which case represents the global gradient. This would mean that each client needs to upload its model after every iteration, which defeats the original purpose of FedAvg of reducing communication costs through multiple local iterations. Nevertheless, when the number of local iterations , we can use to approximate the global gradient. However, in communication round , the participating clients cannot get , because can only be obtained after the local updates are performed. Hence, given the above issues, can be a good choice to represent the global model update direction, since it is similar to , especially in the early stage of communication, and can be easily obtained from by clients participating in federated learning.
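The rewriting of the aggregation step referred to above can be sketched as follows. The notation is assumed for illustration and is not taken from the original: $w^{r}$ is the global model at round $r$, $w_k^{r+1}$ the model returned by client $k$, $p_k$ the aggregation weight of client $k$, $\eta$ the learning rate, and $g_k$ the local gradient.

```latex
\begin{aligned}
w^{r+1} &= \sum_{k} p_k\, w_k^{r+1}
         = w^{r} - \sum_{k} p_k \big(w^{r} - w_k^{r+1}\big),
         \qquad \sum_{k} p_k = 1, \\
w^{r+1} &= w^{r} - \eta \sum_{k} p_k\, g_k(w^{r})
         \qquad \text{if each client takes a single step } w_k^{r+1} = w^{r} - \eta\, g_k(w^{r}).
\end{aligned}
```

In the single-step case, the aggregated difference $\sum_k p_k (w^{r} - w_k^{r+1})$ is exactly the weighted average of the local gradients scaled by $\eta$. With several local steps it is only an approximation of that quantity.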
For local models, we use the difference between the current local model and the global model to represent the update direction, i.e.,
for the client at the local iteration, rather than the gradient of the local model. This is because if the gradient of the local model were directly introduced into the objective function, the second derivatives required by the stochastic gradient descent update would greatly increase computational costs.
Hence, with the expressions of the global and local model updates, for the client at the local iteration, we can evaluate the similarity between the update directions of the global model and the local model by the cosine similarity
where denotes the inner product between vectors, denotes the norm of a vector, and denotes the angle between the update vectors of the global and local models at the iteration. The value of the cosine similarity of the update directions for the global and local models lies between -1 and 1. The greater the value, the more similar the update directions of the global and local models.
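The similarity measure above is straightforward to compute once both updates are flattened into vectors. The sketch below is illustrative only: the function name `cosine_similarity`, the sign conventions for the two update vectors, and the parameter values are assumptions, not taken from the original.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float],
                      eps: float = 1e-12) -> float:
    """Cosine of the angle between two update vectors.

    Returns 0.0 when either vector is (numerically) zero, since the
    angle is undefined in that case.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u < eps or norm_v < eps:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical flattened parameters, for illustration only.
w_global_prev = [0.0, 0.0]
w_global_curr = [1.0, 0.0]
w_local = [1.5, 0.5]

# Global update: difference between the last two global models.
global_update = [c - p for c, p in zip(w_global_curr, w_global_prev)]
# Local update: difference between the current local model and the global model.
local_update = [l - c for l, c in zip(w_local, w_global_curr)]

similarity = cosine_similarity(global_update, local_update)
```

Only model differences are needed here, so nothing beyond the global models that clients already receive has to be communicated.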
3.2 Local Training Objective
The basic idea of our method is to guide the update of the local model along the update direction of the global model. Thus, the local objective function can be divided into two parts. The first part is the original local loss function, which can be a classic supervised learning loss (e.g., cross-entropy loss) denoted as . The second part, which guides the update of the local model, is a model-cosine loss . Since the loss value should be smaller the more similar the model updates are, in communication round we define the model-cosine loss for the client as
Thus, the total loss for an input is computed as
where is a hyper-parameter to adjust the weights of the model-cosine loss. Hence, the local optimization problem is mathematically described as
With the local objective (10), we force the local model to fit the local data distribution with the help of the direction of the global model update, which alleviates the discrepancy between the local and global models under the Non-IID condition, and improves the generalization ability of the model.
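The two-part local objective can be sketched as below. The exact form of the model-cosine loss is an assumption here: `1 - cos(theta)` is used as one choice that decreases as the updates become more similar. The function names and the weight `mu` are likewise hypothetical, and the supervised loss is passed in as a precomputed number.

```python
import math
from typing import Sequence


def model_cosine_loss(global_update: Sequence[float],
                      local_update: Sequence[float]) -> float:
    """ASSUMED form 1 - cos(theta): 0 when aligned, 2 when opposite.

    Returns 0.0 if either update is the zero vector, e.g. at the first
    local iteration, when the local model still equals the global model.
    """
    norm_g = math.sqrt(sum(a * a for a in global_update))
    norm_l = math.sqrt(sum(b * b for b in local_update))
    if norm_g == 0.0 or norm_l == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(global_update, local_update))
    return 1.0 - dot / (norm_g * norm_l)


def total_local_loss(supervised_loss: float,
                     global_update: Sequence[float],
                     local_update: Sequence[float],
                     mu: float) -> float:
    """Supervised loss plus the weighted guidance term."""
    return supervised_loss + mu * model_cosine_loss(global_update, local_update)


# Illustrative values only.
aligned = total_local_loss(0.7, [1.0, 0.0], [2.0, 0.0], mu=0.5)
opposed = total_local_loss(0.7, [1.0, 0.0], [-1.0, 0.0], mu=0.5)
```

A local update pointing along the global update adds no penalty, while one pointing the opposite way adds the maximum penalty of `2 * mu` under this assumed form.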
3.3 Adaptive Loss Weights
To further explore the guiding role of model-cosine loss on local model update, we calculate the gradient of the local objective function with respect to the local model as
where , and for brevity, we denote and as and , respectively.
Note that, compared with FedAvg, FedGG introduces two vectors along the normalized vectors and to correct the local update of the client, with lengths and , respectively. The former is the update direction of the global model, which guides the update of the local model. The latter is the direction in which the current local model has moved away from the global model. Its length carries an extra coefficient compared to the former, which evaluates the local update direction. This is because local training also needs to fit the local data distribution, so the local update direction cannot be identical to the global update direction. If the local direction is similar to the direction of the global update, a larger weight is applied, and vice versa.
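The two correction vectors described above follow from differentiating the cosine similarity. As a sketch, assume the model-cosine loss has the form $\mu(1-\cos\theta)$, and write $a$ for the global update and $b$ for the local update (the difference between the current local model and the global model). This notation is introduced here and is not taken from the original. Since $b$ differs from the local model only by a constant, the gradient with respect to the local model equals the gradient with respect to $b$:

```latex
\nabla_{b}\cos\theta
  = \nabla_{b}\,\frac{\langle a, b\rangle}{\|a\|\,\|b\|}
  = \frac{1}{\|b\|}\,\frac{a}{\|a\|} \;-\; \frac{\cos\theta}{\|b\|}\,\frac{b}{\|b\|},
\qquad
\nabla_{b}\big[\mu\,(1-\cos\theta)\big]
  = -\frac{\mu}{\|b\|}\,\frac{a}{\|a\|} \;+\; \frac{\mu\cos\theta}{\|b\|}\,\frac{b}{\|b\|}.
```

Under this assumption, the two terms lie along the normalized global and local updates, with lengths $\mu/\|b\|$ and $\mu\cos\theta/\|b\|$. The second carries the extra factor $\cos\theta$.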
Furthermore, we note that if the weight-tuning parameter is set to be a constant, the coefficient decreases as the distance increases. However, according to the results in Ref. , as the number of communication rounds increases, the update scale of local models gradually decreases. This means that, with a constant , the coefficient is smaller in the early stage of training and larger in the later stage of training. This is undesirable because a large number of experiments show that the update direction of the global model is more informative in the early stage, whereas it is less helpful in the later stage of training since the global model is close to convergence. Hence, we consider an adaptive design for the hyper-parameter , which is defined as
at the local iteration, which adjusts the weight of the model-cosine loss. Note that we require since, at the first local iteration, the current local model is initialized with the global model, so there are no updates and no model-cosine loss.
Although relates to the local model at the current moment, we treat it as a constant in order to reduce the local computational cost, i.e., . Thus, at the local iteration, we substitute the adaptive loss weights of Eq. (12) into Eq. (11) and obtain
So far, by introducing adaptive weights, the weights of the two directions that guide the local update are determined by the distance of updates at the latest local iteration, so the weight of the model-cosine loss is expected to decrease as the number of communication rounds increases.
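The motivation for adaptive weights can be checked with a small numerical sketch. It assumes, as one reading of the gradient discussion above, that with a constant weight `mu` the correction along the normalized global update has length `mu / ||b||`, where `||b||` is the distance between the local and global models. The function name and all values below are hypothetical, and the sketch does not reproduce the adaptive weight of Eq. (12).

```python
def guidance_coefficient(mu: float, local_update_norm: float) -> float:
    """Length of the correction along the normalized global update,
    under the ASSUMED form mu / ||b|| for a constant weight mu."""
    if local_update_norm <= 0.0:
        raise ValueError("local update norm must be positive")
    return mu / local_update_norm


mu = 0.01  # hypothetical constant weight
# Hypothetical local update norms: large early in training, small later.
norms_by_stage = {"early": 1.0, "late": 0.1}
coefficients = {
    stage: guidance_coefficient(mu, norm)
    for stage, norm in norms_by_stage.items()
}
```

With these illustrative numbers, the later stage receives a coefficient about ten times larger than the early stage. This is the imbalance that a weight depending on the local update distance is meant to remove.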
In summary, our FedGG algorithm is described in Algorithm 1 without a sampling mechanism. FedGG is still applicable in federated learning with non-persistent clients. In that case, we simply replace with the latest received global model, excluding the one received in the current round, when calculating the direction of the global model update. Note that on line 19 of Algorithm 1, we adopt for the local update instead of , because we treat as a constant to reduce the local computational cost.
4.1 Experimental Setup
We compare FedGG with three state-of-the-art federated learning algorithms: FedAvg, FedProx, and SCAFFOLD. To evaluate the performance of our algorithm, we test it on three different datasets, namely SVHN (99,289 images with 10 classes), CIFAR-10 (60,000 images with 10 classes), and CIFAR-100 (60,000 images with 100 classes). Because these datasets differ in classification difficulty, networks of different complexity are selected. For SVHN, we adopt a three-layer MLP with ReLU activation, where the dimensions of the input layer, hidden layer, and output layer are 3×32×32, 512, and 10, respectively. For CIFAR-10, we use a simple CNN, which is composed of two 5×5 convolution layers with 6 and 16 channels, respectively. They are followed by 2×2 max pooling and two fully connected layers whose dimensions are 120 and 84, respectively. This network uses ReLU as the activation function. For CIFAR-100, we use VGG-16 for feature extraction and classification. For a fair comparison, we adopt the same network structure for all algorithms.
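As a quick consistency check on the CNN described above, the feature-map sizes can be traced by hand. The sketch assumes details that the text does not state: 32×32 RGB inputs, no padding, stride 1 in the convolutions, and a 2×2 max pooling after each of the two convolution layers (a LeNet-style layout). The helper names are hypothetical.

```python
def conv2d_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool_out(size: int, kernel: int) -> int:
    """Spatial output size of non-overlapping pooling (stride = kernel)."""
    return size // kernel


side = 32                                     # assumed 32x32 input images
side = max_pool_out(conv2d_out(side, 5), 2)   # conv 5x5, then 2x2 pooling
side = max_pool_out(conv2d_out(side, 5), 2)   # conv 5x5, then 2x2 pooling
flattened = 16 * side * side                  # 16 channels after the second conv

mlp_input_dim = 3 * 32 * 32                   # flattened RGB image for the MLP
```

Under these assumptions, the first fully connected layer would map the flattened feature vector to 120 units, followed by 84 units and the class outputs.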
To simulate real-world Non-IID data, similar to previous studies in Refs. [18, 20], we use a vector with , and to represent the distribution of data classes for each client, where and the parameter adjusts the degree of Non-IID data. A smaller indicates a more unbalanced distribution of the data. Taking as an example, we show the distribution of data on 10 clients in Fig. 1.
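One common way to generate such per-client class distributions is to sample them from a Dirichlet distribution, where a smaller concentration parameter gives more skewed clients. Whether this matches the construction used here is an assumption. The sketch below uses a symmetric Dirichlet, and the names `dirichlet_sample`, `client_class_distributions`, and `alpha` are hypothetical.

```python
import random
from typing import List


def dirichlet_sample(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw one Dirichlet(alpha) sample by normalising Gamma variates."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0.0:  # guard against numerical underflow for tiny alpha
            return [d / total for d in draws]


def client_class_distributions(num_clients: int,
                               num_classes: int,
                               alpha: float,
                               seed: int = 0) -> List[List[float]]:
    """One class-probability vector per client (non-negative, sums to 1)."""
    rng = random.Random(seed)
    return [
        dirichlet_sample([alpha] * num_classes, rng)
        for _ in range(num_clients)
    ]


# Illustrative setting: 10 clients, 10 classes, hypothetical alpha.
distributions = client_class_distributions(num_clients=10, num_classes=10, alpha=0.5)
```

Each client's local dataset would then be drawn so that its class proportions follow its vector. Lowering `alpha` concentrates each client on fewer classes.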
We use PyTorch to implement FedGG and the other algorithms. For training, we use the SGD optimizer with a learning rate of 0.01 and a momentum of 0.9. We set the number of epochs and the number of clients in each communication round. Besides, the batch size and the number of communication rounds is set to 100, at which point all algorithms barely improve on the three datasets.
4.2 Comparison on Accuracy
We evaluate the performance of the four algorithms on the three datasets with and , and each algorithm is run three times. For FedGG, we let the hyperparameters and show that it achieves the best performance on all three datasets with . For FedProx, we set its hyperparameter to 0.01, which is the best value mentioned in Ref. . The highest test accuracies of these algorithms over 100 communication rounds are given in Table 1 and Table 2.
From the results in Table 1 and Table 2, it can be concluded that FedGG achieves the highest test accuracy on all three datasets for both and . In particular, in the extremely unbalanced case , where the data distribution is closer to a practical situation, FedGG performs better, with test accuracies that are respectively , and higher than the second-highest test accuracy on the three datasets. Furthermore, we notice that the performance of the same algorithm on the same dataset degrades in the case of compared to , since the model divergence between clients is more severe. By introducing the model-cosine loss, the local model update is guided by the direction of the global model update, which alleviates this problem.
4.3 Comparison on Communication Efficiency
Communication efficiency is a major concern in federated learning. In Fig. 2 we show the test accuracy in each round for and . As shown, compared to the other three algorithms, FedGG improves both the final accuracy and the convergence speed. Especially in the early stage of training, the accuracy of FedGG is significantly better than that of the other algorithms, which substantially reduces communication costs. Similarly, in the case of , due to the model-cosine loss, the advantage of FedGG becomes more pronounced. It is worth mentioning that the optimal hyperparameter in FedProx is usually small, so it performs very similarly to FedAvg. SCAFFOLD introduces control variables, so the data uploaded in each communication round is twice that of the other algorithms.
To show the high communication efficiency of FedGG, we take the test accuracy of FedAvg at the communication round as a benchmark, and list the communication rounds required for the other algorithms to achieve this accuracy in Table 3 and Table 4 with and , respectively. We observe that, compared to the other algorithms, FedGG requires significantly fewer communication rounds, especially on the CIFAR-10 dataset with . The speedup reaches 2.63 times, which greatly improves the communication efficiency.
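The rounds-to-target comparison described above can be computed from per-round accuracy curves as in the following sketch. The function names are hypothetical and the accuracy curves are made-up illustrative values, not results from the experiments.

```python
from typing import Optional, Sequence


def rounds_to_reach(accuracies: Sequence[float], target: float) -> Optional[int]:
    """1-based index of the first round whose accuracy is >= target,
    or None if the target is never reached."""
    for round_idx, acc in enumerate(accuracies, start=1):
        if acc >= target:
            return round_idx
    return None


def speedup(baseline_rounds: int, method_rounds: int) -> float:
    """How many times fewer rounds the method needs than the baseline."""
    return baseline_rounds / method_rounds


# Made-up per-round test accuracies, for illustration only.
baseline_curve = [0.30, 0.45, 0.52, 0.58, 0.60]
method_curve = [0.40, 0.61, 0.63, 0.64, 0.65]

target = baseline_curve[-1]  # baseline accuracy at its final round
baseline_rounds = rounds_to_reach(baseline_curve, target)
method_rounds = rounds_to_reach(method_curve, target)
ratio = speedup(baseline_rounds, method_rounds)
```

Because every algorithm except SCAFFOLD uploads the same amount of data per round, a ratio of rounds translates directly into a ratio of communicated data for those methods.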
4.4 Adaptive Loss Weights
To investigate the effect of the adaptive loss weights on model training, we test our algorithm on the CIFAR-10 dataset with and set the coefficient of the adaptive loss weights as and the fixed-value loss weights as . The variation of the test accuracy with communication rounds is shown in Fig. 3. As shown, the best performances for the adaptive and fixed weights are and , respectively, and they are very close to each other. The algorithm is stable under adaptive loss weights with different . However, the test accuracy diverges when the fixed-value weight , which is only twice the value that gives the best performance. This means that the adaptive weight has a larger adjustable range of the parameter than the fixed-value weight, which makes it more stable and easier to implement in practice. In practice, the server generally has no test data, so the global model can only be evaluated when the local training is over. In that situation, a larger adjustable range of the parameter provides better tolerance for failure. Furthermore, although the small values of both the fixed-value weights and the adaptive weights calculated from Eq. (12) suggest that the model-cosine loss has little impact on the total local loss, the large order of magnitude of the gradient of the model-cosine loss allows it to meaningfully affect the local update.
In this paper, we have proposed global-update-guided federated learning (FedGG). By introducing the model-cosine loss, the local model can fit the local data while updating closely along the direction of the global model update. This reduces the divergence between the local and global models and improves the performance of the model on Non-IID data. Besides, by introducing the adaptive loss weights, the adjustable range of the algorithm parameters is enlarged, making the method more stable and easier to implement in practice. Our experiments show that, compared with the other state-of-the-art algorithms, FedGG’s accuracy increases by , and on Non-IID SVHN, CIFAR-10, and CIFAR-100 datasets. The speedup in communication efficiency is up to 2.63 times, without increasing the amount of communication per round.
- [1] Q. Yang, Y. Liu, T. Chen, et al, Federated Machine Learning: Concept and Applications, ACM Transactions on Intelligent Systems and Technology (TIST), 10(2): 1–19, 2019.
- [2] M. Zhang, Y. Wang, T. Luo, Federated Learning for Arrhythmia Detection of Non-IID ECG, IEEE 6th International Conference on Computer and Communications (ICCC), 2020: 1176–1180.
- [3] Y. Guan, D. Li, S. Xue, et al, Feature-Fusion-Kernel-Based Gaussian Process Model for Probabilistic Long-Term Load Forecasting, Neurocomputing, 426(2): 174–184, 2021.
- [4] S. Jia, Z. Gan, Y. Xi, et al, A Deep Reinforcement Learning Bidding Algorithm on Electricity Market, Journal of Thermal Science, 29(5): 1125–1134, 2020.
- [5] Q. Guan, W. Li, S. Xue, et al, High-Resolution Representation Object Pose Estimation from Monocular Images, Chinese Automation Congress, 2021: 980–984.
- [6] Q. Guan, Z. Sheng, S. Xue, HRPose: Real-Time High-Resolution 6D Pose Estimation Network Using Knowledge Distillation, Chinese Journal of Electronics, accepted, 2022.
- [7] Z. Sheng, S. Xue, Y. Xu, et al, Real-Time Queue Length Estimation With Trajectory Reconstruction Using Surveillance Data, 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2020: 124–129.
- [8] Z. Sheng, Y. Xu, S. Xue, et al, Graph-Based Spatial-Temporal Convolutional Network for Vehicle Trajectory Prediction in Autonomous Driving, IEEE Transactions on Intelligent Transportation Systems, online published, 2022. DOI: 10.1109/TITS.2022.3155749
- [9] Z. Sheng, L. Liu, S. Xue, et al, A Cooperation-Aware Lane Change Method for Autonomous Vehicles, ArXiv Preprint, arXiv: 2201.10746, 2022.
- [10] S. AbdulRahman, H. Tout, H. Ould-Slimane, et al, A Survey on Federated Learning: The Journey from Centralized to Distributed On-Site Learning and Beyond, IEEE Internet of Things Journal, 8(7): 5476–5497, 2020.
- [11] P. Voigt, A. Bussche, The EU General Data Protection Regulation (GDPR), A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
- [12] B. McMahan, E. Moore, D. Ramage, et al, Communication-Efficient Learning of Deep Networks from Decentralized Data, Artificial Intelligence and Statistics, 2017: 1273–1282.
- [13] T. Li, K. Sahu, A. Talwalkar, et al, Federated Learning: Challenges, Methods, and Future Directions, IEEE Signal Processing Magazine, 37(3): 50–60, 2020.
- [14] X. Li, K. Huang, W. Yang, et al, On the Convergence of FedAvg on Non-IID Data, International Conference on Learning Representations (ICLR), 2019: 1–26.
- [15] T. Li, A. Sahu, M. Zaheer, et al, Federated Optimization in Heterogeneous Networks, Proceedings of Machine Learning and Systems, 2020: 429–450.
- [16] N. Shoham, T. Avidor, A. Keren, et al, Overcoming Forgetting in Federated Learning on Non-IID Data, NeurIPS Workshop, 2019: 1–6.
- [17] S. Hou, X. Pan, C. Loy, et al, Lifelong Learning via Progressive Distillation and Retrospection, Proceedings of the European Conference on Computer Vision (ECCV), 2018: 437–452.
- [18] X. Yao, L. Sun, Continual Local Training for Better Initialization of Federated Models, 2020 IEEE International Conference on Image Processing (ICIP), 2020: 1736–1740.
- [19] S. Karimireddy, S. Kale, M. Mohri, et al, SCAFFOLD: Stochastic Controlled Averaging for Federated Learning, International Conference on Machine Learning, 2020: 5132–5143.
- [20] Q. Li, B. He, D. Song, Model-Contrastive Federated Learning, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021: 10713–10722.
- [21] B. Li, L. Han, Distance Weighted Cosine Similarity Measure for Text Classification, International Conference on Intelligent Data Engineering and Automated Learning, 2013: 611–618.
- [22] R. Geyer, T. Klein, M. Nabi, Differentially Private Federated Learning: A Client Level Perspective, NeurIPS Workshop, 2017: 1–7.