Protecting data privacy is increasingly critical in the big data era. It is not only a public concern but also a rule enforced by laws such as the General Data Protection Regulation (GDPR) in the European Union. Data massively distributed across devices (e.g., mobile phones, wearables, IoT devices) or organizations (e.g., hospitals, companies, courts) should not be collected on a central server but should stay local, in the form of isolated islands. Data that must remain local poses a challenge for machine learning. Federated learning (FL) (McMahan et al., 2016) is a machine learning setting in which clients collaboratively train a shared model under the orchestration of a central server while keeping the training data decentralized (Kairouz et al., 2019). It is an emerging technique that offers a way out of the “data island” dilemma and has led to applications in mobile apps, autonomous driving, healthcare, and financial services. However, research on federated learning faces some unique challenges; in this paper we focus on heterogeneity, which we summarize as three specific types:
Data Heterogeneity: Unlike the independent and identically distributed (IID) data of centralized machine learning, the isolated data in a federated setting are Non-IID: the data held by different clients may be generated by distinct distributions. This statistical heterogeneity leads to a significant accuracy reduction compared with IID data, which can be explained by weight divergence (Zhao et al., 2018) at the weight-averaging stage.
Model Heterogeneity: The prototypical federated learning algorithm (FederatedAveraging) (McMahan et al., 2016), in which the global model is formed by averaging client weights, cannot meet the need for customized models on various devices such as mobile phones, wearables, and IoT devices. Clients differ in communication and computing capabilities and in the representation of their local data (Liang et al., 2020; Gao et al., 2019), or may need fast deployment to devices through neural architecture search (NAS) (Wu et al., 2019), so they may need to design their own models. Besides, a local model also raises privacy issues: it can be regarded as private property that should be protected from being stolen.
Objective Heterogeneity: The objective of federated learning is ambiguous. Is training a single model the right goal? On the one hand, federated learning aims to train a global model for all clients and new participants; on the other hand, clients that participate in training compromise their personalities to reach consensus. There is a tradeoff between the local and global models. It is also doubtful whether the federated pre-trained model has to be an out-of-the-box (OOTB) product.
In this work, we propose a novel paradigm for federated learning, named Federated Mutual Learning (FML), in response to these three heterogeneities; this is our main contribution. First, FML deals with data heterogeneity by training independent models, so that the Non-IIDness of data is no longer a bug but a feature that lets clients be served better individually. Second, FML allows clients to design customized models for various devices and protects those models from being stolen. Third, local customized models benefit from collaborative training without compromising their personalities, while the global model does not have to be an out-of-the-box (OOTB) product but can be a meta-learner that requires local adaptation for new participants. The experiments show that FML achieves better performance, robustness, and communication efficiency than the alternatives.
2 Related Work
The canonical federated learning algorithm (FederatedAveraging) trains a global model in a distributed manner. The difference between federated learning and distributed learning (which usually refers to distributed training in a data center) is that in federated learning the clients' data stay local and cannot be accessed by others. This feature protects data privacy but leads to Non-IID and unbalanced data distributions, which make training harder. The main difficulty of training on Non-IID data is accuracy reduction. Zhao et al. (2018) explain that, under Non-IIDness, the accuracy reduction can be understood in terms of weight divergence, which results in a non-negligible deviation from the correct updates at the averaging stage. The authors also propose a data-sharing strategy that creates a small, globally shared subset of data. This strategy can effectively improve accuracy, and, for privacy, the shared data can be extracted by distillation (Wang et al., 2018) or generated by a generative adversarial network (GAN) (Chen et al., 2019). Many theoretical works on FederatedAveraging focus on convergence analysis and on relaxing assumptions in the Non-IID setting (Li et al., 2019a, b; Lian et al., 2017). However, these strategies cannot match the performance achieved in the IID setting.
Although model heterogeneity arises from the individual requirements of various clients, it is also intimately related to data heterogeneity, since a customized model for each client makes Non-IIDness a feature rather than a bug in the training process (Kairouz et al., 2019). Smith et al. (2017) introduce the MOCHA framework for multi-task federated learning, which deals with high communication cost, stragglers, and fault tolerance. Khodak et al. (2019) present Average Regret-Upper-Bound Analysis (ARUBA), a theoretical framework for analyzing gradient-based meta-learning. These frameworks allow separate models to be trained, but the model architectures are still controlled by the central server. Li and Wang (2019) propose a decentralized framework based on knowledge distillation, which enables federated learning for independently designed models. However, this method requires a public dataset and provides no global model for subsequent use. It also does not support new participants, since they may wreck established models.
The objective of traditional federated learning is to train a global model that can be used by all clients. However, in the personalization setting, Yu et al. (2020) show that some participants may not benefit from the global model when it is less accurate than their local model. For clients whose local dataset is small, the global model will overfit these local data, which hurts its ability to personalize. Jiang et al. (2019) point out that optimizing only for global accuracy makes the model harder to personalize. They therefore propose three objectives for personalization in federated learning: (1) developing improved personalized models that benefit a large majority of clients; (2) developing an accurate global model that benefits clients who have limited private data for personalization; (3) attaining fast model convergence in a small number of training rounds.
The aim of typical federated learning is to learn a shared model over decentralized data. In the federated setting, data cannot be collected on a central server and must stay on the various devices, to protect data privacy. However, typical federated learning is neither considerate nor flexible: the shared global model suffers a serious accuracy reduction on Non-IID data; local models must share the same structure because aggregation is done by weighted averaging; local models must compromise their personalities to reach consensus during training; and the single global model is an out-of-the-box (OOTB) product. Our strategy for solving these problems is to train distinct models for each client. In this section, we introduce Federated Mutual Learning (FML) for the federated setting to handle these challenges.
3.1 Typical Federated Learning Setup
The aim of typical federated learning (FederatedAveraging) is to learn a single shared model over decentralized data, that is, to minimize the global objective

$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \tag{1}$$

in a distributed manner, where the whole dataset $D = \bigcup_{k=1}^{K} D_k$ is the union of the decentralized datasets, and the loss is accumulated over every private data point. Given private data $D_k = (X_k, Y_k)$, generated by a distinct distribution $P_k(x, y)$ on each of the $K$ clients, federated learning on each client starts by copying the weight vector $w$ of the global model into the local weights $w_k$. Each client then performs a local update that optimizes the local objective by gradient descent for several epochs:

$$F_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in D_k} \ell(x_i, y_i; w), \qquad w_k \leftarrow w_k - \eta \nabla F_k(w_k) \tag{2}$$

where $F_k$ is the loss function of the $k$-th client, $n_k$ is the number of local samples, $\eta$ is the learning rate, and $\nabla F_k(w_k)$ is the gradient vector. Note that the expectation $\mathbb{E}_{D_k}[F_k(w)] = F(w)$ may not hold, because $P_k \neq P_j$ for $k \neq j$ in the Non-IID setting. After a period of local updates, clients transmit their local model weights to the parameter server, which aggregates them by weighted averaging: $w \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_k$, where $w$ is the weights of the global model and $n = \sum_{k=1}^{K} n_k$ is the number of samples over all clients. The whole training process is repeated until the global model converges, so the single shared global model learns from collaborative training without any sharing of private local data. However, the local model is trained directly on a copy of the global model, so typical federated learning faces the problems described above. Training distinct models for clients is a natural way to solve these problems.
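The local update and the weighted averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: models are represented as flat lists of floats, and the function names `local_sgd_step` and `fedavg_aggregate` as well as the toy numbers are assumptions introduced here.

```python
from typing import List, Sequence


def local_sgd_step(weights: Sequence[float], grads: Sequence[float], lr: float) -> List[float]:
    """One gradient-descent step on a client: w_k <- w_k - lr * grad."""
    return [w - lr * g for w, g in zip(weights, grads)]


def fedavg_aggregate(
    client_weights: Sequence[Sequence[float]], client_sizes: Sequence[int]
) -> List[float]:
    """Sample-weighted average of client weights: w <- sum_k (n_k / n) * w_k."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n_k / total * w_k[j] for w_k, n_k in zip(client_weights, client_sizes))
        for j in range(dim)
    ]


# Toy example (hypothetical numbers): two clients holding 1 and 3 samples.
client_weights = [[1.0, 0.0], [3.0, 2.0]]
client_sizes = [1, 3]
global_weights = fedavg_aggregate(client_weights, client_sizes)
print(global_weights)  # [2.5, 1.5]
```

The client with three samples pulls the average three times as hard as the client with one sample, which is the behavior the weighted average is designed to have.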
3.2 The Process of Deep Mutual Learning
To train distinct models, we introduce deep mutual learning (DML) (Zhang et al., 2018). DML is a deep learning strategy derived from knowledge distillation (Hinton et al., 2015). Knowledge distillation was first proposed to transfer dark knowledge from a powerful large network or ensemble to a small network, in order to meet low-memory or fast-execution requirements. Unlike the one-way transfer in knowledge distillation, from a static, well-trained teacher to an untrained student, DML does not require a pre-trained teacher: student and teacher learn from each other throughout the training process.
The process of DML is shown in Figure 1: a set of training samples is fed to two deep networks $\Theta_1$ and $\Theta_2$, and each model trains on the same data with two loss functions, a prediction loss and a consensus loss (usually cross entropy and Kullback-Leibler (KL) divergence), so as to fit the data and reach consensus at the same time:

$$L_{\Theta_1} = L_{C_1} + D_{KL}(p_2 \,\|\, p_1), \qquad L_{\Theta_2} = L_{C_2} + D_{KL}(p_1 \,\|\, p_2) \tag{3}$$

where $L_{C_1}$ and $L_{C_2}$ are the prediction losses and $p_1$, $p_2$ are the class-probability predictions that the two networks compute from their logits. The aim of the two models is to train on the same data while distilling dark knowledge from each other. As training is repeated, models trained in this way can achieve better performance than models trained independently. Note that the two networks need not have the same architecture and that knowledge transfer is two-way. With these two properties, DML can help train distinct models in federated learning.
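The two-loss structure of DML can be made concrete with a small numerical sketch. It assumes each loss is the sum of a cross-entropy term and a KL term toward the peer's prediction, as in Zhang et al. (2018); the function names and the toy logits are hypothetical, and only the Python standard library is used.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Prediction loss for a single sample with an integer class label."""
    return -math.log(probs[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dml_losses(logits_1: Sequence[float], logits_2: Sequence[float], label: int) -> Tuple[float, float]:
    """Losses of the two peers: own cross entropy plus KL from the peer's prediction.

    In real training the peer's prediction is treated as a constant target
    when each network is updated (an assumption of this sketch).
    """
    p1, p2 = softmax(logits_1), softmax(logits_2)
    loss_1 = cross_entropy(p1, label) + kl_divergence(p2, p1)
    loss_2 = cross_entropy(p2, label) + kl_divergence(p1, p2)
    return loss_1, loss_2


# Toy example: the two networks disagree, so both consensus terms are positive.
loss_1, loss_2 = dml_losses([2.0, 0.5, -1.0], [0.5, 1.5, 0.0], label=0)
print(loss_1, loss_2)
```

When both networks output identical predictions, the KL terms vanish and each loss reduces to its plain cross entropy.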
3.3 Federated Mutual Learning
Three factors draw our attention in FederatedAveraging:
First, the method performs poorly in the Non-IID setting, especially under extremely unbalanced distributions, which is explained by weight divergence (Zhao et al., 2018).
Second, the models of all clients must have the same architecture. This is reasonable for some applications such as mobile apps, since the clients have been normalized, for example by using similar devices. However, if the clients are different kinds of devices, such as vehicles, robots, or databases, which may need to design their own models for personal use and protect those customized models, FederatedAveraging does not work.
Third, the objective of FederatedAveraging, training a single shared global model, is neither considerate nor flexible. Even if the performance of the global model is acceptable, the personalities of clients have been neutralized in order to reach global consensus. In addition, the single shared model has to be an out-of-the-box (OOTB) model, which is preferable for new participants that have little data; yet it is also reasonable to train the global model as a meta-learner, which adapts better to clients that have more than a little data.
In our view, Non-IIDness should not be a bug in the training process but a feature that lets us serve each individual better. We therefore propose a novel federated learning paradigm, named Federated Mutual Learning (FML), that deals with these problems. We consider federated learning as a knowledge transfer process between the global model and the local models, and implement it by deep mutual learning (DML). Each client in FML holds two models (Figure 2): one is the meme model, which is the medium of knowledge transfer between the global and local models, and the other is the customized local model that the client designs for private use.
In the training process, FML starts with an initial global model, which is controlled by the central server. Each client either initializes a customized local model or uses the model provided by the central server, as it prefers. Each client then forks the global model as its meme model and starts its local update. Instead of directly training a copy of the global model, the local update for each client executes DML between the meme model and the local model for several epochs. We rewrite Equation 2 as:

$$L_{local} = \alpha L_{C_{local}} + (1 - \alpha) D_{KL}(p_{meme} \,\|\, p_{local}), \qquad L_{meme} = \beta L_{C_{meme}} + (1 - \beta) D_{KL}(p_{local} \,\|\, p_{meme}) \tag{4}$$

where $\alpha$ and $\beta$ are the hyper-parameters. Knowledge transfer between the meme and local models is two-way: the meme model transfers global knowledge to the local model while the local model gives feedback, and both models are trained on private data. After finishing its local update, each client uploads its trained meme model to the central server, and these meme models are merged into the global model. The whole process is repeated until convergence. The algorithm is as follows:
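One round of the procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' listing: models are flat lists of floats, `fml_round` and `toy_update` are hypothetical names, and `toy_update` is only a stand-in for the DML-based local update (it nudges both models toward the mean of the client's data).

```python
from typing import Callable, Dict, List, Tuple

Weights = List[float]
LocalUpdate = Callable[[Weights, Weights, List[float]], Tuple[Weights, Weights]]


def fml_round(global_weights: Weights, clients: List[Dict], local_update: LocalUpdate) -> Weights:
    """Run one FML round and return the merged global model."""
    trained_memes = []
    for client in clients:
        meme = list(global_weights)  # fork the global model as the meme model
        # Mutual learning between meme and local model on private data.
        meme, client["local"] = local_update(meme, client["local"], client["data"])
        trained_memes.append(meme)  # only the meme model leaves the client
    # Merge meme models with equal weight per client (no sample counts needed).
    return [sum(column) / len(trained_memes) for column in zip(*trained_memes)]


def toy_update(meme: Weights, local: Weights, data: List[float]) -> Tuple[Weights, Weights]:
    """Placeholder local update (assumption): move weights halfway to the data mean."""
    target = sum(data) / len(data)
    new_meme = [w + 0.5 * (target - w) for w in meme]
    new_local = [w + 0.5 * (target - w) for w in local]
    return new_meme, new_local


# Two clients whose local models have different sizes (different architectures).
clients = [
    {"local": [0.0], "data": [1.0, 3.0]},
    {"local": [10.0, 10.0], "data": [4.0, 8.0]},
]
new_global = fml_round([0.0], clients, toy_update)
print(new_global)  # [2.0]
```

The local models are updated in place on the client and are never sent to the server, while the meme models all share the shape of the global model so that they can be averaged.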
Note that the weighted-average term of FederatedAveraging is abandoned in our approach, which is discussed in Section 6. From the global perspective, the global model is produced by averaging the meme models, which have distilled knowledge from the local models and the private data. FML would thus degrade into typical FederatedAveraging if the meme models received no knowledge from the local models, that is, if the consensus (KL) term were dropped from the meme loss. From the local perspective, local models continuously distill from local data and distill knowledge from the meme models at each round, as shown in Figure 3. It is the merging of meme models that enables local models to benefit from collaborative training. Throughout the whole process, local models are not replaced by a copy of the global model and never leave the clients, so local models serve clients privately, and the privacy of the local models is protected.
After several periods of federated training, the global model does not have to be an out-of-the-box (OOTB) product. Some clients may reasonably have no data, but FML can provide better personal service to clients with local private data, and such clients are the preferred members of the federated community, both for training and for service. When a new participant joins the federated community, it can choose to initialize a customized model or use the model provided by the central server, and then performs local adaptation by DML. The workflow of FML is shown in Figure 4. The global model can be regarded as a meta-learner (Jiang et al., 2019) for different applications and tasks: the federated training of the global model can be reformulated as FO-MAML, which yields a global model that is easier to personalize. Therefore, a new participant can initialize with a meta-learner that achieves better performance after local adaptation.
In this section we validate the performance, robustness, and communication efficiency of FML. We show that FML achieves better performance than FederatedAveraging in both the IID and Non-IID settings. In FML, the Non-IIDness of data becomes a feature that lets clients be served individually. Local customized models benefit from collaborative training without compromising their personalities. FML also shows better robustness and communication efficiency than FederatedAveraging.
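The degenerate case mentioned above can be checked numerically. The sketch below assumes that each FML loss is a convex combination of a cross-entropy term and a KL term toward the other model, with hypothetical weights `alpha` (local model) and `beta` (meme model); under that assumed form, `beta = 1` removes the consensus term from the meme loss, so the meme model is trained exactly like a FederatedAveraging client model.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Prediction loss for a single sample with an integer class label."""
    return -math.log(probs[label])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def fml_losses(
    local_logits: Sequence[float],
    meme_logits: Sequence[float],
    label: int,
    alpha: float,
    beta: float,
) -> Tuple[float, float]:
    """Assumed FML losses: weighted cross entropy plus weighted KL toward the peer."""
    p_local, p_meme = softmax(local_logits), softmax(meme_logits)
    loss_local = alpha * cross_entropy(p_local, label) + (1 - alpha) * kl_divergence(p_meme, p_local)
    loss_meme = beta * cross_entropy(p_meme, label) + (1 - beta) * kl_divergence(p_local, p_meme)
    return loss_local, loss_meme


# With beta = 1 the meme loss ignores the local model entirely.
_, meme_loss = fml_losses([1.0, 0.0], [0.0, 1.0], label=0, alpha=0.5, beta=1.0)
plain_ce = cross_entropy(softmax([0.0, 1.0]), 0)
print(abs(meme_loss - plain_ce) < 1e-12)  # True
```

Even in this case the aggregation still differs from FederatedAveraging in one respect: the meme models are merged with equal weight per client rather than by sample count.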
We are motivated by the image classification task, which is basic to various applications and devices. We test FML on three datasets with five types of models. The three datasets are MNIST, CIFAR10, and CIFAR100. The five types of models are the multi-layer perceptron (MLP) of McMahan et al. (2016); LeNet5 (LeCun et al., 1990); a convolutional neural network (CNN1) with two 3x3 convolution layers (the first with 6 channels, the second with 16, each followed by 2x2 max pooling and ReLU activation); a convolutional neural network (CNN2) with three 3x3 convolution layers (each with 128 channels, followed by 2x2 max pooling and ReLU activation); and ResNet18 (He et al., 2016).
We divide each dataset into three subsets: a training set, a test set, and a validation set. The overall training set is divided evenly among the clients; for example, 50000 training samples are split into five pieces of 10000 training samples. When a client receives its training data, 90% of them (9000 samples) are used as the private training set and 10% (1000 samples) as the private validation set. The 10000 test samples are used for testing the global model.
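The split arithmetic above can be reproduced with a short sketch. The function name `split_for_clients`, the fixed seed, and the initial shuffle are assumptions made for illustration; the text only specifies an even division among clients followed by a 90%/10% training/validation split.

```python
import random
from typing import Dict, Iterable, List


def split_for_clients(
    indices: Iterable[int], num_clients: int, val_fraction: float = 0.1, seed: int = 0
) -> List[Dict[str, List[int]]]:
    """Divide sample indices evenly among clients, then hold out a validation part."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)  # assumption: shuffle before dividing
    per_client = len(shuffled) // num_clients
    splits = []
    for k in range(num_clients):
        piece = shuffled[k * per_client:(k + 1) * per_client]
        n_val = int(round(len(piece) * val_fraction))
        splits.append({"train": piece[n_val:], "val": piece[:n_val]})
    return splits


splits = split_for_clients(range(50000), num_clients=5)
print(len(splits[0]["train"]), len(splits[0]["val"]))  # 9000 1000
```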
We train five clients in both the IID and Non-IID settings. In the IID setting, the training data are shuffled before being partitioned and are randomly distributed to the clients, so the samples on every client follow the same distribution. In the Non-IID setting, we sort the data by label and randomly assign each client 2 partitions that together include only two classes, so the samples on each client follow a different label distribution, which is a case of label distribution skew (Kairouz et al., 2019). This extreme distribution lets us explore the degree to which our algorithms will break on highly Non-IID data. Both of these partitions are balanced.
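The label-sorted Non-IID partition can be sketched as follows. The function name `partition_by_label`, the seed, and the toy label list are hypothetical; the sketch assumes that the number of shards is the number of clients times two, which with five clients and ten balanced classes gives one class per shard.

```python
import random
from typing import List, Sequence


def partition_by_label(
    labels: Sequence[int], num_clients: int, shards_per_client: int = 2, seed: int = 0
) -> List[List[int]]:
    """Sort sample indices by label, cut them into shards, and deal shards to clients."""
    rng = random.Random(seed)
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_clients * shards_per_client
    shard_size = len(order) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng.shuffle(shards)  # random assignment of shards to clients
    return [
        [i for shard in shards[k * shards_per_client:(k + 1) * shards_per_client] for i in shard]
        for k in range(num_clients)
    ]


# Toy data: 1000 samples with 10 perfectly balanced classes.
labels = [i % 10 for i in range(1000)]
client_indices = partition_by_label(labels, num_clients=5)
for idx in client_indices:
    print(len(idx), sorted({labels[i] for i in idx}))
```

Each client ends up with the same number of samples but only two classes. With classes of unequal size, a shard can straddle a class boundary, so a client may then see a few samples of a third class.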
To validate the performance of FML, we first use the same model architecture for all clients in FML, in order to compare with FedAvg, which aims to train a single model. We run FML and FedAvg on five models with the same architecture over IID and Non-IID private training data. Table 1 shows that FML achieves better performance than FedAvg on the test set in all experiments.
In the Non-IID setting, the overall performance is worse than in the IID setting, as shown in Table 2. For FML, Non-IIDness is a feature that lets clients be served better by their local models. Figure 5 compares the performance of the local models of FedAvg and FML on the private validation sets in the IID and Non-IID settings.
In addition, we empirically demonstrate that FML needs fewer communication rounds than FedAvg, as shown in Figure 5. The remaining results are given in the supplementary material.
This paper focuses on three types of heterogeneity, namely data, model, and objective, in the canonical federated learning algorithm (FederatedAveraging). We present a new federated learning paradigm, Federated Mutual Learning (FML), which deals with Non-IIDness, distinct model architectures, and the ambiguous objective. Clients in FML can design their customized models and train them independently, which makes the Non-IIDness of data a feature that lets clients be served better individually. Local customized models benefit from collaborative training without compromising their personalities. The global model is regarded as a meta-learner that can achieve better performance after local adaptation for new participants. The experiments show that FML offers better performance, robustness, and communication efficiency than the alternatives.
Since FML can deal with model heterogeneity, clients may train models with different dimensions and architectures, and the capabilities of these models may differ. In our experiments, we observed in some cases a phenomenon known from nature and society, the catfish effect: models with low capability (sardines) can be improved by a model with high capability (catfish), compared with FML among sardines only. Conversely, if there is a badly trained model in FML, the overall performance of the other clients is hardly affected. This property may inspire future research on adversarial training in federated learning.
In our experiments, the proportions of cross-entropy loss and KL loss for the local model ($\alpha$) and the meme model ($\beta$) are fixed. However, we find these proportions significantly important for training the local and global models. Dynamic $\alpha$ and $\beta$ at different stages of training can improve both global and local performance. In our experience, the improvement of the local model is attributable to a well-trained global model at the later stage, whereas the improvement of the global model is attributable to well-trained local models at the early stage. Thus, a larger in early stage and a larger in later stage is preferred.
Privacy and Fairness
In this paper, we introduce the notion of model privacy. FML allows customized models, which are also the private property of individuals, so a local customized model should be protected from being stolen. In addition, the weighted-average term discussed in Section 3.3 is abandoned in consideration of privacy and fairness. On the one hand, the number of samples on each client should not be exposed to the central server, since an attacker might use it as auxiliary information for stealing private data; on the other hand, differing sample counts lead to a fairness problem: a client with a large number of samples takes a large part in model training, which is not appropriate in some applications. Instead, we abandon this term and treat each client, rather than each sample, as equal.
Federated learning is not only a technical standard but also a “win-win” business model. As an underlying technology for AI development, federated learning can drive cross-disciplinary, enterprise-level data cooperation and help enterprises participate in globalization. Cross-silo federated learning can benefit from Federated Mutual Learning (FML), since clients may have fairly sufficient data and need to design customized local models. By contrast, cross-device federated learning might not be suitable for FML, since clients may lack sufficient data and may not require customized models.
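The fairness argument can be illustrated by comparing the share of the aggregate that each client controls under the two schemes. The client sizes below are hypothetical numbers chosen for illustration, and the function names are assumptions of this sketch.

```python
from typing import List, Sequence


def sample_weighted_shares(client_sizes: Sequence[int]) -> List[float]:
    """Share of each client under sample-weighted averaging: n_k / n."""
    total = sum(client_sizes)
    return [n_k / total for n_k in client_sizes]


def equal_shares(num_clients: int) -> List[float]:
    """Share of each client when every client counts equally: 1 / K."""
    return [1.0 / num_clients] * num_clients


# Hypothetical community: one large client and two small ones.
client_sizes = [9000, 500, 500]
print(sample_weighted_shares(client_sizes))  # [0.9, 0.05, 0.05]
print(equal_shares(len(client_sizes)))
```

Under sample-weighted averaging the large client determines 90% of the merged model, and the server must learn every client's sample count to compute the shares. Equal shares need only the number of participating clients.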
- McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
- Kairouz et al. (2019) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
- Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
- Liang et al. (2020) Paul Pu Liang, Terrance Liu, Liu Ziyin, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
- Gao et al. (2019) Dashan Gao, Ce Ju, Xiguang Wei, Yang Liu, Tianjian Chen, and Qiang Yang. Hhhfl: Hierarchical heterogeneous horizontal federated learning for electroencephalography. arXiv preprint arXiv:1909.05784, 2019.
- Wu et al. (2019) Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In
- Wang et al. (2018) Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
- Chen et al. (2019) Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang, Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3514–3522, 2019.
- Li et al. (2019a) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019a.
- Li et al. (2019b) Xiang Li, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Communication efficient decentralized training with multiple local updates. arXiv preprint arXiv:1910.09126, 2019b.
- Lian et al. (2017) Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
- Smith et al. (2017) Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.
- Khodak et al. (2019) Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915–5926, 2019.
- Li and Wang (2019) Daliang Li and Junpu Wang. Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
- Yu et al. (2020) Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758, 2020.
- Jiang et al. (2019) Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019.
- Zhang et al. (2018) Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- LeCun et al. (1990) Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.