FedMe: Federated Learning via Model Exchange

by   Koji Matsuda, et al.
Osaka University

Federated learning is a distributed machine learning method in which a single server and multiple clients collaboratively build machine learning models without sharing datasets on clients. Numerous methods have been proposed to cope with the data heterogeneity issue in federated learning. Existing solutions require a model architecture tuned by the central server, yet a major technical challenge is that it is difficult to tune the model architecture due to the absence of local data on the central server. In this paper, we propose Federated learning via Model exchange (FedMe), which personalizes models with automatic model architecture tuning during the learning process. The novelty of FedMe lies in its learning process: clients exchange their models for model architecture tuning and model training. First, to optimize the model architectures for local data, clients tune their own personalized models by comparing to exchanged models and picking the one that yields the best performance. Second, clients train both personalized models and exchanged models by using deep mutual learning, in spite of different model architectures across the clients. We perform experiments on three real datasets and show that FedMe outperforms state-of-the-art federated learning methods while tuning model architectures automatically.




1. Introduction

With the growing popularity of mobile devices such as smartphones and tablets, an unprecedented amount of personal data has been generated. Such personal data are helpful to build machine learning models on a variety of applications such as action recognition (Anguita et al., 2013), next-word prediction (Hard et al., 2018), and wake word detection (Leroy et al., 2019). However, due to the concerns raised by data privacy and network bandwidth limitation, it is impractical to collect all local data from clients and train models in a centralized manner. To address the privacy concerns and network bandwidth bottleneck, federated learning has emerged as a decentralized learning paradigm to build a model without sharing local data on clients (McMahan et al., 2017).

Federated learning builds a model with a single server and multiple clients in a collaborative manner. Its general procedure consists of two steps: (1) client learning, in which clients train models on their local data and send their trained models to the server, and (2) model aggregation, in which the server aggregates those models to build a global model and distributes the global model to the clients. These two steps are repeated until the global model converges. This procedure effectively uses clients’ local data by sharing their trained models.
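The two-step procedure above can be illustrated with a minimal sketch. It assumes a toy one-parameter linear model and size-weighted averaging; the function names (`local_train`, `aggregate`, `federated_rounds`) and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (feature, label) pairs held by one client


def local_train(w: float, data: Data, lr: float = 0.1, epochs: int = 5) -> float:
    """Client learning: fit y ~ w * x on local data with plain gradient descent."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def aggregate(models: List[float], sizes: List[int]) -> float:
    """Model aggregation: average client models, weighted by local data size."""
    total = sum(sizes)
    return sum(w * n for w, n in zip(models, sizes)) / total


def federated_rounds(clients: List[Data], rounds: int = 20) -> float:
    """Repeat client learning and model aggregation for a fixed number of rounds."""
    w_global = 0.0
    for _ in range(rounds):
        # Raw data never leaves a client; only trained parameters are sent.
        local_models = [local_train(w_global, data) for data in clients]
        w_global = aggregate(local_models, [len(d) for d in clients])
    return w_global


# Toy data (an assumption for illustration): both clients follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0), (3.0, 6.0)],
]
w = federated_rounds(clients)
```

Because both toy clients share the same underlying relation, the global parameter approaches 2; with non-IID clients a single averaged model would generally fit none of them exactly, which is the issue discussed next.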

A challenge of federated learning.

One of the challenges in federated learning is data heterogeneity: clients have local data that follow different distributions, i.e., the data are not independent and identically distributed (IID). This makes it difficult to learn a single global model that is optimal for every client. Indeed, it has been reported that, in typical federated learning methods, the parameters of a global model diverge when each client has non-IID local data (Li et al., 2020, 2019). Personalized federated learning methods have been proposed to deal with data heterogeneity (Mansour et al., 2020; Shen et al., 2020; T. Dinh et al., 2020; Zhang et al., 2021). These methods aim to build personalized models, i.e., models optimized for individual clients.

We have the following research questions for building optimal personalized models:

  • How to determine the model architectures of personalized models? In existing personalized federated learning methods, the server must tune model architectures in advance. Since the server is unaware of the local data distributions on clients, it needs to train multiple models with different architectures to tune model architectures remotely. However, this process incurs high communication costs between the server and clients, making it impractical. Recently, an automatic architecture tuning method was proposed that modifies model architectures during the learning process (He et al., 2020a). In this method, the server tunes the architecture of a single global model. However, the architecture tuned by the server is unlikely to be optimal for every client, and the server cannot evaluate the accuracy of the tuned model on the clients' local data. Therefore, each client should individually tune its model architecture, which may differ across clients due to the non-IID data (see Table 3 in our experimental study).

    To the best of our knowledge, there are no personalized federated learning methods that can automatically tune the model architecture during the learning process. Since each client is unaware of the local data on the other clients, we need means of leveraging other clients’ models to tune model architectures.

  • How does each client leverage other clients’ models with different architectures to improve its model accuracy? The server may not aggregate personalized models because their model architecture may differ across the clients. It is not effective to rely on the aggregation of models for leveraging other clients’ models. So, we need additional means of leveraging local data and models with different architectures.

Contributions. In this paper, we propose a novel federated learning method, federated learning via model exchange (FedMe for short). We introduce the notion of exchanged models, i.e., each client can receive models sent from other clients. The clients can then tune model architectures and train their models by utilizing the exchanged models. The learning process of FedMe addresses the aforementioned research questions. First, clients tune their model architectures based on the performance of exchanged models. To optimize the model architecture for local data, each client compares its own personalized model with the exchanged model and picks the one that yields the best performance. In this way, clients can automatically and autonomously modify their model architectures. Second, clients train both their own and exchanged models to improve both, and the server aggregates the trained models that originate from the same client. We use two techniques for model training: deep mutual learning (Zhang et al., 2018) and model clustering. Deep mutual learning trains two models simultaneously by having each model mimic the outputs of the other, regardless of model architecture. Model clustering selects similar personalized models as exchanged models for each client, which prevents models from overfitting the noise caused by deep mutual learning. In doing so, the aggregated models can reflect the local data on other clients because they are trained by using other clients’ local data and the exchanged models.

We evaluate the performance of FedMe by comparing with state-of-the-art methods on three real datasets. Our experiments show that FedMe achieves higher accuracy than state-of-the-art methods even if we manually tune these methods for their best model architecture. Another interesting takeaway of the evaluation is that traditional federated learning methods with fine-tuning can build highly accurate personalized models on clients, which is not evaluated fairly in existing studies.

Organization. The remainder of this paper is organized as follows. In Section 2, we review related work. In Section 3, we define the problem. We then present our proposed method, FedMe, in Section 4, and report our empirical evaluation results in Section 5. In Section 6, we summarize the paper and discuss future work.

2. Related Work

Federated learning has been actively studied since McMahan et al. introduced it (McMahan et al., 2017). Several survey papers summarize studies of federated learning (Kairouz et al., 2019; Lim et al., 2020; Mothukuri et al., 2021).

Numerous federated learning methods have been proposed recently, so we describe only typical methods due to the page limitation. We review federated learning methods from three points of view: (1) data heterogeneity, (2) personalization, and (3) model architecture tuning. Methods for data heterogeneity aim to build appropriate models in environments in which clients have non-IID local data. Methods for personalization aim to build optimal personalized models for each client. Personalization has two types, homogeneous and heterogeneous, in which the model architectures of all personalized models are the same and different, respectively. Methods for model architecture tuning aim to tune the model architecture automatically.

The most basic method in federated learning is FedAvg (McMahan et al., 2017), which aggregates all trained models of clients by averaging their model parameters to build a single global model. Because the accuracy of FedAvg decreases under data heterogeneity, many methods have extended FedAvg to deal with data heterogeneity, such as FedMA (Wang et al., 2020) and HierFAVG (Liu et al., 2019). However, since these methods build a single global model by aggregating trained models, it is difficult for them to achieve high accuracy for every client with that single model.

Personalized federated learning methods have been proposed to build different models for each client (Mansour et al., 2020; T. Dinh et al., 2020; Zhang et al., 2021). These methods can increase accuracy compared with methods that build only a single global model. We first review homogeneous personalization methods, which build personalized models with different parameters but the same model architecture. Mansour et al. proposed HypCluster and MAPPER (Mansour et al., 2020). In HypCluster, the server prepares several global models and distributes them to clients. Each client trains only the model that has the highest accuracy and sends it back to the server. Then, the server aggregates the trained copies of each model into new global models. In MAPPER, each client computes balancing weights for the global model and its trained model and then takes a weighted sum of their parameters. T. Dinh et al. proposed pFedMe (T. Dinh et al., 2020), which builds global and local models by regularizing the clients' loss functions with Moreau envelopes. Because these homogeneous personalization methods require the same model architecture for all personalized models, they cannot personalize model architectures. We note that these methods do not use fine-tuning (i.e., after the models are finalized in the learning process, clients do not re-train them on their local data), nor are they compared against simple methods with fine-tuning (e.g., FedAvg with fine-tuning). Our experiments show that most methods have lower accuracy than FedAvg with fine-tuning.

There are heterogeneous personalization methods that build personalized models with different parameters and architectures across clients (Li and Wang, 2019; Shen et al., 2020). Clients can choose arbitrary model architectures depending on the size of their local data and their computation resources. Shen et al. proposed Federated Mutual Learning (FML) (Shen et al., 2020). The server in FML distributes the global model, and clients train both the global model and their personalized models by deep mutual learning. We use an idea similar to FML for client training, but FedMe does not build a global model. Li and Wang proposed FedMD (Li and Wang, 2019), which incorporates knowledge distillation into federated learning. FedMD needs public data, i.e., a dataset that is similar to the local data and can be used by the server and all clients. These heterogeneous personalization methods can build personalized models with different architectures for each client. However, the server and clients need to determine model architectures before the learning process. In addition, in our experiments, these methods cannot achieve higher accuracy than non-personalized methods.

Model architecture search, which searches for the best model architecture within a predefined search space (e.g., layer types and the maximum number of layers), is an active topic in deep learning (He et al., 2020b; Zoph and Le, 2017). FedNAS supports network architecture search in federated learning (He et al., 2020a). Although it automatically tunes the model architecture, it aims to build a single global model rather than personalized models.

In summary, our method FedMe is the first method that addresses data heterogeneity, heterogeneous personalization, and model architecture tuning together.

3. Problem Definition

In this section, we describe our problem definition. The notation used in this paper is summarized in Table 1.

Given a classification task, a server and a set of clients collaboratively build personalized models for the clients. Let $S$ denote the set of clients. The number of clients is denoted by $|S|$. We use a subscript $i$ for the index of the $i$-th client. For example, $D_i$ is the local data of client $i$, and $n_i$ is the size of $D_i$ (i.e., the number of records). $N$ denotes the sum of $n_i$ over all clients. $x_i$ and $y_i$ are the features and labels of records contained in the local data, respectively. We assume classification tasks, so $y_i$ is assigned a single class among $M$ classes. $T$ and $E$ are the total numbers of global communication rounds and local training epochs, respectively. Global communication means communication between the server and clients during training. Local training means that each client trains models on its local data. $t$ denotes the index of a global communication round. $w_{p_i}^t$ and $w_{ex_i}^t$ are the personalized and exchanged models of client $i$ in round $t$, respectively. $I(w)$ is the index of the original client of model $w$. For instance, $I(w)$ returns $i$ if $w$ is a personalized model of client $i$.


Symbol          Description
$S$             a set of clients
$i$             an index of clients
$D_i$           $i$-th client’s local data
$n_i$           the size of $D_i$
$x_i$, $y_i$    a feature and label sampled from $D_i$, resp.
$M$             the number of classes
$T$             the number of global communication rounds
$t$             an index of global communication rounds
$E$             the number of local training epochs
$w_{p_i}^t$     a personalized model of client $i$ at round $t$
$w_{ex_i}^t$    an exchanged model of client $i$ at round $t$
$I(w)$          an index of the original client of $w$
$C_k$           models in cluster $k$
$K^t$           the number of clusters at round $t$


Table 1. Summary of notation used in this paper.
Figure 1. FedMe framework.

In FedMe, each client builds its personalized model instead of a single global model. We define the optimization problem as follows:

$$\min_{w_{p_1}, \ldots, w_{p_{|S|}}} \; \sum_{i \in S} F_i(w_{p_i})$$

$F_i$ is the personalized objective for client $i$, and is defined as follows:

$$F_i(w) = \mathbb{E}_{(x_i, y_i) \sim D_i} \left[ f_i(w; x_i, y_i) \right]$$

where $f_i : \mathbb{R}^d \to \mathbb{R}$ is the loss function of client $i$, corresponding to $x_i$ and $y_i$. $\mathbb{R}^d$ is the space of models, and $d$ is not fixed. This optimization problem is similar to that of (Zhang et al., 2021). In (Zhang et al., 2021), since the model architecture of the personalized model is fixed, the size of $d$ is predetermined and fixed. In contrast, the size of $d$ in our problem is also optimized, which indicates that our problem aims to optimize the model architectures of personalized models. By solving this optimization problem, we can build optimal personalized models for each client.

4. Methodology

In this section, we describe our proposed method, FedMe. We first explain the overall idea and the framework of FedMe. After that, we present the algorithm and its technical components in detail. Finally, we show a concrete example of our algorithm.

4.1. Idea and Framework

FedMe is a heterogeneous personalized federated learning method with automatic model architecture tuning. Recall our research challenges: how to automatically tune optimal model architectures for clients, and how clients can use models with different architectures to improve their own models. Our idea for solving these challenges is simple: clients receive models of other clients to guide model architecture tuning, and send their models to other clients so that the models are trained on those clients' local data. In other words, clients exchange their models for model architecture tuning and model training.

FedMe effectively leverages exchanged models in the following ways. First, clients tune their personalized models based on the performance of exchanged models. More concretely, a client replaces its personalized model with the exchanged model if the exchanged model has a smaller loss on its local data than its personalized model. Each client can thus automatically tune its model architecture so that the accuracy on its local data improves. Second, clients train both their personalized and exchanged models, and the server aggregates the trained models that originate from the same client. This makes it possible to train personalized models with different architectures. Third, clients simultaneously and effectively train both the personalized and exchanged models by using deep mutual learning and model clustering. Deep mutual learning trains two models simultaneously by having each model mimic the outputs of the other, regardless of the model architecture. When the two models have been trained on significantly different local data, the output of the other model may act as noise, and a model may overfit this noise (Chen and Chao, 2021). To prevent models from overfitting the noise, the server performs model clustering to select models with similar outputs. Model clustering groups models into subsets that have similar outputs by using the k-means method (MacQueen and others, 1967).

Figure 1 shows the framework of FedMe. FedMe has five learning processes: (0) each client creates its personalized model with an arbitrary model architecture, (1) clients send their personalized models to the server, (2) the server decides exchanged models for the clients based on model clustering and sends the exchanged models to clients, (3) clients train both their personalized and exchanged models by deep mutual learning and tune their personalized models based on the performance of the exchanged models, and (4) after clients send back the trained exchanged and personalized models, the server aggregates the personalized and exchanged models for all clients and then sends the aggregated personalized models to clients. FedMe repeats (1)–(4) until the number of global communication rounds reaches a given threshold $T$.

4.2. Algorithm

In this section, we describe the learning procedure of FedMe. The pseudo-code of FedMe is shown in Algorithm 1. After clients initialize their models (line 1), FedMe starts its learning process. First, clients send their personalized models to the server (line 4). The server clusters the models using unlabeled data (line 6), and each client receives, as an exchanged model, a model that belongs to the same cluster as its own (lines 7–9). Each client then trains the personalized and exchanged models (line 11) and determines the index of its new personalized model (line 13). Each client sends its two trained models, $w_{p_i}^t$ (personalized) and $w_{ex_i}^t$ (exchanged), to the server (line 14). The server aggregates each model by averaging its parameters (line 16) and sends the aggregated models to the clients based on the selected indices (line 18). These steps are repeated until the number of global communication rounds reaches $T$.

We explain detailed procedures of initialization, model training, model tuning, model clustering, and model aggregation in the following.

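Input: number of global communication rounds $T$, number of local training epochs $E$, set of clients $S$ and their local data $\{D_i\}_{i \in S}$, unlabeled data $U$, numbers of clusters $\{K^t\}$, learning rate $\eta$

Output: personalized models $\{w_{p_i}^T\}_{i \in S}$

1: Initialize($w_{p_i}^0$) on all clients $i \in S$
2: for $t = 0, \ldots, T-1$ do
3:     for $i \in S$ do
4:         Client $i$ sends $w_{p_i}^t$ to server
5:     end for
6:     $\{C_k\}$ ← ModelClustering($\{w_{p_i}^t\}_{i \in S}$, $U$, $K^t$)
7:     for $i \in S$ do
8:         Server selects $w_{ex_i}^t$ at random from the cluster containing $w_{p_i}^t$
9:         Server sends $w_{ex_i}^t$ to client $i$
10:         for $e = 1, \ldots, E$ do
11:              $w_{p_i}^t, w_{ex_i}^t$ ← ModelTraining($w_{p_i}^t$, $w_{ex_i}^t$, $D_i$)
12:         end for
13:         $idx_i^t$ ← ModelTuning($w_{p_i}^t$, $w_{ex_i}^t$, $D_i$)
14:         Client $i$ sends $w_{p_i}^t$ and $w_{ex_i}^t$ to server
15:     end for
16:     $\{\bar{w}_j^t\}_{j \in S}$ ← ModelAggregation($\{w_{p_i}^t\}_{i \in S}$, $\{w_{ex_i}^t\}_{i \in S}$)
17:     for $i \in S$ do
18:         Server sends aggregated $\bar{w}_{idx_i^t}^t$ to client $i$ as $w_{p_i}^{t+1}$
19:     end for
20: end for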
Algorithm 1 Algorithm of FedMe.

Initialization. FedMe first requires each client to initialize its model architecture. Since clients can use arbitrary model architectures in FedMe, they can determine their model architecture depending on their local data. For example, each client can start from the architecture that performs best on its own local data. Alternatively, the server can choose arbitrary models and distribute them to clients.

Model Training. In FedMe, each client exchanges its personalized model, and thus the model is trained on the local data of multiple clients, which enables training models even when clients have models with different architecture.

Each client trains personalized and exchanged models on its local data by deep mutual learning. Deep mutual learning between personalized and exchanged models improves accuracy compared to training them independently. Indeed, it is known that deep mutual learning effectively improves the inference performance of models when we use numerous models for training (Zhang et al., 2018). Therefore, deep mutual learning has significant benefits on the learning process of FedMe.

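We define loss functions $L_{p_i}$ and $L_{ex_i}$ of the personalized and exchanged models, respectively, as follows:

$$L_{p_i} = -\sum_{m=1}^{M} \mathbb{1}(y_i, m) \log p_{p_i}^{m}(x_i) + \sum_{m=1}^{M} p_{ex_i}^{m}(x_i) \log \frac{p_{ex_i}^{m}(x_i)}{p_{p_i}^{m}(x_i)}$$

$$L_{ex_i} = -\sum_{m=1}^{M} \mathbb{1}(y_i, m) \log p_{ex_i}^{m}(x_i) + \sum_{m=1}^{M} p_{p_i}^{m}(x_i) \log \frac{p_{p_i}^{m}(x_i)}{p_{ex_i}^{m}(x_i)}$$

where $p_{p_i}^{m}$ and $p_{ex_i}^{m}$ are the predictions of the personalized and exchanged models for class $m$, respectively. The first and second terms of these equations are the cross-entropy error and the Kullback-Leibler (KL) divergence, respectively. The function $\mathbb{1}(y_i, m)$ returns 1 if $y_i = m$ and returns 0 otherwise.

To minimize the above loss functions, client $i$ updates the two models:

$$w_{p_i}^t \leftarrow w_{p_i}^t - \eta \nabla L_{p_i}, \qquad w_{ex_i}^t \leftarrow w_{ex_i}^t - \eta \nabla L_{ex_i}$$

where $\eta$ is the learning rate, and $\nabla L_{p_i}$ and $\nabla L_{ex_i}$ are the gradients of the personalized and exchanged models, respectively.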
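The two losses can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: it follows the usual deep mutual learning formulation (Zhang et al., 2018), in which each model's loss is its cross-entropy plus the KL divergence from the other model's prediction to its own, and the helper names (`softmax`, `kl_divergence`, `mutual_losses`) are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw model outputs into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    """Cross-entropy for a single record whose true class is `label`."""
    return -math.log(probs[label])


def kl_divergence(p: List[float], q: List[float]) -> float:
    """D_KL(p || q) = sum_m p_m * log(p_m / q_m)."""
    return sum(pm * math.log(pm / qm) for pm, qm in zip(p, q) if pm > 0.0)


def mutual_losses(
    logits_p: List[float], logits_ex: List[float], label: int
) -> Tuple[float, float]:
    """Return (loss of personalized model, loss of exchanged model) for one record."""
    p_p = softmax(logits_p)
    p_ex = softmax(logits_ex)
    # Assumed direction: each model is pulled toward the other model's prediction.
    loss_p = cross_entropy(p_p, label) + kl_divergence(p_ex, p_p)
    loss_ex = cross_entropy(p_ex, label) + kl_divergence(p_p, p_ex)
    return loss_p, loss_ex


lp_same, lex_same = mutual_losses([0.0, 0.0], [0.0, 0.0], label=0)
lp_diff, lex_diff = mutual_losses([2.0, 0.0], [0.0, 2.0], label=0)
```

When both models produce identical predictions, the KL terms vanish and each loss reduces to plain cross-entropy. Only the class-probability vectors enter the losses, which is why the two models do not need to share an architecture.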

Model Tuning. Clients can use models with any architecture, which enables them to modify their models freely. In the learning process of FedMe, each client has many opportunities to improve its model because it receives models of other clients as exchanged models at each global communication round. If an exchanged model achieves higher performance than the client's current model, the client tunes its model based on the exchanged model.

FedMe does not restrict the means of model architecture tuning. In this paper, to validate the design of FedMe, we use a simple tuning method in which a client replaces its personalized model with the exchanged model. More concretely, after each client trains the models through deep mutual learning, it selects either its personalized or its exchanged model at round $t$. FedMe computes $idx_i^t$, which represents the index of the personalized model that client $i$ selects, as follows:

$$idx_i^t = \begin{cases} I(w_{ex_i}^t) & \text{if } L_i(w_{ex_i}^t) < L_i(w_{p_i}^t) \\ I(w_{p_i}^t) & \text{otherwise} \end{cases}$$

where $L_i(w)$ denotes the loss of model $w$ on the local data of client $i$. In this equation, each client compares the loss of the personalized and exchanged models and then replaces the personalized model with the exchanged model if the exchanged model has a smaller loss than the personalized model.
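The replacement rule amounts to a single comparison. A minimal sketch follows; the function name `select_model_index` and its argument names are hypothetical, and ties are resolved in favor of the current personalized model because the text replaces it only when the exchanged model has a strictly smaller loss.

```python
def select_model_index(
    loss_personalized: float,
    loss_exchanged: float,
    idx_personalized: int,
    idx_exchanged: int,
) -> int:
    """Return the index of the model the client keeps as its personalized model.

    The exchanged model wins only when its loss on the client's local data
    is strictly smaller than that of the current personalized model.
    """
    if loss_exchanged < loss_personalized:
        return idx_exchanged
    return idx_personalized


# Client 1 holds its own model (index 1) and an exchanged model from client 3.
kept = select_model_index(0.50, 0.40, idx_personalized=1, idx_exchanged=3)
```

Since the returned index identifies a whole model, adopting the exchanged model also adopts its architecture, which is how this simple rule performs architecture tuning.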

Of course, we can use other tuning methods instead of replacement, for example increasing the number of layers. Additionally, although we here use the loss to tune the model architecture, each client can have its own criteria, such as model size and inference time. We leave optimal model tuning methods for FedMe as future work.

Model Clustering. Due to the data heterogeneity among clients, the outputs of the personalized models differ among clients. If clients perform deep mutual learning between models with significantly different outputs, the models may overfit the noise. In FedMe, models are clustered based on their outputs, and each client receives a model with similar output as an exchanged model from the server.

Model clustering reduces the difference between the output of a client's own model and that of the other model, thus preventing overfitting the noise (Gao et al., 2017). On the other hand, continuously training models with similar outputs may reduce their generality. Therefore, in the early stages of training, we do not perform model clustering, in order to increase the generality of the models. As training progresses, we increase the number of clusters in the model clustering. In this way, the models can be personalized without overfitting while maintaining their generality.
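One way to express such a schedule is a step function of the round index. The sketch below is an assumption for illustration: the name `num_clusters`, the starting value of one cluster (which makes clustering a no-op in early rounds), and the example rounds are hypothetical rather than the paper's settings.

```python
from bisect import bisect_right
from typing import Sequence


def num_clusters(round_idx: int, increase_rounds: Sequence[int], initial: int = 1) -> int:
    """Number of clusters used at a given global communication round.

    Starts at `initial` and grows by one at every round listed in
    `increase_rounds` (a round counts once round_idx has reached it).
    """
    return initial + bisect_right(sorted(increase_rounds), round_idx)


# Hypothetical schedule: add a cluster at rounds 10 and 20.
schedule = [num_clusters(t, [10, 20]) for t in (0, 9, 10, 19, 20, 25)]
```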

Since federated learning does not share local data, we cannot use local data for model clustering. Therefore, FedMe assumes that the server has access to unlabeled data, as in one-shot federated learning (Guha et al., 2019), and uses the unlabeled data as the input to the models.

We use the k-means method (MacQueen and others, 1967) to cluster models. The server first computes the outputs of the models on the unlabeled data. The server then applies k-means to those outputs and divides the models into $K^t$ clusters.

In the model exchange, each client receives from the server, as its exchanged model, a model chosen at random from the same cluster as its own personalized model. If there is only one model in the cluster, the client receives a model chosen at random from the other clusters.
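The clustering and exchange steps can be sketched as follows. This is a simplified illustration under assumptions: each model is represented by the flattened vector of its outputs on the unlabeled data, k-means is a plain Lloyd's iteration with random initial centers, a client is assumed never to receive its own model, and the names `kmeans` and `choose_exchanged` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every model-output vector.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def choose_exchanged(labels: List[int], seed: int = 0) -> Dict[int, int]:
    """Map each client to the client whose model it receives as exchanged model."""
    rng = random.Random(seed)
    n = len(labels)
    exchanged = {}
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        others = [j for j in range(n) if j != i]
        # Fall back to a random model from another cluster for singleton clusters.
        exchanged[i] = rng.choice(same if same else others)
    return exchanged


# Toy outputs of four models on the unlabeled data (hypothetical values).
outputs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(outputs, k=2)
pairs = choose_exchanged([0, 0, 1])
```

Clustering on outputs rather than parameters is what allows models with different architectures to be compared: the output vectors have the same length for every model, whereas the parameter vectors do not.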

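Model Aggregation. Client $i$ trains $w_{p_i}^t$ and $w_{ex_i}^t$ simultaneously, so several trained copies of the same personalized model exist on different clients. Therefore, it is necessary to aggregate all of them into a new model for each client. FedMe aggregates the models by averaging the model parameters as in FedAvg:

$$\bar{w}_j^t = \frac{1}{1 + N_j^t} \left( w_{p_j}^t + \sum_{i \in S} \delta_{i,j}^t \, w_{ex_i}^t \right)$$

where $N_j^t = \sum_{i \in S} \delta_{i,j}^t$ is the total number of clients that receive $w_{p_j}^t$ as the exchanged model. Also, $\delta_{i,j}^t$ represents which clients received the personalized model $w_{p_j}^t$ as an exchanged model and is defined by the following equation:

$$\delta_{i,j}^t = \begin{cases} 1 & \text{if } I(w_{ex_i}^t) = j \\ 0 & \text{otherwise} \end{cases}$$

The model parameters are averaged for each client’s personalized model separately, so the aggregation is independent of differences in model architecture.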
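A minimal sketch of this per-model aggregation is shown below. It assumes an unweighted average over all trained copies of a model and represents each model as a flat list of parameters; the function name `aggregate_by_origin` and the toy values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_by_origin(
    trained: List[Tuple[int, List[float]]]
) -> Dict[int, List[float]]:
    """Average all trained copies that originate from the same client's model.

    `trained` holds one (origin_client_index, parameters) pair per trained copy,
    covering both personalized and exchanged models returned by the clients.
    """
    groups: Dict[int, List[List[float]]] = defaultdict(list)
    for origin, params in trained:
        groups[origin].append(params)
    # Copies of one model share an architecture, so element-wise averaging is valid.
    return {
        origin: [sum(vals) / len(copies) for vals in zip(*copies)]
        for origin, copies in groups.items()
    }


# Model 0 (two parameters) was trained on two clients; model 1 (one parameter) on one.
aggregated = aggregate_by_origin([(0, [1.0, 2.0]), (0, [3.0, 4.0]), (1, [5.0])])
```

Models with different origins are never averaged with each other, so their parameter vectors may have different lengths, as in the toy input above.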

4.3. Running Example

Figure 2. A running example of FedMe.

We explain the FedMe algorithm using concrete examples. We assume that the number of clients is five and the number of global communication rounds is two. The number of clusters is initially one and increases by one at each global communication round. Figure 2 illustrates the procedures of FedMe at the first and second global communication rounds.

Initialization: Clients initialized their personalized models. In this example, each client selects model architectures depending on their local data and sets up each of these models as an initial personalized model as .

First round: Clients send their personalized models to the server. In the first global communication round, since the number of clusters is one, the server randomly selects exchanged models for each client. In this example, clients receive , , , , and from the server as their exchanged models, respectively.

Each client trains its personalized and exchanged models by deep mutual learning, and update , respectively. Next, each client compares the loss of the two models on its local data. Suppose that the personalized model of clients , , and have smaller losses than their exchanged models, while the exchanged models of other clients have smaller losses. Thus, is , , , , and , respectively.

All clients send the two trained models and to the server. The server then aggregates each personalized model. For example, the server aggregates and , by averaging their parameters. The server sends back , , , , and to clients , respectively, according to . Each client sets up each of these models as a new personalized model, .

Second round: We perform the second global communication round. Clients send their personalized models to the server. In the second global communication round, since the number of clusters is two, the server clusters the personalized models into two clusters using unlabeled data, and we assume that , and belong to the same cluster, respectively. Clients receive , , , , and from the server as their exchanged models according to model clustering, respectively.

Then, each client trains its personalized and exchanged models by deep mutual learning and updates to , respectively. We assume that is , , , , and in the second round.

All clients send the two trained models and to the server. The server then aggregates each personalized model. The server sends back , , , , and to clients , respectively, based on , and each client sets up each of these models as its new personalized model. are the final personalized models.

5. Experiments

In this section, we test the accuracy of FedMe on three datasets with a high degree of data heterogeneity. In our experiments, we aim to answer the following questions:

  • How accurate is the inference of FedMe compared with the state-of-the-art methods?

  • Does automatic model architecture tuning work well?

  • Which techniques of FedMe affect the accuracy?

  • How fast is the learning process of FedMe compared with the state-of-the-art methods?

  • What is the impact of data heterogeneity and fine-tuning on FedMe?

To simplify the experiments, we use PyTorch (Paszke et al., 2019) to create virtual clients and a server on a single GPU machine.

5.1. Experimental Setup

5.1.1. Datasets, Tasks, and Models

In the experiments, we use three settings: FEMNIST, CIFAR-10, and Shakespeare. These datasets are frequently used in existing works (Chen and Chao, 2021; Li and Wang, 2019; Li et al., 2020; Mansour et al., 2020; McMahan et al., 2017; Wang et al., 2020).

  • FEMNIST: We use the Federated EMNIST-62 dataset (Caldas et al., 2018), which includes images of handwritten characters with labels. This dataset is divided into 3,400 subsets based on writers. We conduct an image classification task.

  • CIFAR-10: We use the CIFAR-10 dataset (Krizhevsky et al., 2009), which includes photo images with 10 labels. We divide the dataset into subsets using the Dirichlet distribution as in (Wang et al., 2020). We set two parameters and to decide the degree of heterogeneity of the size of local data and labels, respectively. We use 0.5 and 10 as and , respectively. We conduct an image classification task.

  • Shakespeare: We use the Shakespeare dataset (Li et al., 2020), which includes lines in “The Complete Works of William Shakespeare”. This dataset is divided into 143 subsets based on actors. We conduct a next-character prediction task, which infers the next character after a given sentence.
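For the CIFAR-10 setting, a Dirichlet-based split can be sketched as follows. This is a commonly used label-skew partition and only an approximation of the setup above: it models heterogeneity of labels alone, the concentration parameter `alpha` and the function names are hypothetical, and the value 0.5 below is purely illustrative.

```python
import random
from typing import List


def dirichlet(alpha: float, n: int, rng: random.Random) -> List[float]:
    """Sample from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def partition_by_label(
    labels: List[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Split record indices across clients with Dirichlet-distributed label shares.

    Smaller alpha concentrates each class on fewer clients (more heterogeneity).
    """
    rng = random.Random(seed)
    client_indices: List[List[int]] = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idxs)
        props = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c, p in enumerate(props):
            cum += p
            # The last client takes the remainder so that no record is dropped.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            client_indices[c].extend(idxs[start:end])
            start = end
    return client_indices


# 1,000 toy records with 10 balanced classes, split across 20 clients.
toy_labels = [i % 10 for i in range(1000)]
parts = partition_by_label(toy_labels, num_clients=20, alpha=0.5)
```

Every record is assigned to exactly one client, while the per-client class mix and local data size vary from client to client.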

Table 2 shows the statistics of the number of records on clients in each dataset. Note that we randomly divide CIFAR-10 in each test, so the statistics of CIFAR-10 are example values.

We use different models for each setting following the existing works (Reddi et al., 2020; Wang et al., 2020). For FEMNIST and Shakespeare, we use CNN and LSTM, respectively (Reddi et al., 2020). For CIFAR-10, we use VGG with the same modification reported in (Wang et al., 2020). In each setting, we use four models varying the number of layers. For CNN and LSTM, we vary the number of convolution and LSTM layers from one to four, and the default value is two. For VGG, we use VGG, VGG, VGG, and VGG, and the default is VGG.


Datasets Total num Mean STD Max Min


Table 2. Datasets Statistics.

5.1.2. Training and test

In our experiments, the number of clients is 20. In FEMNIST and Shakespeare, we select sub data for assigning local data on clients randomly. In CIFAR-10, we randomly divide the whole data into 20 local data. All clients participate in each global communication round following recent works (Wang et al., 2020). We select 1,000 unlabeled data from each dataset. The unlabeled data was excluded from the train and test data. We divide each local data into training and test data by the ratio of 9:1, 8:2, and 5:1 for FEMNIST, CIFAR-10, and Shakespeare, respectively. Furthermore, we divide the training data into for FEMNIST and Shakespeare, and into for CIFAR-10, which are used as training and validation data, respectively.

We set the number of global communication rounds to , , and for FEMNIST, CIFAR-10, and Shakespeare, respectively, and set the local epoch to for all settings. We conduct training and testing five times and report the mean and standard deviation (std) of accuracy over the five runs with different clients.

5.1.3. Baselines and hyperparameter tuning

We compare FedMe with three types of methods: (1) non-personalized federated learning methods, (2) personalized federated learning methods, and (3) non-federated learning methods. For (1), we use FedAvg, and for (2), we use HypCluster, MAPPER, FML, and pFedMe. For (3), we use Local Data Only, in which clients build their models only on their own local data, and Centralized, in which a server collects local datasets from all clients (Centralized can be considered as an oracle). We use fine-tuning on each client for Centralized, FedAvg, HypCluster, and FedMe after building their models. In MAPPER and pFedMe, we do not use fine-tuning since their algorithms include techniques similar to fine-tuning. We implement all methods ourselves except for pFedMe (https://github.com/CharlieDinh/pFedMe), because the code of the other methods is not available.

We explain hyperparameter tuning. The learning rate is optimized for each method by grid search using a grid of . The optimization method is SGD (stochastic gradient descent) with momentum and weight decay . The batch sizes of FEMNIST, CIFAR-10, and Shakespeare are , , and , respectively. In HypCluster, we use two global models. In FedMe, we initialize the model architectures of clients as the most accurate model on Local Data Only among Models 1–4 (see Table 3). We set the range of the number of clusters to ; we initially use the number of clusters as and increase it by 1 at global communication round at , , and for FEMNIST, CIFAR-10, and Shakespeare, respectively.
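
The cluster-count schedule described above (start from an initial number of clusters and increase it by 1 at given rounds, within a fixed range) can be sketched as follows. This is one possible reading of the schedule; the function name and every numeric value below are hypothetical placeholders, not the settings used in the experiments.

```python
def num_clusters_at(round_idx, initial, max_clusters, increase_rounds):
    """Number of clusters used at a given global communication round."""
    # Add one cluster for every scheduled round that has already been reached.
    grown = initial + sum(1 for r in increase_rounds if round_idx >= r)
    # Never exceed the upper end of the allowed range.
    return min(grown, max_clusters)

# Hypothetical settings: start with 1 cluster, at most 3, grow at rounds 10, 20, 25.
schedule = [
    num_clusters_at(t, initial=1, max_clusters=3, increase_rounds=[10, 20, 25])
    for t in range(30)
]
```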


FEMNIST CIFAR-10 Shakespeare


Table 3. Average number of clients that select each model architecture based on their own local data.

5.2. Experimental Results


FEMNIST CIFAR-10 Shakespeare
Local Data Only


Table 4. Test accuracy (mean ± std).

Figure 3. The validation accuracy over time of various methods. Panels: (a) FEMNIST, (b) CIFAR-10, (c) Shakespeare.

We show experimental results to answer the following five questions.

5.2.1. Q1. How accurate is the inference of FedMe compared with the state-of-the-art methods?

Table 4 and Figure 3 show the accuracy of FedMe and baselines. Table 4 shows average accuracy and standard deviation, and Figure 3 shows the validation accuracy at each global communication round.

From Table 4, we can see that FedMe achieves the highest accuracy among federated learning methods for all settings, and its accuracy is very close to that of Centralized. Note that the standard deviations for FEMNIST and Shakespeare are relatively large because clients differ in each test. FedMe achieves the lowest (or the runner-up) standard deviation among federated learning methods for all settings, so we confirm that FedMe is the most robust among them. This result shows that its learning process is effective.

Comparing the baselines, it is interesting that FedAvg with fine-tuning, which is the simplest method, achieves the highest accuracy among the baselines. This result indicates that data heterogeneity can be addressed by fine-tuning. We show more experiments related to data heterogeneity and fine-tuning in Section 5.2.5.

From Figure 3, we can see that FedAvg and FedMe generally have high accuracy at early rounds. This indicates that FedMe converges as early as FedAvg.

5.2.2. Q2. Does automatic model architecture tuning work well?

We here show how well FedMe tunes the model architecture automatically. Table 5 shows the accuracy of FedMe with fixed model architectures. Models 1–4 are 1–4 CNN layers in FEMNIST, (VGG, VGG, VGG, and VGG) in CIFAR-10, and 1–4 LSTM layers in Shakespeare, respectively. For all settings, the accuracy of auto-tuning lies in the middle of those of Models 1–4. In particular, in FEMNIST, the accuracy of auto-tuning is comparable to the highest accuracy of the pre-determined model architectures. This result indicates that automatic model architecture tuning is effective without pre-defining the model architecture, so we can remove the cost of manually tuning the model architecture.
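
The selection step behind automatic model architecture tuning (a client compares its personalized model with exchanged models and keeps the one that performs best on its own data) can be sketched as below. The function name, the tie-breaking rule, and the accuracy values are assumptions for illustration.

```python
def select_architecture(own_name, validation_accuracy):
    """Keep the current architecture unless a candidate scores strictly higher."""
    best_name = own_name
    for name, acc in validation_accuracy.items():
        if acc > validation_accuracy[best_name]:
            best_name = name
    return best_name

# Hypothetical validation accuracies measured on one client's local validation data.
candidates = {"model_1": 0.71, "model_2": 0.74, "model_3": 0.78, "model_4": 0.78}
chosen = select_architecture("model_2", candidates)
unchanged = select_architecture("model_3", candidates)
```

With a strict comparison, a client switches only when an exchanged architecture is actually better, which avoids needless changes on ties.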


FEMNIST CIFAR-10 Shakespeare


Table 5. Impact of automatic model architecture tuning.


FEMNIST CIFAR-10 Shakespeare
75.85 ± 3.27 88.19 ± 0.53 37.27 ± 3.42
76.46 ± 3.28 89.76 ± 1.21 45.59 ± 4.04
76.29 ± 4.22 86.92 ± 2.63 42.95 ± 5.05
75.87 ± 3.60 88.14 ± 0.93 36.86 ± 2.54
76.13 ± 3.71 86.73 ± 2.54 43.30 ± 1.24
78.76 ± 2.26 89.67 ± 0.87 45.22 ± 3.70
77.64 ± 3.52 89.79 ± 1.23 46.32 ± 3.33
78.52 ± 2.64 89.76 ± 0.90 44.71 ± 1.12


Table 6. Comparison of test accuracy when removing each optimization technique of FedMe. MT, MC, and DML indicate model tuning, model clustering, and deep mutual learning, respectively.


Centralized w/o fine-tuning
Centralized w/ fine-tuning
FedAvg w/o fine-tuning
FedAvg w/ fine-tuning
HypCluster w/o fine-tuning
HypCluster w/ fine-tuning
FedMe w/o fine-tuning
FedMe w/ fine-tuning


Table 7. Impact of data heterogeneity and fine-tuning.

5.2.3. Q3. Which techniques of FedMe impact the inference accuracy?

We investigate the impact of the optimization techniques of FedMe on the accuracy. FedMe uses three optimization techniques: model tuning (MT), deep mutual learning (DML), and model clustering (MC). Table 6 shows the results when FedMe uses the optimization techniques either partially or fully.

From Table 6, we can see that the accuracy of FedMe with all optimization techniques is higher than that of FedMe without any of them. We first see how much the accuracy improves when FedMe uses a single optimization technique. Model tuning has the largest impact among the optimization techniques in all settings: since clients use models that are more accurate than their initial models, the accuracy improves. Deep mutual learning also improves the accuracy except for CIFAR-10. This result indicates that deep mutual learning is effective for mutually training personalized and exchanged models by leveraging the predictions of both models. Unlike model tuning and deep mutual learning, model clustering does not improve the accuracy. We designed model clustering to be combined with deep mutual learning, so model clustering by itself is not effective.
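
To make the role of deep mutual learning concrete, the sketch below computes the two-model loss in the style of Zhang et al. (2018): each model minimizes its own cross-entropy plus the KL divergence from the other model's predictions to its own. How FedMe weights or schedules these terms is not stated in this section, so the equal weighting, the function names, and the toy logits are assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def kl_divergence(p, q):
    """Mean KL(p || q) over the batch."""
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)))

def dml_losses(logits_a, logits_b, labels):
    """Loss of model A and of model B when they learn from each other's predictions."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(p_a, labels) + kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, labels) + kl_divergence(p_a, p_b)
    return loss_a, loss_b

# Toy logits of a personalized model (A) and an exchanged model (B) on two samples.
logits_a = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
logits_b = np.array([[1.0, 1.0, 0.0], [0.0, 0.5, 2.0]])
labels = np.array([0, 2])
loss_a, loss_b = dml_losses(logits_a, logits_b, labels)
```

Because the KL term depends only on output probabilities, the two models do not need to share an architecture, which matches the setting in which clients hold different models.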

Next, we investigate the combinations of optimization techniques. The accuracy of FedMe with two optimization techniques is generally higher than that of FedMe with a single optimization technique. This result indicates that the optimization techniques interact effectively to improve the accuracy. For example, in Shakespeare, FedMe with deep mutual learning and model clustering achieves higher accuracy than FedMe with deep mutual learning alone, although model clustering itself is not effective. However, some combinations decrease the accuracy, for example, DML+MC in FEMNIST and MT+DML in CIFAR-10. This deterioration is caused by the ineffectiveness of the model tuning and model clustering methods. There remain research opportunities to improve the accuracy further, and we leave these tasks as future work.

5.2.4. Q4. How fast is the learning process of FedMe compared with the state-of-the-art methods?

We evaluate the run time of the training phase of each method. Figure 4 shows the average run time of the client and server processes per global communication round. From this result, we can see that FedMe’s run time on clients is competitive with that of other methods, even though FedMe trains two models. On the other hand, FedMe’s run time on the server is larger than that of other methods because the server in FedMe performs model clustering after obtaining the outputs of all the personalized models, which is a time-consuming task compared with the server processes of other methods. Note that we use the same hardware for the server and the clients in our experiments, while the server generally has more powerful computing resources in real-world scenarios. This means the computation cost on the server tends to be smaller in real-world applications. In addition, we can also control the computation cost of the server by changing the size of the unlabeled data. Thus, considering the accuracy gain of FedMe, we believe that the computation cost on the server is affordable.
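
The server-side cost discussed above comes from clustering models by their outputs on the unlabeled data. A minimal sketch of that idea is given below, assuming k-means over flattened prediction vectors; the farthest-first initialization, the function names, and the toy outputs are assumptions, and the actual clustering procedure may differ.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (squared Euclidean distance) for every point."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, num_iters=20):
    # Farthest-first initialization (an assumption that keeps the sketch deterministic).
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(num_iters):
        assign = nearest_center(points, centers)
        for c in range(k):
            members = points[assign == c]
            if len(members) > 0:  # keep the previous center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return nearest_center(points, centers)

def cluster_models(model_outputs, k):
    """model_outputs has shape (num_models, num_unlabeled, num_classes)."""
    features = model_outputs.reshape(len(model_outputs), -1)
    return kmeans(features, k)

# Toy outputs: 6 models on 4 unlabeled samples with 2 classes (hypothetical values).
rng = np.random.default_rng(0)
base_a = np.tile([0.9, 0.1], (4, 1))
base_b = np.tile([0.1, 0.9], (4, 1))
outputs = np.stack([b + rng.normal(0, 0.01, b.shape) for b in [base_a] * 3 + [base_b] * 3])
clusters = cluster_models(outputs, k=2)
```

The feature dimension grows linearly with the number of unlabeled samples, which is why reducing the size of the unlabeled data lowers the server's computation cost.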

Figure 4. Run time per global communication round: (a) client, (b) server.

5.2.5. Q5. What is the impact of data heterogeneity and fine-tuning on FedMe?

We finally investigate the impact of data heterogeneity. We conduct experiments using the CIFAR-10 dataset while varying the degree of data heterogeneity, which is controlled by a parameter of the Dirichlet distribution. A smaller value of this parameter indicates greater data heterogeneity, and IID indicates that the data distribution and the size of local data on clients are the same. We compare FedMe with Centralized and the top three most accurate existing methods in Table 4: FedAvg, HypCluster, and pFedMe. Table 7 shows the accuracy of each method with and without fine-tuning. First, FedAvg and HypCluster are equally accurate and have the highest accuracy in IID. This result indicates that when data is IID, it is enough to average the model parameters of the models on clients. We can also see that fine-tuning is not effective in IID. As the parameter decreases (i.e., data heterogeneity becomes greater), the accuracy of methods without fine-tuning decreases, but that of methods with fine-tuning increases. This is because, when the labels of local data are skewed, the training and test data on each client have similar labels. The accuracy of pFedMe, which is a personalized federated learning method, also increases as the parameter decreases. However, pFedMe always reports the worst performance (compared to the methods with fine-tuning) even though it was designed as a solution to a high degree of data heterogeneity and includes techniques similar to fine-tuning. This result was not observed in previous studies. For FedMe, when equipped with fine-tuning, it is the best (or the runner-up) method when local data is not IID. FedMe without fine-tuning is also the best (or the runner-up) method among the methods without fine-tuning. These results show that FedMe works well under a high degree of heterogeneity and demonstrate the robustness of FedMe when fine-tuning is absent.
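
The observation that averaging model parameters suffices in the IID case refers to the aggregation step of FedAvg (McMahan et al., 2017). A minimal sketch of that step is shown below, weighting each client by its local data size; the function name and the toy parameters are assumptions for illustration.

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Average each parameter across clients, weighted by local data size."""
    total = float(sum(client_sizes))
    return {
        name: sum(params[name] * (n / total)
                  for params, n in zip(client_params, client_sizes))
        for name in client_params[0]
    }

# Two hypothetical clients holding 1 and 3 local records, respectively.
client_params = [
    {"w": np.array([1.0, 1.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 3.0]), "b": np.array([4.0])},
]
global_params = federated_average(client_params, client_sizes=[1, 3])
```

This step requires all clients to share one architecture, which is the constraint that exchanging models and deep mutual learning are meant to remove.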

6. Conclusion and Future Work

In this paper, we presented FedMe, a novel federated learning method that builds personalized models with automatic model architecture tuning. In FedMe, clients exchange their models to tune and train their personalized models. FedMe can train models with different architectures by exchanging models and deep mutual learning. Our experiments showed that FedMe is more accurate than the state-of-the-art methods and can automatically tune the model architecture.

As future work, we plan to extend the model tuning and model clustering methods to tune the model architecture more flexibly. Although FedMe automatically tunes the model architecture, the candidates are limited to the model architectures that clients design in advance. Thus, FedMe may not work well if the optimal model architectures are not among them. We can improve the model tuning method to tune models more flexibly, for example by using neural architecture search. Additionally, model clustering is not very effective, so we can extend it to improve the accuracy.


  • [1] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz (2013) A public domain dataset for human activity recognition using smartphones. In European Symposium on Artificial Neural Networks, Vol. 3, pp. 437–442. Cited by: §1.
  • [2] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar (2018) Leaf: a benchmark for federated settings. arXiv preprint arXiv:1812.01097. Cited by: 1st item.
  • [3] H. Chen and W. Chao (2021) FedBE: making bayesian model ensemble applicable to federated learning. In International Conference on Learning Representations, Cited by: §4.1, §5.1.1.
  • [4] J. Gao, Z. Li, R. Nevatia, et al. (2017) Knowledge concentration: learning 100k object classifiers in a single cnn. arXiv preprint arXiv:1711.07607. Cited by: §4.2.
  • [5] N. Guha, A. Talwalkar, and V. Smith (2019) One-shot federated learning. arXiv preprint arXiv:1902.11175. Cited by: §4.2.
  • [6] A. Hard, K. Rao, R. Mathews, S. Ramaswamy, F. Beaufays, S. Augenstein, H. Eichner, C. Kiddon, and D. Ramage (2018) Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604. Cited by: §1.
  • [7] C. He, M. Annavaram, and S. Avestimehr (2020) FedNAS: federated deep learning via neural architecture search. In CVPR 2020 Workshop on Neural Architecture Search and Beyond for Representation Learning, Cited by: 1st item, §2.
  • [8] C. He, H. Ye, L. Shen, and T. Zhang (2020) Milenas: efficient neural architecture search via mixed-level reformulation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11993–12002. Cited by: §2.
  • [9] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §2.
  • [10] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: 2nd item.
  • [11] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau (2019) Federated learning for keyword spotting. In International Conference on Acoustics, Speech and Signal Processing, pp. 6341–6345. Cited by: §1.
  • [12] D. Li and J. Wang (2019) Fedmd: heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581. Cited by: §2, §5.1.1.
  • [13] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. In Machine Learning and Systems, Vol. 2, pp. 429–450. Cited by: §1, 3rd item, §5.1.1.
  • [14] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019) On the convergence of fedavg on non-iid data. In International Conference on Learning Representations, Cited by: §1.
  • [15] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y. Liang, Q. Yang, D. Niyato, and C. Miao (2020) Federated learning in mobile edge networks: a comprehensive survey. IEEE Communications Surveys & Tutorials. Cited by: §2.
  • [16] L. Liu, J. Zhang, S. Song, and K. B. Letaief (2019) Edge-assisted hierarchical federated learning with non-iid data. arXiv preprint arXiv:1905.06641. Cited by: §2.
  • [17] J. MacQueen et al. (1967) Some methods for classification and analysis of multivariate observations. In Berkeley symposium on mathematical statistics and probability, Vol. 1(14), pp. 281–297. Cited by: §4.1, §4.2.
  • [18] Y. Mansour, M. Mohri, J. Ro, and A. T. Suresh (2020) Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619. Cited by: §1, §2, §5.1.1.
  • [19] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §1, §2, §2, §5.1.1.
  • [20] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava (2021) A survey on security and privacy of federated learning. Future Generation Computer Systems 115, pp. 619–640. Cited by: §2.
  • [21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037. Cited by: §5.
  • [22] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295. Cited by: §5.1.1.
  • [23] T. Shen, J. Zhang, X. Jia, F. Zhang, G. Huang, P. Zhou, K. Kuang, F. Wu, and C. Wu (2020) Federated mutual learning. arXiv preprint arXiv:2006.16765. Cited by: §1, §2.
  • [24] C. T. Dinh, N. Tran, and J. Nguyen (2020) Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems, pp. 21394–21405. Cited by: §1, §2.
  • [25] H. Wang, M. Yurochkin, Y. Sun, D. Papailiopoulos, and Y. Khazaeni (2020) Federated learning with matched averaging. In International Conference on Learning Representations, Cited by: §2, 2nd item, §5.1.1, §5.1.1, §5.1.2.
  • [26] M. Zhang, K. Sapra, S. Fidler, S. Yeung, and J. M. Alvarez (2021) Personalized federated learning with first order model optimization. In International Conference on Learning Representations, Cited by: §1, §2, §3.
  • [27] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In Conference on Computer Vision and Pattern Recognition, pp. 4320–4328. Cited by: §1, §4.2.
  • [28] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations, Cited by: §2.