
Subspace Learning for Personalized Federated Optimization

09/16/2021
by   Seok-Ju Hahn, et al.
UNIST

As data is generated and stored almost everywhere, learning a model in a data-decentralized setting is a task of interest for many AI-driven service providers. Although federated learning has settled in as the main solution in such situations, there is still room for improvement in terms of personalization. Training of federated learning systems usually focuses on optimizing a global model that is deployed identically to all client devices. However, a single global model is not sufficient to give every client personalized performance, since local data is assumed to be non-identically distributed across clients. We propose a method that addresses this situation through the lens of ensemble learning, based on the construction of a low-loss subspace continuum that generates a high-accuracy ensemble of its two endpoints (i.e., the global model and a local model). We demonstrate that our method achieves consistent gains in both the personalized and unseen-client evaluation settings through extensive experiments on several standard benchmark datasets.


Introduction

Thanks to advances in communication and computation technologies, individuals and institutions are nowadays data producers as well as data keepers. As a result, training a machine learning model in a data-centralized setting is sometimes not viable. When too many clients generate their own data in real time on their devices, it is difficult to collect that data. Moreover, such data may contain private information that cannot be shared or transferred elsewhere. Federated learning (FL) ko+16; mc+17 addresses this situation, as it enables massively parallel, collaborative training of a machine learning model across clients or institutions without sharing their private data, under the orchestration of a central server (e.g., a service provider). In the most common FL setting, such as FedAvg mc+17, each participant trains a local model with its own data and transmits the resulting model parameters to the central server. The central server then aggregates (i.e., weighted-averages) the local parameters to update a global model and broadcasts this new global model to participating clients. This process is repeated until convergence.
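To make the aggregation step concrete, here is a minimal sketch of FedAvg-style weighted averaging of client parameters; the state-dict handling and the weighting by local sample counts are illustrative choices, not the authors' exact implementation:

```python
# Minimal sketch of FedAvg-style server aggregation (illustrative).
from collections import OrderedDict

def federated_average(client_states, client_sizes):
    """Weighted-average client state_dicts, weighting each by its local sample count."""
    total = float(sum(client_sizes))
    averaged = OrderedDict()
    for name in client_states[0]:
        averaged[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return averaged

# Usage: new_global = federated_average([m.state_dict() for m in local_models], sizes)
# The server would then broadcast `new_global` to the clients of the next round.
```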

However, a single global model is not enough to provide a satisfactory experience to all clients, due to the curse of data heterogeneity. Since there are inherent differences among clients (e.g., culture, country, age, etc.), the data residing in each client is not statistically identical; in other words, the data is not independent and identically distributed (non-IID). In this situation, a simple FL method such as FedAvg cannot avoid the pitfall of high generalization error, or even divergence of the global model ka+19. When this occurs, clients have little motivation to participate in federated optimization of a single global model, as each of them is better off training and exploiting a local model on its own data.

As training a single global model via FL cannot be a panacea for all clients, it is natural to rethink the scheme of FL. As a result, personalized FL (PFL) has become an essential component of FL and has spawned many related studies. Many approaches have been proposed for PFL, such as multi-task learning mocha; mar+21; fedu, modified optimization schemes based on regularization l2sgd; pfedme; fedamp, local fine-tuning Fallah; jiang+19; khodak+19, clustering ghosh+20; clustered; mansour+20; fedgroup, and model mixtures FedPer; lgfedavg; mansour+20; apfl. Among them, the model mixture-based approach is attracting attention for the flexibility and extensibility of its model structure along with promising performance. It separates the model into a local part and a global part, expecting the former to capture the heterogeneous local data distribution while the latter focuses on information common across clients. Thereby, each participating client ends up with an explicit personalized model (i.e., a local model) as a result of FL.

Meanwhile, this model mixture approach can basically be viewed as a type of ensemble learning, as its core idea is to interpolate two different models. However, there is a slight difference: a local model can only access the local data belonging to a specific client, while a global model implicitly gains diverse information from different clients' local data through model aggregation at the central server. This is typically not the case in ensemble learning, where the models participating in an ensemble usually have access to the same training data. As many ensemble strategies have been proposed in the context of deep learning deepensembles; swa-gaussian; swa; blundell+15; snapshot, analyses of the behavior of ensemble learning form an active research area garipov+18; fort2019deep; fort2019large; draxler18; modeconnect; nnsubspaces. Merely interpolating two different models (i.e., local and global models) for PFL, as in previous works, might be a valid approach, but explicitly promoting the mixture to perform well under the scheme of ensemble learning is a more attractive choice in the context of PFL.

In this study, we propose SuPerFed, a novel PFL method that leverages the power of ensemble learning through subspace learning. Since subspace learning provides an optimal region of the objective function instead of a single optimum point, many possible model combinations can be generated, which is useful for handling heterogeneous local data from multiple clients. Specifically, our method constructs a model mixture consisting of two models and involves two sequential optimization phases so that each model of the mixture specializes in federation and personalization, respectively, while lying in a harmonious subspace customized to the heterogeneous data distribution of each client. Thereby, without losing the overall performance of the global model or of each client's local model, the two models can be interpolated more elaborately than in existing model mixture-based PFL methods, which merely mix or connect them. Under this general framework, two different model mixing schemes are presented to find better subspaces. As a model interpolation-based ensemble learning method, the proposed approach reduces to existing benchmark FL methods for particular hyperparameter settings and is therefore guaranteed to perform at least as well as they do. With extensive experiments on standard benchmark datasets in a realistic federated setting with non-IID local data, we empirically demonstrate the efficacy and robustness of our method. As an improved model mixture-based PFL method, it shows a decent performance gain over comparable PFL methods by achieving better personalization in FL.

Related Works

FL with Non-IID Data After mc+17 proposed the vanilla FL algorithm (FedAvg), handling non-IID data across clients has been one of the major issues to be resolved in the FL field. zhao+18 showed that sharing a subset of each client's local data with the central server is effective in preventing divergence of the global model. However, this is an unrealistic assumption, as the original purpose of FL is to train a model in a data-decentralized setting. fedavgm suggested that accumulating model updates with momentum at the central server can mitigate the harmful oscillations caused by non-IID data. Modifying the model aggregation method at the central server pfnm; fedma; fedbe; pillu+19; fedbn and adding regularization to the optimization fedprox; ka+19; feddyn; fedpd are other branches of work that help the convergence of a single global model under a heterogeneous data setting. Although these studies boost the efficacy of FL, serving a single global model may still not be sufficient to deploy FL-driven services in practice.

PFL Methods As an extension of the above, PFL sheds light on a new perspective of FL. PFL aims to learn multiple personalized models suited to each client on top of the global model, and many PFL methodologies have been proposed. Multi-task learning-based PFL mocha; fedu; mar+21 treats each client as a different task and learns a personalized model for each client. pfedme proposed a PFL method that trains local and global models under a regularization term derived from Moreau envelopes, while l2sgd adopts a similar regularization term that intermittently penalizes local models to keep them from drifting far from the global model. Local fine-tuning-based PFL methods jiang+19; khodak+19; Fallah adopt a meta-learning approach (e.g., MAML maml) so that the global model can be promptly adapted into a customized personalized model for each client. Clustering-based PFL methods ghosh+20; clustered; mansour+20; fedgroup mainly assume that similar clients reach similar optimal global models. During federated optimization, similar clients are grouped into the same cluster using intermediate information (e.g., local gradients) transmitted to the central server. Cluster-specific global models are then constructed by aggregating local models within each cluster and serve as the personalized model for the clients of that cluster. Model mixture-based PFL methods apfl; FedPer; lgfedavg; mansour+20 either explicitly or semantically divide a model into two parts: one for capturing local knowledge and the other for learning knowledge common across clients. In these methods, only the latter part is shared with the central server, while the former resides locally in each client. FedPer defines the penultimate layer of a model as a personalization layer and only exchanges the other layers with the central server to construct a global model. In contrast, lgfedavg keeps the lower layers (a local encoder) on the clients and only exchanges a set of higher layers with the central server for learning a global model. In mansour+20 and apfl, each client holds at least two separate models: one as a personalized model and the other for constructing a global model. After iterative updates, the two models are combined as a convex combination. This type of method is most closely related to our proposed method.

Ensemble Learning of a Low-Loss Subspace

After many successful techniques for ensemble learning of neural network models were proposed blundell+15; snapshot; swa; swa-gaussian; deepensembles, follow-up studies analyzed the traits of ensemble models. In fort2019deep; fort2019large; modeconnect; garipov+18; draxler18, it was observed that there exists a low-loss subspace connecting two or more different networks separately trained on the same dataset, which is named a connector in nnsubspaces. Models constituting a connector have low cosine similarity with each other in the weight space when the two endpoint models are trained from different initializations. Although there are studies on constructing a low-loss subspace (connector) in swa; swa-gaussian; izmailov2020subspace, a recent study nnsubspaces proposed a simple and efficient approach to finding a connector by learning a parameterized subspace. Our method adopts this technique to construct a subspace between each local model and the global model as an explicit ensemble learning method for PFL.
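As a toy illustration of the connector idea, the following sketch linearly interpolates the weights of two independently trained networks and records the loss of each intermediate model; `model_a`, `model_b`, and `eval_loss` are hypothetical placeholders rather than objects from the cited works:

```python
# Illustrative sketch: probe the loss along the line segment between two trained models.
import copy

def interpolate_state(state_a, state_b, lam):
    """Return (1 - lam) * state_a + lam * state_b, parameter tensor by parameter tensor."""
    return {k: (1.0 - lam) * state_a[k] + lam * state_b[k] for k in state_a}

def probe_connector(model_a, model_b, eval_loss, steps=11):
    """Evaluate eval_loss(model) at evenly spaced points between model_a and model_b."""
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(steps):
        lam = i / (steps - 1)
        probe.load_state_dict(
            interpolate_state(model_a.state_dict(), model_b.state_dict(), lam))
        losses.append(eval_loss(probe))  # consistently low values indicate a low-loss subspace
    return losses
```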

Figure 1: Overview of the learning dynamics of SuPerFed. It consists of two phases: Phase I (federation phase) and Phase II (personalization phase).

Proposed Method

Overview We describe our proposed PFL method, SuPerFed, from the perspective of model ensembling via subspace learning. We adopt the common federated optimization scheme proposed in mc+17, where the central server orchestrates federated optimization across participating clients through iterative communication of model parameters. Since our method is essentially a model mixture-based PFL method, each client needs two models (one for the global model, the other for a local model) that share the same structure but have different initializations. SuPerFed is composed of two phases: a federation phase (Phase I) and a personalization phase (Phase II).

Notations We first define the arguments required for federated optimization: the total number of rounds, the local batch size, the number of local epochs, the total number of clients, the fraction of clients selected at each round, the round at which the personalization phase starts, and the local learning rate. Next, there are three hyperparameters specific to our method: a mixing ratio $\lambda$ sampled from the uniform distribution $U(0, 1)$, a constant $\mu$ for constructing a low-loss subspace between local and global models by inducing orthogonality, and a constant $\nu$ for a proximity term that prevents locally updated global models from diverging from the global model aggregated in the previous round.

Problem Statement Consider $K$ clients, where each client $k \in \{1, \dots, K\}$ has its own dataset $\mathcal{D}_k$ composed of a training set and a validation set, and a model resides in each client. In PFL, we assume that datasets are non-IID across clients, i.e., each dataset is sampled independently from a client-specific distribution over the input-label space. Denote by $\mathcal{H}$ a class of hypotheses; a hypothesis $h \in \mathcal{H}$ is learned by minimizing a loss function $\ell(h(x), y)$. We denote the expected loss of client $k$ by $\mathcal{L}_k(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}_k}[\ell(h(x), y)]$, and the corresponding empirical loss by $\hat{\mathcal{L}}_k(h)$. Then, the main objective of PFL is to optimize

$$\min_{h_1, \dots, h_K \in \mathcal{H}} \; \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_k(h_k), \qquad (1)$$

and we can minimize this objective through the empirical loss minimization:

$$\min_{h_1, \dots, h_K \in \mathcal{H}} \; \frac{1}{K} \sum_{k=1}^{K} \hat{\mathcal{L}}_k(h_k). \qquad (2)$$

Under this setting, each client $k$ holds a pair of model parameters at round $t$: a local copy of the global model, $w_g^k$, and a personalized local model, $w_l^k$, each initialized with a different random seed. The former is used for training the global model and the latter a personalized model within each client, using the local dataset $\mathcal{D}_k$. We suppose both models have the same structure, i.e., the same number of layers, type of activation function, etc. On the client side, the two models are mixed as a convex combination with a mixing ratio $\lambda \sim U(0, 1)$: $w_{mix} = (1 - \lambda)\, w_g^k + \lambda\, w_l^k$. The mixing is performed at every training batch update within the local epochs, either model-wise (Model Mixing) or layer-wise (Layer Mixing); see Phase II: Personalization for details. Note that the mixing is done in the weight space, not the output space. The mixture is then optimized on the local training dataset with a modified local loss function, composed of the canonical loss and additional regularization terms (for client $k$ at round $t$):

$$\mathcal{L}_k^t = \ell\big((1 - \lambda)\, w_g^k + \lambda\, w_l^k\big) + \mu \cos^2\!\big(w_l^k, w_g^k\big) + \nu \,\big\lVert w_g^k - w_g^{t-1} \big\rVert_2^2, \qquad (3)$$

where $w_g^{t-1}$ is a copy of the global model aggregated in the previous round, and $\cos(\cdot, \cdot)$ stands for the cosine similarity between two flattened weight vectors, defined as $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$. The first regularization term, adjusted by $\mu$, induces orthogonality between the two weight vectors, which thereby yields a low-loss manifold between the two endpoints in weight space. The second regularization term, adjusted by $\nu$, controls the proximity of the locally updated global model $w_g^k$ to the global model aggregated in the previous round. By doing so, we expect the local updates of the global model not to drift away from each other, which prevents divergence of the aggregated global model at the central server.
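A rough sketch of how such a local objective could be computed is given below; the coefficient names `mu` and `nu` and the flattening of parameters into single vectors follow our reading of the description above, not necessarily the authors' exact implementation:

```python
# Sketch of the regularized local objective: task loss on the mixed model
# plus squared cosine similarity (orthogonality) and proximity regularizers.
import torch
import torch.nn.functional as F

def flatten(params):
    """Concatenate a list of parameter tensors into one vector."""
    return torch.cat([p.reshape(-1) for p in params])

def local_loss(mixed_logits, targets, local_params, global_params, prev_global_vec,
               mu=1.0, nu=0.01):
    task = F.cross_entropy(mixed_logits, targets)        # canonical loss on the mixed model
    w_l, w_g = flatten(local_params), flatten(global_params)
    ortho = F.cosine_similarity(w_l, w_g, dim=0) ** 2    # push local/global weights toward orthogonality
    prox = (w_g - prev_global_vec).pow(2).sum()          # keep the local copy close to last round's global model
    return task + mu * ortho + nu * prox

# prev_global_vec is assumed to be a flattened, detached snapshot of the previous round's global model.
```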

On the server side, only the global model $w_g^k$ of each selected client is uploaded every round, and the central server aggregates these by weighted averaging to obtain an updated global model, which is broadcast back to the clients selected in the next round as their local copy of the global model. By adjusting the hyperparameters $\mu$, $\nu$, and the personalization start round, we define Phase I and Phase II, described in the following subsections. See Figure 1 for a simplified view of the proposed method.

Phase I: Federation The federation phase (Phase I) aims to train a global model that captures general knowledge common to the participating clients. This is a natural choice, since training a personalized model at the early stage of FL may prevent the global model from converging. When the mixing is disabled and the orthogonality coefficient is set to $\mu = 0$, objective (3) becomes equivalent to the objective proposed in fedprox; pfedme, which introduced a proximity regularization term to prevent a local update of the global model from diverging in harmful directions. Until the federated optimization reaches the round at which the personalization phase starts, clients only update the downloaded global model, boosting the performance of the global model aggregated at the central server.

Phase II: Personalization After Phase I is completed, the personalization phase (Phase II) begins. The main goal of this phase is to build a personalized model for each client on top of the federated global model trained during Phase I. From this phase on, a mixing process (Model Mixing or Layer Mixing) using $\lambda$ is applied between the local and global models at every batch update during training. In detail, Model Mixing treats each model as a single weight vector and combines the two with one $\lambda$, ignoring the composition and number of layers, while Layer Mixing combines them layer-wise, applying a different $\lambda$ per layer. Since merely introducing the mixing process does not bring the benefit of ensemble learning, the orthogonality coefficient $\mu$ must be set to a non-zero value. In fort2019deep; nnsubspaces, it is observed that independently trained models have dissimilar parameters in terms of cosine similarity; thus, enforcing two differently initialized models to maintain mutually dissimilar parameters leads to a low-loss manifold continuum between the two models. Here, "between two models" means every possible realization of a mixed model obtained by varying $\lambda$. In ortho1; ortho2; ortho3, inducing orthogonality (i.e., forcing cosine similarity to zero) between parameters prevents a model from learning redundant features given its capacity. In this context, we expect the local model of each client to learn client-specific knowledge that cannot be captured by training the global model in Phase I. The proximity regularization on the global model, weighted by $\nu$, remains active in Phase II. See Algorithm 1 for the whole optimization process of SuPerFed.
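The two mixing schemes could be sketched as follows: one mixing ratio for the entire network (Model Mixing) versus a separate ratio per layer (Layer Mixing). The dictionary-of-tensors representation is just one convenient way to mix matching parameters in weight space; it is not the authors' exact code:

```python
# Sketch of the two mixing schemes over matching parameter dictionaries (mixing in weight space).
import torch

def model_mixing(local_state, global_state, lam):
    """One mixing ratio `lam` shared by every parameter tensor of the network."""
    return {k: lam * local_state[k] + (1.0 - lam) * global_state[k] for k in local_state}

def layer_mixing(local_state, global_state):
    """A separate mixing ratio, drawn uniformly, for each parameter tensor (layer)."""
    mixed = {}
    for k in local_state:
        lam = torch.rand(())  # per-layer ratio in [0, 1)
        mixed[k] = lam * local_state[k] + (1.0 - lam) * global_state[k]
    return mixed

# At every batch update a fresh ratio is drawn, the mixed weights are loaded into a model of the
# same architecture, and a gradient step on the local objective updates both endpoint models.
```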

Input: number of total rounds, starting round of the personalization phase, local batch size, number of local epochs, learning rate, fraction of clients sampled per round, clients each having their own local dataset, orthogonality coefficient $\mu$, proximity coefficient $\nu$.

Central server orchestrates:

1:  Initialize the global model and each client's local model with different random seeds.
2:  Broadcast a copy of the global model to all clients.
3:  for each round do
4:     Select a fraction of the clients at random.
5:     for each selected client in parallel do
6:        ClientUpdate(client, global model)
7:     end for
8:     Aggregate the uploaded global models by weighted averaging.
9:  end for
10:  return the global model

ClientUpdate(client, global model):

1:  Download the global model from the server.
2:  for each local epoch do
3:     Split the local training set into batches of the given batch size.
4:     for each local batch do
5:        Sample a mixing ratio $\lambda \sim U(0, 1)$ (Phase II only).
6:        Mix the local and global models with $\lambda$ (Model Mixing or Layer Mixing).
7:        Update both models by taking a gradient step on the local objective (3).
8:     end for
9:  end for
10:  Upload the locally updated version of the global model to the server.
Algorithm 1 SuPerFed

Experiments

Datasets and Settings for PFL We used four datasets: CIFAR10, CIFAR100 cifar, EMNIST emnist, and TinyImageNet tinyimagenet. Each dataset is for a multi-class image classification task, containing 60,000 images with 10 classes, 60,000 images with 100 classes, 814,255 images with 62 classes, and 110,000 images with 200 classes, respectively.

As model mixture-based PFL provides both a single global model and multiple personalized models, we first set aside the test set of each dataset (CIFAR10, CIFAR100, EMNIST, TinyImageNet, in order) for the evaluation of the global model (the global test set), containing 10,000, 10,000, 82,587, and 10,000 samples from all classes, respectively. The remaining portion of each dataset is split across clients and used for updating and evaluating personalized models. For CIFAR10 and CIFAR100, we adopt the pathological non-IID split proposed in mc+17, where each client has the same number of samples drawn from at most two classes. For a more realistic scenario suited to PFL, we split the EMNIST emnist and TinyImageNet tinyimagenet datasets semantically by labels, with unbalanced sample sizes. A detailed description of the dataset split is given in the Appendix. For all experiments, we used a simple 5-layer convolutional neural network as the backbone of both the global and local models; its detailed structure is described in the Appendix. We used the SGD optimizer with momentum 0.9 throughout all experiments.

Throughout all experiments, if not specified otherwise, we fix the local batch size, the number of local epochs, the client fraction, and the learning rate (see Appendix), and set the number of clients to 100, 100, 1,543, and 390 for CIFAR10, CIFAR100, EMNIST, and TinyImageNet, respectively. As a consequence, each client has 500 samples for CIFAR10 and CIFAR100, 452 samples on average (with a standard deviation of 203) for EMNIST, and 252 samples on average (with a standard deviation of 88) for TinyImageNet. We further let each client keep 80% of its samples as a local training set and the remaining 20% as a local test set. The performance of the personalized model is therefore reported by averaging the evaluation metric (i.e., accuracy) on the local test sets of all clients.
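As a small illustration, the per-client 80/20 split and the reported personalization metric could look like the following; the shuffling and seeding details are illustrative, not taken from the paper:

```python
# Sketch: split each client's samples 80/20 into local train/test sets and average local test accuracy.
import random

def split_client(indices, train_frac=0.8, seed=0):
    """Shuffle a client's sample indices and split them into local train/test subsets."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    return idx[:cut], idx[cut:]

def personalized_mean_accuracy(local_accuracies):
    """The reported metric: average of local test accuracies over all clients."""
    return sum(local_accuracies) / len(local_accuracies)
```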

Main Results In this section, we empirically examine the behavior of SuPerFed through extensive experiments and evaluate its personalization performance on several benchmark datasets. As SuPerFed inherits the scheme of model mixture-based PFL, we compared our method with other model mixture-based PFL methods (FedPer FedPer, APFL apfl, LG-FedAvg lgfedavg) and with two basic FL methods that share a similar optimization strategy with ours (i.e., model averaging and proximity regularization): FedAvg mc+17 and FedProx fedprox. We also considered variants of FedAvg and FedProx for a fair comparison of personalization performance, adopting the strategy proposed in wang: after the update of a single global model through FL is finished, the global model is locally updated for a few steps using the local data of each client. We denote these variants FedAvg-LU and FedProx-LU, where LU stands for local update. We report the best accuracy of each method after tuning its hyperparameters. Our method (SuPerFed-MM and SuPerFed-LM) achieves the best mean local accuracy (i.e., personalization performance) on all datasets. Note that, as LG-FedAvg and FedPer only exchange parts of a model, their global-model accuracy cannot be evaluated on the global test set at the central server.

Results for the pathological non-IID setting (CIFAR10 and CIFAR100) are shown in Table 1, and those for the realistic scenario (EMNIST and TinyImageNet) in Table 2. Our method outperforms existing model mixture-based PFL methods as well as the basic FL methods and their variants in terms of personalization performance.

We implemented all methods using PyTorch pytorch version 1.8.1 and CUDA version 11.1; all experiments were conducted on Ubuntu 18.04 LTS with an Intel(R) Xeon(R) Gold 5120 CPU, one NVIDIA Tesla A100 GPU, and 122 GB of memory.

                   CIFAR10                                          CIFAR100
Method         Global       Global       Personalized   ECE      Global       Global       Personalized   ECE
               Top-1 Acc.   Top-5 Acc.   Mean Acc.      Loss     Top-1 Acc.   Top-5 Acc.   Mean Acc.      Loss
FedAvg         47.09        89.37                                17.58        42.91
FedAvg-LU
FedProx        45.17        89.22                                17.78        43.69
FedProx-LU
APFL           31.60        79.36                                19.81        47.67
FedPer         -            -                                    -            -
LG-FedAvg      -            -                                    -            -
SuPerFed-MM    45.35        89.87                                17.62        43.36
SuPerFed-LM    47.06        90.07                                18.17        43.85
Table 1: Experimental results on the pathological non-IID setting (CIFAR10 and CIFAR100 datasets) compared with other FL and PFL methods.
                   EMNIST                                           TinyImageNet
Method         Global       Global       Local          ECE      Global       Global       Local          ECE
               Top-1 Acc.   Top-5 Acc.   Mean Acc.      Loss     Top-1 Acc.   Top-5 Acc.   Mean Acc.      Loss
FedAvg         76.67        98.24                                7.92         21.43
FedAvg-LU
FedProx        75.49        98.32                                7.97         21.79
FedProx-LU
APFL           76.62        98.22                                8.52         22.88
FedPer         -            -                                    -            -
LG-FedAvg      -            -                                    -            -
SuPerFed-MM    73.53        97.15                                9.09         23.75
SuPerFed-LM    73.85        97.28                                9.02         25.10
Table 2: Experimental results on the realistic scenario (EMNIST and TinyImageNet datasets) compared with other FL and PFL methods.

When to Start Personalization We conducted an experiment on selecting the round at which Phase II, the personalization phase, begins. As our proposed method can use either Model Mixing or Layer Mixing, we designed separate experiments for each scheme, denoted SuPerFed-MM and SuPerFed-LM. Varying the start of the personalization phase in 0.1 increments from 0 to 0.9 of the total number of rounds, we evaluated the mean accuracy of the personalized models and the accuracy of the global model. This experiment was conducted only on the CIFAR10 dataset with fixed regularization coefficients. Experimental results are summarized in Table 3.

We see some differences according to the mixing scheme. For Model Mixing, the performance of the global model is similar across different starting rounds. For Layer Mixing, on the other hand, it tends to increase as the personalization phase starts later. This might be due to the size of the hypothesis space of each setting: since Layer Mixing admits more diverse combinations of models, there are more chances of divergence when the personalization phase starts in an early round. In terms of personalization performance, a moderately delayed start shows the best performance with a small standard deviation. These observations imply that the proposed phase-wise learning scheme is a valid strategy. While the starting round is a tunable hyperparameter, it can be adjusted according to the purpose of FL: to obtain a good global model, choosing a starting round close to the total number of rounds is a valid choice, whereas for personalization it should start well before the final round.

                           SuPerFed-MM                                SuPerFed-LM
Start of Phase II      Local        Global       Global        Local        Global       Global
(fraction of rounds)   Mean Acc.    Top-1 Acc.   Top-5 Acc.    Mean Acc.    Top-1 Acc.   Top-5 Acc.
0.0                                 44.51        90.23                      33.61        83.78
0.1                                 44.22        89.28                      38.47        86.09
0.2                                 46.02        89.29                      40.93        87.73
0.3                                 45.06        89.48                      43.74        88.39
0.4                                 45.35        89.87                      45.35        89.51
0.5                                 45.19        88.97                      45.92        89.05
0.6                                 45.09        89.38                      45.23        88.82
0.7                                 47.02        88.80                      46.02        89.75
0.8                                 46.74        89.36                      47.06        90.07
0.9                                 47.73        89.74                      48.06        89.62
Table 3: Experiment on selecting the starting round of the personalization phase (Phase II), expressed as a fraction of the total number of rounds. The best mean accuracy of local models is reported for each setting.

Ablation Study: Effects of Hyperparameters Our proposed method has two regularization hyperparameters, $\mu$ and $\nu$. The former is expected to construct a good subspace from which both the global model and the local personalized model benefit, by inducing orthogonality between the local and global models, while the latter prevents undesirable divergence of local updates away from the direction of the global update.

We conducted experiments on combinations of $\mu$ and $\nu$ (16 combinations in total), under the Model Mixing and Layer Mixing settings respectively, with the personalization start round fixed to the value that showed the best personalization performance, on the CIFAR10 dataset. Experimental results for these combinations are displayed in Figure 2.

There are noticeable observations supporting the efficacy of our method. First, the ensemble learning strategy of finding a low-loss subspace between local and global models yields personalized models with high performance: in most settings with non-zero $\mu$ and $\nu$, a mixture of the global and local models (i.e., models realized by intermediate mixing ratios) is better than the two single models at the endpoints (i.e., the global model and the local model). Second, a moderate degree of orthogonality regularization (adjusted by $\mu$) boosts the personalization performance of the local model, whereas too large a value can rather harm performance. Third, the proximity regularization restricts learning too much when $\nu$ is large, but often helps the global model not to diverge when it is small. Therefore, as expected, a good balance can be struck by carefully choosing $\mu$ and $\nu$. Meanwhile, when both $\mu$ and $\nu$ are set to zero, SuPerFed reduces to FedAvg, and when only $\mu$ is set to zero it reduces to FedProx (with the additional condition that there is no Phase II).

Figure 2: Effects of hyperparameters $\mu$ and $\nu$. The left group of plots, colored in orange, shows the performance of SuPerFed-MM; the right group, colored in sky blue, shows SuPerFed-LM. The vertical axis of each subplot is accuracy, and the horizontal axis spans the range of possible mixing ratios from 0 to 1. The metrics on each subplot indicate the performance of the global model (above) evaluated on the global test set at the central server, and the best average performance of the local model (in parentheses) evaluated on the local test set of each client. Error bars indicate the standard deviation of personalized models realized by different values of the mixing ratio.

Calibration While ensemble learning methods can boost the performance of machine learning models deepensembles; fort2019deep; swa; blundell+15, they also provide uncertainty estimation, which is closely related to how well-calibrated a model is. In this context, we evaluate the expected calibration error (ECE; ece) of our proposed method and compare it with the others. ECE measures the consistency between prediction confidence and accuracy, so a lower value is preferred. Results are shown in Tables 1 and 2.
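A minimal sketch of the ECE computation we refer to is shown below (equal-width confidence bins; 15 bins is a common but arbitrary choice, not a setting taken from the paper):

```python
# Sketch of expected calibration error (ECE): confidence-accuracy gap, averaged over confidence bins.
import torch

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, C) softmax outputs, labels: (N,) integer class labels."""
    conf, pred = probs.max(dim=1)                  # per-sample confidence and predicted class
    correct = pred.eq(labels).float()
    edges = torch.linspace(0, 1, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (conf[in_bin].mean() - correct[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap     # weight the gap by the fraction of samples in the bin
    return ece.item()
```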

Conclusion and Future Work

In this paper, we propose a simple and scalable personalization method for federated optimization under the model mixture scheme, built by constructing a low-loss subspace between a local and a global model. With extensive experiments, we empirically demonstrated that SuPerFed outperforms existing model mixture-based PFL methods as well as two basic FL methods and their variants, in two different model mixing schemes (Model Mixing and Layer Mixing). As our method stems from constructing an ensemble-friendly subspace, the resulting personalized models are well-calibrated at a nominal cost, which could also make them robust to label noise in each client. In future work, we plan an in-depth analysis of the unusual behavior of the global model (i.e., high accuracy after Phase II despite an increasing loss) and of the convergence rate of our method.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) under Grant No. 2020R1C1C1011063, as well as the Kakao i Research Membership Program by the Kakao Enterprise.

References

Appendix

Layer      EMNIST                              CIFAR10 & CIFAR100                         TinyImageNet
conv1      in=1, out=32, k=(5,5), s=1, p=0     in=3, out=32, k=(5,5), s=1, p=0            in=3, out=32, k=(5,5), s=1, p=0
maxpool1   k=(2,2), p=0                        k=(2,2), p=0                               k=(3,3), p=0
conv2      in=32, out=64, k=(5,5), s=1, p=0    in=32, out=64, k=(5,5), s=1, p=0           in=32, out=64, k=(5,5), s=1, p=0
maxpool2   k=(2,2), p=0                        k=(2,2), p=0                               k=(3,3), p=0
conv3      in=64, out=32, k=(2,2), s=1, p=0    in=64, out=32, k=(3,3), s=1, p=0           in=64, out=32, k=(2,2), s=1, p=0
maxpool3   k=(2,2), p=0                        k=(2,2), p=0                               k=(3,3), p=0
fc1        in=32, out=512, k=(1,1), s=1, p=0   in=32, out=512, k=(1,1), s=1, p=0          in=32, out=512, k=(1,1), s=1, p=0
fc2        in=512, out=62, k=(1,1), s=1, p=0   in=512, out=10 or 100, k=(1,1), s=1, p=0   in=512, out=200, k=(1,1), s=1, p=0
Table 4: Model architecture (both the global and the local model) at each client. Abbreviations in, out, k, s, and p stand for the number of input channels, the number of output channels, the kernel size, the stride, and the input padding.

Figure 3: Optimization trajectories of the global model at the central server (first row: SuPerFed-MM, second row: SuPerFed-LM; first column: Top-1 accuracy curve, second column: loss curve). The selected starting round of the personalization phase is displayed as a black vertical line on each subplot.
Figure 4: Effects of local batch size and local epochs. The left group of plots, colored in green, shows the performance of SuPerFed-MM; the right group, colored in magenta, shows SuPerFed-LM. The vertical axis of each subplot is accuracy, and the horizontal axis spans the range of possible mixing ratios from 0 to 1. The metrics on each subplot indicate the performance of the global model (above) evaluated on the global test set at the central server, and the best average performance of the local model (in parentheses) evaluated on the local test set of each client. Error bars indicate the standard deviation of personalized models realized by different values of the mixing ratio.

Model Architecture We used a simple 5-layer convolutional neural network for both the global and the local model located in each client. In other words, each client has two models with the same structure but different initialization values. A detailed description of the model architecture is shown in Table 4. We used the ReLU activation function for all layers except the output layer.

Dataset Split Scenario We used four benchmark datasets: CIFAR10 & CIFAR100 cifar, EMNIST emnist, and TinyImageNet tinyimagenet. As these datasets are not designed for the FL setting, we curated them to be suitable for FL (i.e., non-IID data across clients).
For both CIFAR10 and CIFAR100, we adopted the pathological non-IID setting proposed in mc+17, where each client has samples from at most two classes; thus some clients hold only a subset of labels that is mutually exclusive with those of some other clients, while all clients have the same sample size.
On the other hand, for the EMNIST and TinyImageNet datasets we split the data semantically (i.e., a realistic scenario) with unbalanced sample sizes. This is intended to simulate a more challenging scenario, closer to the realistic situations that most AI-driven FL service providers may face.
For EMNIST, although it is composed of 62 classes, these are semantically grouped into three groups: 10 digits, 26 lowercase letters, and 26 uppercase letters. We therefore randomly shuffle and split them semantically (i.e., per label group) into five clusters. As a result, each cluster has 2 digits and 5 to 6 lowercase and uppercase letters, which do not overlap across clusters. Then, we randomly split the samples within each cluster among clients holding between 100 and 800 samples. We thereby end up with 1,543 clients having 452 samples on average with a standard deviation of 203. For TinyImageNet, as it has 200 classes in total, we create 10 groups, each containing samples from 20 mutually exclusive classes. Within each of the resulting 10 groups, we randomly split samples among clients holding between 100 and 400 samples. We thereby end up with 390 clients having 252 samples on average with a standard deviation of 88.
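For illustration, a rough sketch of the pathological split described above (each client receiving shards from at most two classes) could look like this; shard sizing and shuffling details are our own simplifications:

```python
# Sketch of the pathological non-IID split: sort indices by label, cut into shards,
# and hand each client a fixed number of shards (two shards -> at most two classes).
import numpy as np

def pathological_split(labels, num_clients, shards_per_client=2, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                                   # indices grouped by class
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)                              # per-client sample indices
    ]

# e.g., client_indices = pathological_split(np.array(train_labels), num_clients=100)
```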

Detailed Configurations of Experiments Throughout all experiments, if not specified otherwise, we used the SGD optimizer with momentum 0.9 and fixed the local batch size, the number of local epochs, the fraction of participating clients, and the learning rate across datasets, while the number of clients was set to 100, 100, 1,543, and 390 for CIFAR10, CIFAR100, EMNIST, and TinyImageNet, respectively. For each dataset, we applied a different total number of rounds: 200, 300, 100, and 500 for CIFAR10, CIFAR100, EMNIST, and TinyImageNet, respectively. For the ablation study, we used the CIFAR10 dataset, modifying only the total number of rounds.

Unusual Behavior of a Global Model One interesting observation is the optimization trajectory of the global model aggregated at the central server (see Figure 3). When Phase II starts (denoted by the black vertical line), the loss of the global model surges upward with little decrease in performance (i.e., accuracy), which is counterintuitive. This tendency is more noticeable with the Layer Mixing scheme. Even though the loss soars to almost five times its magnitude, the accuracy curve of the global model retains high accuracy, with diminished oscillations. Moreover, when training proceeds up to 2,000 rounds, we observed that both accuracy and loss converge.
We presume the cause of this phenomenon is the orthogonality-inducing regularization. Since neural networks are usually overparameterized and can memorize training data with almost zero error overparam1; overparam2, the global model may also be capable of storing knowledge (in the form of parameters to be aggregated) from many different clients. As the orthogonality-inducing term decorrelates local models and locally updated global models ortho1; ortho2; ortho3, we presume that what each model has learned can be aggregated with little interference, even under a simple weighted averaging scheme.

Effects of Local Batch Size and Local Epochs In FL, the behavior of local updates is important in terms of convergence and the communication-computation trade-off convergence. We therefore conducted further experiments on the effects of local batch size and local epochs by varying them on the CIFAR10 dataset. In Figure 4, we report results of experiments analogous to Figure 2.
It is easily noticed that extremely short local updates harm both the local and the global model's performance, while longer local updates do better but sometimes fail. Local batch size shows the same trend: when the batch size is too small, optimal performance cannot be achieved, possibly because overly noisy updates at each client cause divergent behavior across clients. Recall that our proposed method pursues an ensemble of a local and a global model at each client by mixing (i.e., combining) different combinations of the two at every training batch update; it is therefore favorable to retain some degree of stochasticity in the local updates (i.e., to maintain a sufficient number of iterations). Thus, we conclude that selecting a proper configuration of local updates is helpful for constructing an efficient client-specific subspace for personalization, while also considering computation and communication efficiency.
It can be easily notified that extremely small local updates (i.e., ) harm both local and global model’s performance, while longer local updates (i.e., ) did better but sometimes failed (i.e., ). Local batch size also showed the same trend. When the batch size is too small, we cannot achieved an optimal performance, which may be due to too noisy update at each client can cause divergent behaviors across clients. Recall that our proposed method basically pursuits an ensemble of a local and a global model at each client by mixing (i.e., combining) different combination of the local and the global model at every training batch update, it is favorable to retain some degree of stochasticity on local updates (i.e., maintaining sufficient number of iterations). Thus, we conclude that selecting a proper configuration of local update (e.g., ) is more helpful in constructing an efficient client-specific subspace for a personalization, while considering computation and communication efficiency.