Federated learning (FL) is a collaborative learning technique to build machine learning models from data distributed among several participants ("agents"). The objective is to generate a common, robust model not by exchanging the data between agents, but rather through exchanging the parameter updates of the common model that all agents share at a certain frequency. This technique addresses some major problems of centralized data-sharing approaches, can enable data privacy and security and is therefore attractive to many applications with sensitive data.
FL can be centralized or decentralized. We consider the centralized approach, in which a single server orchestrates the evolution of the algorithm, as opposed to decentralized learning, where the orchestration is delegated to the interconnected nodes. Each iteration of FL can be divided into four main steps. First, the central server sends the current model to all agents. Second, each agent trains the model on its own data. Third, the agents send the updated model parameters back to the central server. Finally, the central server aggregates these results and generates a new global model. WAFFLE intervenes in this last step: the individual agents’ contributions are aggregated in such a way as to produce a personalized model for each specific agent, instead of a global one-size-fits-all model.
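The four steps of a centralized FL round can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: the toy linear model, the function names `local_train` and `federated_round`, and the synthetic data are all assumptions, and the aggregation shown is a plain unweighted average.

```python
import random

def local_train(weights, data, lr=0.1, steps=5):
    """Step 2: an agent runs a few SGD steps on its own (x, y) pairs for y ~ w*x + b."""
    w, b = weights
    for _ in range(steps):
        x, y = random.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return (w, b)

def federated_round(global_weights, agents_data):
    # Step 1: the server sends the current model to every agent.
    # Steps 2-3: each agent trains locally and returns its updated parameters.
    local_models = [local_train(global_weights, data) for data in agents_data]
    # Step 4: the server aggregates the results (here, an unweighted average).
    n = len(local_models)
    return tuple(sum(m[k] for m in local_models) / n for k in range(2))

random.seed(0)
# Three hypothetical agents, each holding noiseless samples of y = 2x + 1.
agents_data = [[(x / 10, 2 * x / 10 + 1) for x in range(i, i + 10)] for i in range(3)]
model = (0.0, 0.0)
for _ in range(300):
    model = federated_round(model, agents_data)
```

Replacing the unweighted average in step 4 with a weighted one is the point at which a personalized scheme such as WAFFLE departs from this loop.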
1.1 Heterogeneous Data Distributions
Federated learning develops a common model for all agents, typically under the assumption that data is independent and identically distributed (IID) across agents. In real applications this assumption rarely holds, and thus the performance of the learning process can vary significantly. Five categories of non-IID data have been identified by kairouz2019_FL_advances_open_problems:
Covariate shift (Change in the distribution of the input variables present in the training and the test data)
Prior probability shift (The target variable distribution changes but the input feature distribution does not)
Concept drift (Change in the relationship between the input and output variables of the problem i.e. not related to the distribution of the data or classes.)
Concept shift (For the same features, each agent has a different label)
Unbalancedness or label skew (The distribution of labels varies between agents, but the "true" classification function is the same for all agents)
In this paper we focus on concept shift and label skew. As concept shift does not occur naturally in our datasets, it is simulated by randomly swapping the labels of all agents, except for the selected agent, which we will call Alice. In this scenario, Alice seeks a personalized model and requests assistance from the other nine agents. A solution for dealing with non-IID data is to train a model personalized to the unique data distributions of each agent, rather than a single global model. Recently, there has been significant interest in this topic with several studies proposing model personalization strategies [grimberg2020weight, deng2020APFL, li_ditto_FL_2021, fallah_personalized_MAML_2020].
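The synthetic concept shift described above can be sketched as follows. We assume here that "swapping" means applying an independent random permutation of the class ids to each agent other than Alice; the helper names and the toy label lists are hypothetical.

```python
import random

def swap_labels(labels, num_classes, rng):
    """Relabel every sample with a random permutation of the class ids."""
    perm = list(range(num_classes))
    rng.shuffle(perm)
    return [perm[y] for y in labels]

def simulate_concept_shift(agent_labels, alice=0, num_classes=10, seed=0):
    """Permute the labels of every agent except Alice (assumption: one permutation per agent)."""
    rng = random.Random(seed)
    shifted = []
    for agent_id, labels in enumerate(agent_labels):
        if agent_id == alice:
            shifted.append(list(labels))  # Alice keeps her true labels
        else:
            shifted.append(swap_labels(labels, num_classes, rng))
    return shifted

# Ten hypothetical agents, each holding one sample per class.
agent_labels = [list(range(10)) for _ in range(10)]
shifted = simulate_concept_shift(agent_labels)
```

Because the features are untouched and only the labels change, the same input now maps to different labels on different agents, which is the definition of concept shift given in the list above.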
1.2 WAFFLE: Building on SCAFFOLD
WAFFLE, as mentioned earlier, is based on SCAFFOLD as proposed by karimireddy2020scaffold. SCAFFOLD is an attractive starting point as it addresses the phenomenon of client drift (for a global model). This phenomenon occurs when training data is heterogeneous (non-IID) between agents, prolonging the time-to-convergence for federated averaging (FedAvg). The other interesting aspect of SCAFFOLD is that each agent uses an estimate of the sum of the gradients of all other agents to “correct” each local SGD step. This approach allows SCAFFOLD to converge faster than FedAvg and towards a better global model. WAFFLE seeks to keep this faster convergence but build a personalized model instead of a global model. We take inspiration from the Weight Erosion scheme by grimberg2020weight by assigning a weight to each agent based on the Euclidean distance between their update and the update computed by Alice in the same round.
We build on an existing FL framework called SCAFFOLD and transform it to create a personalized collaborative machine learning algorithm named WAFFLE (Weighted Averaging For Federated LEarning).
We use the Euclidean distance between agents’ updates to weigh their contributions and thus minimize the trained model’s loss on one specific agent.
We show that WAFFLE can perform with hyperparameters that are easier to handle than those of most personalized learning methods.
We contribute to a broader evaluation of two personalized federated learning methods: APFL and Weight Erosion.
We show that, for the two kinds of heterogeneous data distributions considered (concept shift and label skew), WAFFLE in most cases matches or improves the accuracy of other personalized FL methods while converging faster.
1.4 Related Work
Recently, there have been several key works on personalization in federated learning, see e.g. kulkarni2020survey for an overview. Some methods of adapting global models for individual agents are summarized in kairouz2019_FL_advances_open_problems, such as local fine-tuning, multi-task learning, and meta-learning.
Local fine-tuning is a particularly popular method [deng2020APFL]. The principle is that each agent receives the same global model and tunes it by performing a small number of stochastic gradient descent steps on its local data. WAFFLE, the algorithm proposed in this paper, is conceptually related to local fine-tuning, as are Weight Erosion [grimberg2020weight] and Deep layer aggregation [Wang2019]. Multi-task learning solves several related tasks at the same time; with this method, we consider the optimization on each agent as a new task. The MOCHA algorithm [Smith2017] is an example of this method: it learns both the model parameters and a matrix of relations between devices, and applies to convex problems. More recently, li_ditto_FL_2021 propose a framework alternating between solving for the average solution and local solutions. The last method is meta-learning, which relies on the ability of a model that has previously been exposed to a variety of tasks to adapt to new ones. It has the advantage of building learning models that require only a small number of training examples to solve new tasks, as described by Finn2017 with the model-agnostic meta-learning (MAML) algorithm. A more recent approach by fallah_personalized_MAML_2020 implements MAML and seeks a global model that performs well after each user has updated it with respect to its own loss function. A recent idea similar to meta-learning is the interpolation of a local and a global model, which has been studied with APFL in [deng2020APFL]. The authors propose an adaptive federated learning algorithm that learns a mixture of local and global models as a personalized model.
2 The WAFFLE Algorithm
To introduce WAFFLE, we will first summarize the known SCAFFOLD FL algorithm on which it is based, and then detail how personalization can be achieved by weighted averaging between agents, for suitable weight choices.
SCAFFOLD maintains an estimate of what all agents are learning, called a control variate, so that agents can simulate having all the data and estimate the direction in which they should update their gradient. Thus, it is more efficient and addresses the problem of client drift. karimireddy2020scaffold prove that SCAFFOLD converges to the globally optimal model instead of converging to a weighted average of local models and does so much faster and more accurately than FedAvg in the presence of inter-agent heterogeneity.
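To make the role of the control variate concrete, the following sketch runs SCAFFOLD on a scalar toy problem. The local step and the control variate update follow the description in karimireddy2020scaffold (using the variant of the control variate update that reuses the local steps); the quadratic losses, the full participation, the global step size of 1, and all function names are assumptions made for this illustration.

```python
def scaffold_local_update(x, c, c_i, grad_fn, lr=0.1, local_steps=10):
    """One agent's local work in a SCAFFOLD round, for a scalar model."""
    y = x
    for _ in range(local_steps):
        # Local gradient corrected by (server control variate - own control variate).
        y -= lr * (grad_fn(y) - c_i + c)
    # New local control variate, estimated from the progress made locally.
    c_i_new = c_i - c + (x - y) / (local_steps * lr)
    return y - x, c_i_new - c_i  # (model update, control variate update)

def run_scaffold(targets, rounds=20):
    """Toy run: agent i has loss 0.5 * (y - targets[i])**2; all agents participate."""
    x, c = 0.0, 0.0
    c_locals = [0.0] * len(targets)
    for _ in range(rounds):
        deltas = []
        for i, t in enumerate(targets):
            dy, dc = scaffold_local_update(x, c, c_locals[i], lambda y, t=t: y - t)
            c_locals[i] += dc
            deltas.append((dy, dc))
        n = len(deltas)
        x += sum(dy for dy, _ in deltas) / n  # server model update
        c += sum(dc for _, dc in deltas) / n  # server control variate update
    return x

# Two heterogeneous agents whose local optima are -1 and 3; the global optimum is 1.
x_final = run_scaffold(targets=[-1.0, 3.0])
```

In this toy setting each agent's corrected step behaves as if it followed the gradient of the combined objective, so the server model approaches the global optimum rather than drifting towards either local optimum.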
2.1 Personalization Using Weighted Update Aggregation
WAFFLE is essentially a personalized version of SCAFFOLD, where the goal is no longer to obtain a global model for all agents but a personalized model for one particular agent (Alice). The idea is to start from global training (SCAFFOLD) and to gradually move to local training.
SCAFFOLD computes the update for the server model ($\Delta x$) and for the control variate ($\Delta c$, the estimate of what the other agents are learning) by averaging each agent’s model update and local control variate update at each round [karimireddy2020scaffold, Algorithm 1, Line 16]. It is at this step that WAFFLE intervenes: instead of averaging all agents at each round, WAFFLE assigns a weight $w_i$ to each agent $i$ and computes $\Delta x$ and $\Delta c$ by a weighted combination as per eq. 1. The only change with respect to SCAFFOLD is the weight $w_i$, which replaces the uniform averaging factor:
$$(\Delta x, \Delta c) \leftarrow \sum_{i} w_i \, (\Delta y_i, \Delta c_i) \qquad (1)$$
Suppose we have $N$ agents. Then we recover SCAFFOLD (global training) by setting all weights equal to $1/N$. In contrast, we recover local training by setting all weights to $0$ except the weight of Alice, which is set to $1$.
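The two limiting cases described above can be checked with a short sketch of the weighted aggregation. The function name `aggregate` and the toy update vectors are assumptions; the same routine would be applied to the model updates and to the control variate updates.

```python
def aggregate(updates, weights):
    """Weighted combination of per-agent update vectors (one weight per agent)."""
    dim = len(updates[0])
    return [sum(w * u[k] for w, u in zip(weights, updates)) for k in range(dim)]

# Four hypothetical agents; agent 0 plays the role of Alice.
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
n = len(updates)

# Uniform weights recover the plain average used in global training.
global_update = aggregate(updates, [1.0 / n] * n)

# A weight of 1 for Alice and 0 for everyone else recovers local training.
local_update = aggregate(updates, [1.0, 0.0, 0.0, 0.0])
```

Any weight vector between these two extremes yields a partially personalized update.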
WAFFLE focuses on a smooth transition from global to local training. To do this, the weight of each agent is updated according to the degree of personalization desired for the round. Just as SCAFFOLD uses the control variates to converge to the global model for the union of all agents’ datasets, WAFFLE uses them to converge to a personalized model based on a weighted subset of agents (Figure 1). Each agent is included and weighted in this subset based on a measure of the degree of personalization for the round and whether their dataset is sufficiently similar to Alice’s, as measured by the distance between their gradients at each round.
For each agent, its weight at a given round depends on the distance between its gradient and Alice’s. As the method uses SGD (with stochastic gradients), the weights computed by WAFFLE are subject to stochastic noise. Indeed, at some rounds, some agents may have an atypical gradient, resulting in a bad weight even if their contribution is still needed for the global model. Therefore, we average the current weight with the last two weights to smooth out the random effect (lines 11 & 12 of Algorithm 1).
This averaging technique reduces the risk of the model being trapped in an undesirable local minimum. For example, if a given agent participates less in the global model than other agents for a given round, then the global model will be less likely to move in the direction of this agent’s local minimum. Thus, the gradient of the agent would likely be very different from Alice’s, causing the weight to tend to zero and allowing the model to move to another local minimum.
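The smoothing step can be sketched as a moving average over three values. We assume here that the current raw weight is averaged with the two previous raw weights (rather than with previously smoothed values); the class name `WeightSmoother` and the example sequence are hypothetical.

```python
from collections import deque

class WeightSmoother:
    """Average the current weight with up to `history` previous raw weights."""

    def __init__(self, history=2):
        self.past = deque(maxlen=history)  # keeps only the most recent raw weights

    def update(self, raw_weight):
        values = list(self.past) + [raw_weight]
        smoothed = sum(values) / len(values)
        self.past.append(raw_weight)
        return smoothed

smoother = WeightSmoother()
# The third round has an atypical gradient and therefore a raw weight of 0.
smoothed = [smoother.update(w) for w in [0.9, 0.6, 0.0, 0.6]]
```

In this example the atypical round still receives a smoothed weight of about 0.5 instead of 0, so a single noisy gradient does not immediately exclude the agent.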
To compare each agent with Alice, we use the Euclidean distance between gradients. Simply put, we want an agent with a small distance to have a higher weight than an agent with a larger distance. Like SCAFFOLD, we have a central server that coordinates all the agents. At the start of each round, it gives the current personalized model and control variate to all agents for training. When the agents finish training the model with their data, they send the updated model and their local control variate back to the server. The server then calculates the Euclidean distance of each update from that of the selected agent and modifies the weight assigned to each agent accordingly. Finally, it calculates the new personalized model and the new control variate (eq. 1). The implementation of WAFFLE can be seen in the modified version of SCAFFOLD (Algorithm 2), where the code in red is the modification applied to the original code. The algorithm of CalcWeight can be seen in Algorithm 1 and will be explained in the next subsection.
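The distance computation performed by the server can be sketched as follows, assuming each agent's update has been flattened into a single vector. The function name `distances_to_alice` and the toy vectors are assumptions.

```python
import math

def distances_to_alice(updates, alice=0):
    """Euclidean distance between every agent's update and Alice's update."""
    reference = updates[alice]
    return [math.dist(update, reference) for update in updates]

# Three hypothetical agents with two-dimensional updates; agent 0 is Alice.
updates = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
distances = distances_to_alice(updates)
```

Alice's own distance is always zero, which is why the weight computation in the next subsection treats her distance separately.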
2.2 WAFFLE Agent Selection Weights
WAFFLE uses a new data-dependent approach to define the contribution weight of each agent at each round. It bases the weight on two hyperparameters that are both functions of the round and define the degree of personalization for that round (where a value of 1 corresponds to global training and a value of 0 to personalized (local) training).
The weight is calculated according to two formulas, where
agent is Alice
is the distance between the gradients of agents and
is the largest such distance:
Analogously, is the smallest such distance (excluding )
and denote the degree of personalization for round
The weight of an agent is based on the distance between its gradient and the gradient of agent , but also on the distribution of the distances of the agents. The distance of would always be 0, we want to change it to a similar value compared to the distances of other agents. The further away Alice’s distance is from the others, the less weight we assume the agents have compared to Alice. Equation 2 serves to increase the distance of Alice based on the hyperparameter . Indeed, when is equal to one it will update the distance value assigned to Alice () close to the second smallest distance . The more decreases towards 0 the further away Alice will be.
Equation 3 serves as a threshold of inter-agent utility. To pass the threshold, the gradient of the agent must be closer than the other agents based on the range of all distances. With a low only a small fraction of the range between the minimum and maximum distance will map to a weight greater than 0. Thus, learning is concentrated in a subset of agents with the most similar gradients.
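The thresholding behaviour can be illustrated with a simple range-based mapping. This is a hypothetical stand-in and not the paper's eq. 3: we assume a parameter `p` in (0, 1] that selects the fraction of the range between the minimum and maximum distance that maps to a positive weight, with the weight decreasing linearly inside that fraction.

```python
def threshold_weights(distances, p):
    """Map distances to weights; only the closest fraction p of the range gets weight > 0."""
    d_min, d_max = min(distances), max(distances)
    span = p * (d_max - d_min)
    if span <= 0:
        # Degenerate case: only agents at the minimum distance contribute.
        return [1.0 if d == d_min else 0.0 for d in distances]
    return [max(0.0, 1.0 - (d - d_min) / span) for d in distances]

# Hypothetical distances of four agents to Alice's gradient.
distances = [1.0, 2.0, 3.0, 5.0]
broad = threshold_weights(distances, p=1.0)
narrow = threshold_weights(distances, p=0.25)
```

With `p=1.0` three of the four agents receive a positive weight, whereas with `p=0.25` only the closest agent does, which mirrors how a low value concentrates learning in the subset of agents with the most similar gradients.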
Using these two functions, we can define different strategies for WAFFLE. As an example, we could choose a function that shifts rapidly toward zero and thus drastically limits larger steps in global training in the initial phase of personalization.
As Algorithm 1 would otherwise have 2R positive real-valued parameters to tune (one value of each function for each of the R rounds), we restrict ourselves to a simple 1-parameter function of the round. However, more strategies should be investigated to see the full potential of WAFFLE in different scenarios.
In this paper, the strategy used is based on a sigmoid function. It was designed to give the general training a gradual slope towards smaller values (eq. 4 and fig. 8). The function has a parameter that controls the slope: a higher value steepens the slope and thus moves faster towards local training. According to our experiments, a good value for this parameter is around 3.2, irrespective of the model used. With this value, the function achieves local learning after 70-90% of the total number of rounds, depending on the similarity of the agents’ gradients to that of Alice. A simple modification to delay or accelerate the transition to local learning is to move the function horizontally by adding a value to r.
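The general shape of such a schedule can be sketched with a decreasing sigmoid. This is an illustrative form and not the paper's eq. 4: the normalization of the round index, the function name `personalization_schedule`, and the `shift` argument are assumptions, and the sketch makes no claim about the round at which fully local learning is reached.

```python
import math

def personalization_schedule(r, total_rounds, slope=3.2, shift=0.0):
    """Decreasing sigmoid in the round r: near 1 (global) early, near 0 (local) late."""
    # Map the (shifted) round index to [-1, 1] and scale it by the slope parameter.
    z = slope * (2.0 * (r + shift) / total_rounds - 1.0)
    return 1.0 / (1.0 + math.exp(z))

total_rounds = 100
schedule = [personalization_schedule(r, total_rounds) for r in range(total_rounds + 1)]
# Adding a positive value to r moves the curve left, so local training starts earlier.
shifted = [personalization_schedule(r, total_rounds, shift=20) for r in range(total_rounds + 1)]
```

A larger `slope` makes the transition sharper around the midpoint, while `shift` moves the whole curve horizontally without changing its shape.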
3 Benchmarking Personalized Collaborative Learning Methods
To compare the performance of WAFFLE with the state of the art, we evaluate it against two recent personalized federated learning methods, APFL [deng2020APFL] and Weight Erosion [grimberg2020weight], as well as two global methods, Federated Averaging [mcmahan2017fedavg] and SCAFFOLD [karimireddy2020scaffold], and finally, local training. SCAFFOLD and Weight Erosion were selected to validate whether WAFFLE outperforms the methods it builds upon. We included APFL in the benchmark as an unrelated personalized FL method which has an open-source implementation. The benchmark consists of training standard image classification networks on standard IID and non-IID image datasets described below.
Two 10-class image classification tasks are used based on either the MNIST or CIFAR10 datasets. To generate confidence estimates, we run the MNIST benchmark with five different seeds. However, running the CIFAR10 benchmark with five seeds would be prohibitive in the face of computational limitations.
3.1 Distributions and Models
We tested five distributions (A, B, C, A*, B*). The first one (A) is IID and serves as a baseline. A classic label skew distribution (B) was selected for consistency with several other benchmarks [Collins2021, deng2020APFL, McMahan2016], where only certain agents have a given label represented (uniformly distributed amongst them). Distribution C is a more experimental version of this scenario where the labels are not uniformly distributed among the agents, e.g. each agent will have 40% of one label, 20% of two labels, and 10% of two other labels. In addition, we present two other distributions: concept shift on the IID distribution (A*) and concept shift on the classic label skew (B*). The concept shift is achieved by randomly swapping the class labels of all agents, except for Alice. We define a distribution as a list of ten numbers, the sum of which is equal to one (see Figure 2). For each agent, we shift the list to the right to get non-IID data, e.g. with distribution B, agent 0 will only have labels 0 to 3, agent 1 labels 1 to 4, and so on until agent 9 with labels 9 and 0 to 2.
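The per-agent shift of the label distribution can be sketched as a circular shift of the ten-element list. We assume that distribution B assigns an equal share of 0.25 to each of an agent's four labels; the names `DIST_B` and `agent_distribution` are hypothetical.

```python
def agent_distribution(base, agent_id):
    """Circularly shift the base label distribution to the right by agent_id positions."""
    n = len(base)
    return [base[(label - agent_id) % n] for label in range(n)]

def present_labels(distribution):
    """Labels that have a non-zero share in the given distribution."""
    return [label for label, share in enumerate(distribution) if share > 0]

# Assumed form of distribution B: four labels with equal shares, six absent labels.
DIST_B = [0.25] * 4 + [0.0] * 6

labels_agent_0 = present_labels(agent_distribution(DIST_B, 0))
labels_agent_1 = present_labels(agent_distribution(DIST_B, 1))
labels_agent_9 = present_labels(agent_distribution(DIST_B, 9))
```

The wrap-around of the shift is what gives agent 9 the labels 9 and 0 to 2.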
For MNIST we use LeNet-5, a simple but well-known model proposed in [LeCun1989], and for CIFAR10 we use DLA from [Yu2017].
We use an SGD optimizer with a learning rate of 0.1 for MNIST and 0.01 for CIFAR10. Following [Yu2017], we add a momentum of 0.9 and a weight decay of 5e-4 for CIFAR10. The batch size is set to 32 for MNIST and 128 for CIFAR10. Except for the APFL method, we define the hyperparameters in a principled manner (based on the expected behaviour of the algorithm). As Weight Erosion and WAFFLE use a weight system, we attempt to set the parameters in such a way that both algorithms reach fully local learning at about the same time. In practice, this was difficult to achieve as some hyperparameters are very sensitive. Fully local learning is obtained when all weights reach zero except for Alice (agent 0), who has a weight of one. Figure 3 shows an example of the evolution of the weights. Alice (agent 0) is placed in the centre. For the hyperparameters of APFL, we use the same method mentioned in the paper by [deng2020APFL], and adaptively update the hyperparameter to guarantee the best generalization performance.
The tuned hyperparameters are listed in Table 1. Of the two weight-based methods, WAFFLE is the easier to parametrize: its hyperparameter does not change between experiments, in contrast to the weighting system used by Weight Erosion, which requires hyperparameter tuning to perform well. As explained in Section 2.2, we purposefully restrict WAFFLE to a single hyperparameter (eq. 4), and we use the same value for all experiments. Nevertheless, we show in the results section that WAFFLE is able to obtain very good accuracy even without tuning.
We present the results of our benchmarking in Table 2 below. For each method and each distribution, it shows the best accuracy reached at any of the 100 epochs. In the MNIST section, results are averaged across five random seeds and listed along with the standard deviation.
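The way such a table entry can be computed is sketched below. We assume that the best accuracy is taken per seed and then averaged, and that the sample standard deviation is reported; neither detail is stated in the text. The accuracy curves are made-up placeholder numbers, not results from this paper.

```python
import statistics

def summarize(curves_by_seed):
    """Best accuracy over epochs for each seed, then mean and sample standard deviation."""
    best = [max(curve) for curve in curves_by_seed.values()]
    mean = statistics.mean(best)
    std = statistics.stdev(best) if len(best) > 1 else 0.0
    return mean, std

# Placeholder per-epoch accuracies for three seeds (illustrative values only).
curves_by_seed = {
    0: [0.5, 0.7, 0.6],
    1: [0.4, 0.8, 0.75],
    2: [0.6, 0.9, 0.85],
}
mean_best, std_best = summarize(curves_by_seed)
```

With a single seed, as in the CIFAR10 benchmark, no standard deviation can be estimated and only the best accuracy itself is available.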
On MNIST, WAFFLE and Weight Erosion outperform the other methods for the label skews (B, C), but not for the IID distribution (A). The difference between WAFFLE and Weight Erosion resides in the speed of convergence: as expected, WAFFLE converges faster thanks to the use of control variates.
For the concept shift distributions (A*, B*), all personalized FL methods improve accuracy compared to global FL methods. Among all personalized FL methods, Weight Erosion performs slightly better in this setting. Here, WAFFLE is not more efficient than Weight Erosion in terms of the time taken to reach convergence. This may be related to the fact that only a small fraction of agents is useful in this distribution, so until the weights concentrate on these agents, both algorithms are bound to the same slow increase in accuracy. Regarding APFL, the method also performs better than global FL methods except for the IID distribution (A), but it struggles to achieve a better result than the other two personalized FL methods and local training.
Concerning CIFAR10, the lack of additional seeds makes us less confident in the interpretation of our results; however, the results are coherent with our expectations. For example, as with MNIST, global FL methods outperform personalized methods on the IID distribution and present a significant improvement compared to local training. Distributions B and C (label skew) produce results close to those of MNIST, where WAFFLE and Weight Erosion outperform the others. Concerning concept shift, personalized FL methods are still a great improvement compared to global FL or local training. The accuracy of WAFFLE and Weight Erosion is approximately the same and, as for MNIST, there is no noticeable difference in the speed of convergence to the best accuracy. For APFL, we get similar results as on MNIST: APFL performs better than the global FL methods, and this time it also outperforms local training. However, it performs much worse than the other two personalized methods.
5 Limitations and Future Work
The benchmark has a few limitations that should be addressed in future work.
- Non-IID distributions.
This work only focuses on certain categories of non-identical data, which limits our evaluation of WAFFLE. Further categories such as prior probability shift, concept drift, and covariate shift should be tested.
- Dataset selection.
The benchmark could also have been more realistic with datasets created for this purpose, like FEMNIST, or with real-world datasets that would benefit from private personalized learning, such as medical images.
- Hyperparameters.
Hyperparameters were also a limitation, as some can be very sensitive (Weight Erosion and WAFFLE), and more rigorous testing methods are necessary to find the optimal settings.
- Computational cost.
A major limitation was the number of runs we were able to perform due to computational time limitations, especially for CIFAR10. This only allowed us to report the results of one seed which reduces the confidence of the results.
- Personalization functions.
The WAFFLE benchmark is also limited by the two personalization functions we chose. Both of these functions were designed to offer versatility in WAFFLE, so that they could be optimized for various contexts. Depending on the category of non-identical data, it might be that some types of functions give better results in general. Therefore, more research should be done to find better functions depending on the problem.
WAFFLE must still be benchmarked against more personalized FL methods. This work only shows two methods (Weight Erosion and APFL). However, the performance of APFL was lower than anticipated. It is possible that this is because APFL adapts poorly to the tasks proposed in our setting and further experimentation is recommended.
Finally, we assume honest participation of agents, which may not be the case in reality. While tests should be done to verify the robustness of WAFFLE against a data poisoning attack, we expect that WAFFLE is not vulnerable by design, as the weighting mechanism should mitigate such attacks by ignoring malicious participants.
In this work, we propose WAFFLE, a personalized federated learning algorithm based on SCAFFOLD that can overcome client drift and accelerate convergence to an optimal local model. WAFFLE uses the Euclidean distance between agent updates to weigh their contributions and thus move from a global model to a personalized one. We explored the performance of WAFFLE in two cases of non-identical data (concept shift and label skew) on two standard image datasets (MNIST and CIFAR10). We demonstrate that in most cases WAFFLE shows a faster convergence and a possible improvement in accuracy compared with other personalized FL methods. We also demonstrate that WAFFLE is easier to handle than most personalized learning methods. Indeed, the same (fixed) default value for its single hyperparameter yielded competitive results in all test cases. However, by design, WAFFLE can be adapted to a task by modifying its two personalization functions and thus potentially obtain a better accuracy.
We thank David Roschewitz for sharing with us his implementation of APFL, and Freya Behrens for inspiring us on how to create synthetic concept shift. We are grateful to Celiane De Luca for her suggestions and comments on the writing of this paper. We thank Vincent Yuan and Damien Gengler for their work on an interdisciplinary machine learning project where they, along with Martin Beaussart, adapted Weight Erosion to a neural network for the first time, leading to this work.
Appendix A Appendix