Decentralized federated learning of deep neural networks on non-iid data

by   Noa Onoszko, et al.

We tackle the non-convex problem of learning a personalized deep learning model in a decentralized setting. More specifically, we study decentralized federated learning, a peer-to-peer setting where data is distributed among many clients and where there is no central server to orchestrate the training. In real-world scenarios, the data distributions are often heterogeneous between clients. Therefore, in this work we study the problem of how to efficiently learn a model in a peer-to-peer system with non-iid client data. We propose a method named Performance-Based Neighbor Selection (PENS), where clients with similar data distributions detect each other and cooperate by evaluating their training losses on each other's data, in order to learn a model suited to the local data distribution. Our experiments on benchmark datasets show that the proposed method achieves higher accuracies than strong baselines.




1 Introduction

Federated learning (FL) (McMahan et al., 2017) is a framework developed to enable learning when data is distributed over several devices or across organizations, typically referred to as nodes or clients. In this framework, the training data never leaves the client, and all computations using the data are performed locally. This is especially useful when data privacy is important, or when collecting and storing data centrally is expensive.

Federated learning can be grouped into one of two categories: centralized and decentralized. In centralized FL, a central server orchestrates the learning among clients and is responsible for parameter aggregation, after receiving parameter updates from clients. However, the central server in an FL setup is a potential point of weakness: it could fail or be maliciously attacked, which would make the distributed learning fail. Decentralized (peer-to-peer) systems without a central server are not vulnerable to this.

In decentralized federated learning, no global model state exists. Instead, the participating clients follow a communication protocol to reach a consensus of a model during training. Standard techniques for decentralized learning include gradient-based algorithms based on gossip learning (Boyd et al., 2006; Jelasity et al., 2007; Ormándi et al., 2013), where clients train their own model based on local data and follow a communication protocol where they randomly communicate (gossip) their model parameters with their neighbors. The goal for the participating clients is to reach a consensus on a good model. In this work, we focus on gradient-based learning algorithms in a decentralized federated setup.

Both centralized and decentralized federated learning approach the important question of how to learn a suitable personalized model when client data distributions differ, i.e. the setting of non-iid data. A lot of research is currently being done regarding this topic in the centralized setting. Meanwhile, this is a relatively understudied problem in the decentralized setting (Kairouz et al., 2019).

The main contribution of this paper is a novel, completely decentralized, federated algorithm for gradient-based methods when client data is non-iid: Performance-Based Neighbor Selection (Pens). In Pens, clients with similar data distributions have a higher probability of collaborating, and those with dissimilar data distributions have a lower probability of collaborating. We perform multiple experiments on different non-convex optimization problems using deep neural networks, and our results show that using Pens leads to higher performance than all considered baselines.

2 Related work

Gossip learning. Gossip learning has been applied in many different machine learning settings (Kempe et al., 2003; Boyd et al., 2006; Ormándi et al., 2013). However, much of the previous work on gossip learning has been limited to settings where each client only stores a single data point. Further, it is under-explored how non-convex optimization of neural networks behaves under the gossip learning protocol. In (Giaretta and Girdzijauskas, 2019), the authors study the performance of SVMs and linear regression models on non-iid data in a decentralized gossip learning setup. In (Hegedűs et al., 2019), the authors train and evaluate logistic regression models and compare gossip learning to federated learning with a central server. A gossip-based algorithm for strongly convex functions has been studied in (Koloskova et al., 2019), where the authors prove that their proposed algorithm is linearly convergent with quantized communication.

The first decentralized work on gossip-based optimization for non-convex deep learning studied CNNs and experimentally showed that an asynchronous and decentralized framework achieved high accuracies with low communication costs (Blot et al., 2016). Training CNNs in a decentralized federated learning setting has also been applied for segmentation of brain images (Roy et al., 2019).

Some recent work study communication costs (peer-to-peer communication) in non-convex optimization for different types of network topologies (Assran et al., 2019; Wang et al., 2019a).

In (Kong et al., 2021), the authors identify the changing consensus distance between clients as key to explain the gap between centralized and decentralized training and focus on non-convex optimization.

Non-iid data.

All aforementioned works solve important problems. However, a key assumption is made in these studies: that data is independently and identically distributed (iid) over clients. The problem of non-iid data is increasingly studied in the case of centralized federated learning. In this setting, solutions for skewed data distributions have been explored in many different ways, including fine-tuning a global model locally (Wang et al., 2019b), posing the personalization problem in FL as a meta-learning objective (Jiang et al., 2019), using knowledge distillation techniques (Jeong et al., 2018), mixing local and global models (Deng et al., 2020; Listo Zec et al., 2020), and using data-sharing methods (Zhao et al., 2018). Meanwhile, similar techniques have not yet been widely applied and researched in decentralized federated learning.

In (Ghosh et al., 2020), the authors study the problem of covariate shift, similar to us, but in a centralized federated learning setup. In their paper, they develop a client clustering framework to learn one global model per cluster with a central parameter server.

Label distribution shift for decentralized deep learning has been studied in (Niwa et al., 2020). In this work, the authors propose to solve a dual problem that seeks to minimize a linearly constrained cost function. By solving a constrained optimization problem, their method achieves similar models among clients in the non-iid data setting.

In this work, we continue the research on the effectiveness of deep neural networks in decentralized peer-to-peer networks where data is non-iid. More specifically, we study the problem of covariate shift, where the marginal distribution p(x) over input features varies between clients, but the conditional distribution p(y|x) is the same for all clients.

3 Problem formulation

We formulate the problem as an empirical risk minimization (ERM) problem, as commonly used in statistical learning setups. The goal is to learn weights w for a model by optimizing some loss over data. In a decentralized setting, we have K clients that are able to communicate with neighboring clients in a communication network. We assume that each client k has a data distribution D_k over input features x and labels y.

Let f(w; x, y) be the loss as a function of the model parameters w and a data point (x, y). Thus, the aim of the optimization for each client k is to minimize

    F_k(w) = E_{(x,y) ~ D_k} [ f(w; x, y) ]
In this work, we study the problem of covariate shift. To do this, we create different distributions for each image dataset by rotating the images by d degrees. D_0 is defined as a dataset where images have been rotated by 0 degrees, and D_180 as one where images have been rotated by 180 degrees.

We perform experiments on two and four different data distributions, with rotations d in {0, 180} or d in {0, 90, 180, 270}. The train and test sets for the studied datasets are randomly split into one equally large partition for each value of d, and a rotation of d degrees is applied to each partition, thus creating D_d. Each client is then populated with training samples drawn uniformly from one such data distribution. Since the labels are unchanged after rotation, the marginal distributions over input features differ between these groups, but the conditional distributions are the same for all clients. This way of creating different client distributions has previously been used in (Ghosh et al., 2020) for centralized federated learning.
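The partitioning scheme above can be sketched in a few lines. This is our own illustration, not the authors' code: the function name, synthetic arrays, and all parameter names are ours, and rotations are restricted to multiples of 90 degrees so that `np.rot90` applies.

```python
# Sketch (our own, not the authors' code) of the rotation-based client
# partitioning: split the data into one equal partition per angle, rotate
# each partition, then give every client samples from a single partition.
import numpy as np

def make_rotated_client_data(images, labels, angles, clients_per_angle,
                             samples_per_client, seed=0):
    """Return {client_id: (x, y, angle)}; labels are never changed, so only
    the marginal distribution over inputs differs between angle groups."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    parts = np.array_split(idx, len(angles))  # one equal partition per angle
    clients, cid = {}, 0
    for angle, part in zip(angles, parts):
        # angles must be multiples of 90 for np.rot90
        x_part = np.rot90(images[part], k=angle // 90, axes=(1, 2))
        y_part = labels[part]
        for _ in range(clients_per_angle):
            take = rng.choice(len(x_part), size=samples_per_client,
                              replace=False)
            clients[cid] = (x_part[take], y_part[take], angle)
            cid += 1
    return clients

# toy usage with synthetic 8x8 "images" and the paper's two-rotation case
images = np.arange(400 * 8 * 8, dtype=float).reshape(400, 8, 8)
labels = np.arange(400) % 10
clients = make_rotated_client_data(images, labels, angles=[0, 180],
                                   clients_per_angle=5, samples_per_client=20)
```

Because the labels pass through unchanged, each client's p(y|x) is identical while p(x) differs between the angle groups, matching the covariate-shift setup described above.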

The main challenge of this paper is that we assume the distribution D_k is unknown for each client k, and our goal is to design an algorithm that can both identify clients with similar distributions and at the same time perform distributed optimization.

4 Algorithm

In our decentralized peer-to-peer network, we use the gossip protocol for communication between clients. Below we describe the random gossip baseline, and our proposed extension pens.

4.1 Random gossip communication

In this framework, each client k starts with a randomly initialized model that is updated using stochastic gradient descent (SGD) on the local client data for E local epochs. The model parameters w_k of client k are then, at a random time, communicated to a randomly chosen neighboring client j in the network. This action is denoted Send. Client j then waits for M models before it aggregates its own current model with the received ones using a simple average: w_j ← (1 / (M + 1)) (w_j + sum over the M received w_i). This is the same aggregation method as commonly used in centralized federated learning. The new aggregated model is then trained for E local epochs before it is ready to be gossiped again. A summary of the gossip learning protocol is presented in algorithm 1.

4.2 Pens: Performance-based neighbor selection

A problem that arises with random gossip when clients have different distributions is that if two clients with dissimilar distributions (i.e. D_i for client i and D_j ≠ D_i for client j) communicate, the performance of the learned model is usually negatively affected when their models are aggregated.

To solve this problem, we introduce Pens: Performance-based neighbor selection. Pens consists of two main steps. In the first step, the algorithm finds clients of similar marginal distributions to communicate with. In the second step, the random gossip protocol is followed for the subset of clients selected from the first step. Clients of similar distributions are found by evaluating sent models on the receiving client’s training data. The main idea of the proposed method is that the training loss of a sent client model is expected to be lower on the training set of a receiving client that has a similar data distribution, and a higher loss for those clients that have dissimilar distributions.

First, each client communicates randomly in the network for a pre-defined number of neighbor selection communication rounds T. At a random time, a client i performs the Send operation, after which the loss f(w_i; x_j, y_j) is calculated, where (x_j, y_j) denotes the training data of the receiving client j. This is the loss of client model i on the training set of client j. Each client waits for n models and saves a list of the losses. The top m best-performing (lowest loss) clients are selected as potential neighbors with similar data distributions, and their model parameters are aggregated into a new model. This is repeated for T rounds, after which the clients that were selected more often than expected (had the sampling of clients been uniform) are identified as neighbors with similar data distributions.

A set of neighbors with similar data distributions are now identified for every client and this constitutes step 1 of Pens. This is summarized in algorithm 2. In step 2 of Pens, the gossip learning protocol (algorithm 1) is used for the set of selected neighbors for each client.
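A single-process sketch of step 1, under our own simplifications (uniform peer sampling, a caller-supplied loss function standing in for model evaluation, and our parameter names n and m):

```python
# Sketch of PENS step 1 (our simplification): repeatedly sample n peers,
# keep the m whose models have the lowest loss on the local training data,
# and finally return the peers selected more often than uniform sampling
# would predict.
import random
from collections import Counter

def pens_select_neighbors(my_id, peer_ids, eval_loss, rounds, n, m, seed=0):
    rng = random.Random(seed)
    counts = Counter()
    others = [p for p in peer_ids if p != my_id]
    for _ in range(rounds):
        sampled = rng.sample(others, n)
        # lowest training loss on my data = most similar distribution
        counts.update(sorted(sampled, key=eval_loss)[:m])
    # expected number of selections per peer under uniform top-m choice
    expected = rounds * m / len(others)
    return [p for p, c in counts.items() if c > expected]

# toy usage: peers 0-9 share client 0's distribution (low loss on its data),
# peers 10-19 do not (high loss); eval_loss stands in for a model evaluation
loss = lambda p: 0.1 if p < 10 else 1.0
neighbors = pens_select_neighbors(0, range(20), loss, rounds=50, n=5, m=2)
```

Because similar peers win the top-m comparison almost every time they are sampled, their selection counts far exceed the uniform expectation, while dissimilar peers stay below it.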

1:  function MAIN
2:     while stopping criterion not met do
3:        WAIT(t)
4:        peer ← RANDOMPEER() // select random peer
5:        SEND(w, peer)
6:     end while
7:  end function
8:  function ONRECEIVEMODEL(w_received)
9:     SAVE(w_received)
10:     if no. of received models ≥ M then
11:        w ← AVERAGE(w, saved models) // simple average
12:        TRAIN(w) // update on local data
13:     end if
14:  end function
Algorithm 1 Gossip learning protocol
1:  function MAIN
2:     while stopping criterion not met do
3:        WAIT(t)
4:        peer ← RANDOMPEER() // select random peer
5:        SEND(w, peer)
6:     end while
7:     SELECTNEIGHBORS()
8:  end function
9:  function ONRECEIVEMODEL(w_received)
10:     loss ← EVALUATE(w_received) // loss on local training data
11:     SAVE(w_received, loss)
12:     if no. of received models ≥ n then
13:        MERGE(SELECT_TOP_M(saved models, m))
14:        TRAIN(w) // update on local data
15:        no. of received models ← 0 // reset
16:     end if
17:  end function
18:  function SELECTNEIGHBORS
19:     for all peers j do
20:        if merged with j more than expected then
21:           NEIGHBORLIST.append(j)
22:        end if
23:     end for
24:     return: NEIGHBORLIST
25:  end function
Algorithm 2 PENS step 1: find peers with similar data distributions

5 Experimental setup

In this work, we set out to develop an algorithm to solve the problem of non-iid data for gradient-based algorithms in a peer-to-peer network. To do this, we limit the experiments to a peer-to-peer network that is fully connected (i.e. all nodes in the network can communicate with each other). Further, we assume that all clients are able to communicate at any time. We simulated the peer-to-peer network on a computer, and all experiments were performed with an NVIDIA Tesla V100-SXM2-32GB GPU. Our code is publicly available on GitHub.

5.1 Datasets

Our experiments are performed on two datasets for visual classification: CIFAR-10 (Krizhevsky et al., 2009) and Fashion-MNIST (Xiao et al., 2017). The CIFAR-10 dataset consists of 60 000 32x32 color images in 10 classes, with 6000 images per class. The dataset is split into 50 000 training images and 10 000 test images. The Fashion-MNIST dataset contains 70 000 28x28 gray-scale images of Zalando clothing in 10 classes. It is split into 60 000 training images and 10 000 test images.

5.2 Models and hyperparameters

The CNN used in our experiments consists of three convolutional layers with ReLU activations (with 32 channels in the first layer and 64 channels in the last two layers), each followed by max pooling. This is followed by one fully connected layer with a ReLU activation and an output layer with a softmax activation. The size and architecture of this network is not state-of-the-art for visual classification tasks, but it has sufficient capacity for the comparison that we perform in our experiments. We use SGD as our optimizer. All hyperparameters, including the learning rate and the Pens parameters n and m, were tuned for all baselines, and the best ones were chosen with respect to a local validation set on each client.

5.3 Baselines

We compare our proposed algorithm Pens with two baselines: random gossip (Random) and locally trained models without communication (Local) for each client. We also report results for an Oracle, which for each client is given the information of which neighbors have the same data distribution and only communicates with these neighbors. Accuracy for a centrally trained model, denoted Central, is also presented, where we train one model on all data in a non-distributed fashion.

5.4 Evaluation

For testing of all algorithms, we measure test accuracy for each client on test data from the client's own distribution D_k. All reported accuracies are averaged over the clients of each distribution. We run 4 experiments for each algorithm, with different random seeds, and report the average and a confidence interval. For step 1 of Pens, we let clients communicate for T neighbor selection rounds. During step 2 of Pens, Random, and Oracle, we perform early stopping on local validation data for each client between client communication rounds. The communication is stopped when the validation loss has converged or when a maximum number of communication rounds is reached. The validation sets consist of 100 sample points per client in all experiments.

6 Results and discussion

In table 1, results on CIFAR-10 are shown for the setting with two rotated distributions. Accuracies are reported both for independent and common weight initialization of the client models. In centralized federated learning, it is known that a common initialization of client models is important for federated averaging to work (McMahan et al., 2017). Meanwhile, our results suggest that a common initialization is not necessary for the different algorithms to reach a high accuracy in decentralized federated learning. Further, the proposed method Pens achieves an accuracy that is higher than all baselines and close to the performance of Oracle, which has perfect information of the client data distributions. In table 2 we see that Pens also outperforms the baselines in the case of four rotated distributions.

In table 3, accuracies for all algorithms on Fashion-MNIST are presented for 100 and 500 training samples per client, with the number of clients set to 100. Although the differences in test accuracy between the baselines are smaller than on CIFAR-10 (since Fashion-MNIST is an easier problem), our proposed method Pens outperforms both baselines in this setting as well.

(a) Number of clients fixed to 100, training set size per client varying.
(b) Training set size per client fixed to 150, number of clients varying.
Figure 3: Test accuracy on CIFAR-10 as a function of (a) training samples per client and (b) number of clients, while fixing the other. Oracle has perfect information of client distributions, as opposed to the other methods.

6.1 Impact of training set size

In figure 3(a) we compare Pens to the baseline algorithms on the CIFAR-10 dataset in a setting where we fix the number of clients to 100 while varying the size of the local train sets. The results show that increasing the size of the local train set on each client increases performance for all compared algorithms. We further see that Pens consistently outperforms both baselines. In a low-data setting, with 100 training samples per client, Pens is closer to Random in performance than to Oracle. However, when the training set size increases, the gap between Random and Pens grows, as Pens manages to find the correct neighbors with similar data distributions for each client. This is further visualized in figure 7, where a heatmap of the communication pattern is plotted for Oracle, Pens, and Random. There we see that, for each client, Pens manages to find almost all clients with similar data distributions.

In table 5 we report precision (the fraction of selected clients that have the same distribution) and recall (the fraction of clients with the same distribution that were selected) for the peers that Pens selects for each client. Here we report experiments with 100 clients for different numbers of training samples per client. We note that the precision of our method is very high and robust to the number of training samples per client. The recall is lower than the precision, but also robust to the number of training samples.

6.2 Impact of number of clients

Figure 3(b) shows results for experiments on CIFAR-10 where the number of samples per client is fixed to 250 (150 training and 100 validation samples), but with a varying number of clients in the peer-to-peer network. Since no communication is allowed for the Local baseline, its performance is constant with respect to the number of clients. Meanwhile, for the other methods, we see that adding more clients (and thereby increasing the total amount of data in the system) increases performance. Further, our proposed method Pens consistently outperforms the random baseline.

Method Acc. (independent) Acc. (common)
Table 1: Test accuracy reported on CIFAR-10 with independent and common model weight initialization. 100 clients, 400 training samples per client.

6.3 Impact of sample size n and top performers m

The parameter n in step 1 of Pens decides for each client how many other client models to sample at every communication round, and m decides how many of the top-performing (lowest loss) models to merge with. Experiments were carried out to study how sensitive Pens is to the choice of these hyperparameters. In table 4 we summarize the results for varying values of these hyperparameters in the setting of 100 clients and 400 training samples per client for CIFAR-10. We note that the test accuracies are relatively stable for different values of n and m. Further, our results suggest that the ratio n/m should not be too large, i.e. if n is increased, m should be increased as well. We have noticed in our experiments that if m is set too low relative to n, Pens collapses into always choosing the same few clients and fails to find other peers of the same distribution.
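The "selected more often than expected" threshold from step 1 can be made concrete with a small calculation (the numbers here are ours, chosen for illustration):

```python
# If top-m selection were uniform over the other clients, a given peer would
# be selected about rounds * m / (clients - 1) times over all rounds; peers
# selected more often than this are kept as neighbors.
rounds, clients, m = 100, 100, 2
expected = rounds * m / (clients - 1)
print(round(expected, 2))  # -> 2.02
```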

Method Accuracy (%)
Table 2: Test accuracies on CIFAR-10 with four rotated distributions. 100 clients and 400 training samples per client.
Method 100 500
Table 3: Test accuracy reported on Fashion-MNIST for 100 clients with 100 and 500 training samples per client.
n   m   n/m   Accuracy (%)
20  6   3.3
50  15  3.3
10  2   5.0
10  3   3.3
50  10  5.0
20  4   5.0
10  5   2.0
5   2   2.5
10  1   10.0
5   1   5.0
20  10  2.0
20  2   10.0
5   3   1.7
50  25  2.0
50  5   10.0
Table 4: Accuracies for varying n and m on CIFAR-10 for 100 clients with 150 training samples per client.
Train set size Precision (%) Recall (%)
Table 5: Precision and recall for 100 clients for different number of training samples and .
(a) Oracle
(b) pens
(c) Random
Figure 7: Heatmaps of the communication pattern between clients on CIFAR-10 with two rotated distributions for (a) Oracle, (b) Pens and (c) Random. The reported values are normalized counts: a value close to 1 means that client i received a model from client j frequently. The clients are sorted so that clients with ID 0-99 have one of the two data distributions and clients with ID 100-199 have the other.

7 Future work

There are several interesting research directions left to explore that we could not include in this paper. First, we assumed that all clients are able to communicate equally fast and at all times. This is a strong assumption that does not hold in many real-world applications, and it would therefore be interesting to study system heterogeneity in a decentralized network where clients have different hardware and computational budgets. Second, we assumed that the decentralized network topology is fully connected. As future work, it would be interesting to study how Pens performs on other types of network topologies and how these affect the learning among clients.

8 Conclusions

In this work we have studied non-convex optimization for decentralized federated learning using deep neural networks in a non-iid data setting. Our experiments show that our proposed method Pens efficiently helps clients identify neighboring peers with similar data distributions in a fully decentralized FL setting, and in that way guides the learning of the clients to achieve high performance on non-iid data. Pens works by using the training loss to find clients in the network that share similar data distributions, which then focus on communicating with each other instead of with random neighbors. Our results (figures 3 and 7) suggest that, given enough training data per client, Pens reaches the same accuracy as an oracle that is given perfect information of the client data distributions. We have limited ourselves to the non-iid setting of covariate shift in this work. Meanwhile, we hypothesize that our proposed method also works well on other types of non-iid data, such as label distribution skew, concept shift or concept drift.


  • M. Assran, N. Loizou, N. Ballas, and M. Rabbat (2019) Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344–353. Cited by: §2.
  • M. Blot, D. Picard, M. Cord, and N. Thome (2016) Gossip training for deep learning. arXiv preprint arXiv:1611.09726. Cited by: §2.
  • S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah (2006) Randomized gossip algorithms. IEEE transactions on information theory 52 (6), pp. 2508–2530. Cited by: §1, §2.
  • Y. Deng, M. M. Kamani, and M. Mahdavi (2020) Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461. Cited by: §2.
  • A. Ghosh, J. Chung, D. Yin, and K. Ramchandran (2020) An efficient framework for clustered federated learning. arXiv preprint arXiv:2006.04088. Cited by: §2, §3.
  • L. Giaretta and Š. Girdzijauskas (2019) Gossip learning: off the beaten path. In 2019 IEEE International Conference on Big Data (Big Data), pp. 1117–1124. Cited by: §2.
  • I. Hegedűs, G. Danner, and M. Jelasity (2019) Gossip learning as a decentralized alternative to federated learning. In IFIP International Conference on Distributed Applications and Interoperable Systems, pp. 74–90. Cited by: §2.
  • M. Jelasity, S. Voulgaris, R. Guerraoui, A. Kermarrec, and M. Van Steen (2007) Gossip-based peer sampling. ACM Transactions on Computer Systems (TOCS) 25 (3), pp. 8–es. Cited by: §1.
  • E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S. Kim (2018) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479. Cited by: §2.
  • Y. Jiang, J. Konečnỳ, K. Rush, and S. Kannan (2019) Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488. Cited by: §2.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §1.
  • D. Kempe, A. Dobra, and J. Gehrke (2003) Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp. 482–491. Cited by: §2.
  • A. Koloskova, S. Stich, and M. Jaggi (2019) Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, pp. 3478–3487. Cited by: §2.
  • L. Kong, T. Lin, A. Koloskova, M. Jaggi, and S. U. Stich (2021) Consensus control for decentralized deep learning. arXiv preprint arXiv:2102.04828. Cited by: §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
  • E. Listo Zec, O. Mogren, J. Martinsson, L. R. Sütfeld, and D. Gillblad (2020) Specialized federated learning using a mixture of experts. arXiv preprint arXiv:2010.02056. Cited by: §2.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §1, §6.
  • K. Niwa, N. Harada, G. Zhang, and W. B. Kleijn (2020) Edge-consensus learning: deep learning on p2p networks with nonhomogeneous data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 668–678. Cited by: §2.
  • R. Ormándi, I. Hegedűs, and M. Jelasity (2013) Gossip learning with linear models on fully distributed data. Concurrency and Computation: Practice and Experience 25 (4), pp. 556–571. Cited by: §1, §2.
  • A. G. Roy, S. Siddiqui, S. Pölsterl, N. Navab, and C. Wachinger (2019) Braintorrent: a peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731. Cited by: §2.
  • J. Wang, A. K. Sahu, Z. Yang, G. Joshi, and S. Kar (2019a) MATCHA: speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp. 299–300. Cited by: §2.
  • K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage (2019b) Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252. Cited by: §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.1.
  • Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §2.