1 Introduction
Machine learning is currently shifting from a centralized paradigm, in which models are trained on data located on a single machine or in a data center, to decentralized ones. Indeed, the latter paradigm closely matches the natural data distribution in the numerous use cases where data is collected and processed by several independent parties (hospitals, companies, personal devices…). Federated Learning (FL) allows a set of participants to collaboratively train machine learning models on their joint data while keeping it where it has been produced. Not only does this avoid the costs of moving data, but it also mitigates privacy and confidentiality concerns [10]. Yet, working with natural data distributions introduces new challenges for learning systems, as local datasets reflect the usage and production patterns specific to each participant: they are not independent and identically distributed (nonIID). More specifically, the relative frequency of different classes of examples may vary significantly across local datasets [10, 8]. Therefore, one of the key challenges in FL is to design algorithms that can efficiently deal with such nonIID data distributions [10, 18, 11, 8].
Federated learning algorithms can be classified into two categories depending on the underlying network topology they run on. In serverbased FL, the network is organized according to a star topology: a central server orchestrates the training process by iteratively aggregating model updates received from the participants (clients) and sending back the aggregated model [23]. In contrast, fully decentralized FL algorithms operate over an arbitrary network topology where participants communicate only with their direct neighbors in the network. A classic example of such algorithms is Decentralized SGD (DSGD) [19], in which participants alternate between local SGD updates and model averaging with neighboring nodes.
In this paper, we focus on fully decentralized algorithms as they can generally scale better to the large number of participants seen in “crossdevice” applications [10]. Indeed, while a central server may quickly become a bottleneck as the number of participants increases, the topology used in fully decentralized algorithms can remain sparse enough that each participant only needs to communicate with a small number of other participants, i.e. nodes have small (constant or logarithmic) degree [19]. For IID data, recent work has shown both empirically [19, 20] and theoretically [25] that sparse topologies like rings or grids do not significantly affect the convergence speed compared to using denser topologies.
In contrast to the IID case however, our experiments demonstrate that the impact of topology is extremely significant for nonIID data. This phenomenon is illustrated in Figure 1: We observe that a ring or a grid topology clearly jeopardizes the convergence speed as local distributions do not have relative frequency of classes similar to the global distribution, i.e. they exhibit local class bias. We stress the fact that, unlike in centralized FL [10, 11, 8], this happens even when nodes perform a single local update before averaging the model with their neighbors. In this paper, we address the following question:
Can we design sparse topologies with convergence speed similar to the one obtained in a fully connected network under a large number of participants with local class bias?
Figure 1: IID vs nonIID convergence speed of decentralized SGD for logistic regression on MNIST for different topologies. Bold lines show the average test accuracy across nodes while thin lines show the minimum and maximum accuracy of individual nodes. While the effect of topology is negligible for IID data, it is very significant in the nonIID case. When fullyconnected, both cases converge similarly. See Section 2.2.2 for details on the experimental setup.
Specifically, we make the following contributions: (1) We propose DCliques, a sparse topology in which nodes are organized in interconnected cliques, i.e. locally fullyconnected sets of nodes, such that the joint data distribution of each clique is representative of the global (IID) distribution; (2) We propose Clique Averaging, a modified version of the standard DSGD algorithm which decouples gradient averaging, used for optimizing local models, from distributed averaging, used to ensure all models converge, therefore reducing the bias introduced by interclique connections; (3) We show how Clique Averaging can be used to implement unbiased momentum that would otherwise be detrimental in the nonIID setting; (4) We demonstrate through an extensive experimental study that our approach removes the effect of the local class bias on the MNIST [16] and CIFAR10 [14] datasets, for training a linear model and a deep convolutional network; (5) Finally, we demonstrate the scalability of our approach by considering up to 1000node networks, in contrast to most previous work on fully decentralized learning that considers only a few tens of nodes [29, 25, 21, 4, 13].
For instance, our results show that using DCliques in a 1000node network requires 98% fewer edges (18.9 vs. 999 edges per participant on average), thereby yielding a 96% reduction in the total number of required messages (37.8 messages per round per node on average instead of 999), to obtain a convergence speed similar to that of a fullyconnected topology. Furthermore, a smallworld interclique topology reduces the number of edges by an additional 22%, with further potential gains at larger scales because of its linearlogarithmic scaling.
The rest of this paper is organized as follows. We first present the problem statement and our methodology (Section 2). The DCliques design is presented in Section 3 along with an empirical illustration of its benefits. In Section 4, we show how to further reduce bias with Clique Averaging and how to use it to implement momentum. We present the results of our extensive experimental study in Section 5. We review some related work in Section 6, and conclude with promising directions for future work in Section 7.
2 Problem Statement
We consider a set $N = \{1, \dots, n\}$ of $n$ nodes seeking to collaboratively solve a classification task with $c$ classes. Each node $i$ has access to a local dataset that follows its own local distribution $D_i$. The goal is to find a global model $x$ that performs well on the union of the local distributions by minimizing the average training loss:
$$\min_{x} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{s_i \sim D_i} \left[ F_i(x; s_i) \right] \qquad (1)$$
where $s_i$ is a data example drawn from $D_i$ and $F_i$ is the loss function on node $i$. Therefore, $\mathbb{E}_{s_i \sim D_i} F_i(x; s_i)$ denotes the expected loss of model $x$ on a random example $s_i$ drawn from $D_i$.
To collaboratively solve Problem (1), each node can exchange messages with its neighbors in an undirected network graph $G(N, E)$ where $\{i, j\} \in E$ denotes an edge (communication channel) between nodes $i$ and $j$.
2.1 Training Algorithm
In this work, we use the popular Decentralized Stochastic Gradient Descent algorithm, aka DSGD [19]. As shown in Algorithm 1, a single iteration of DSGD at node $i$ consists of sampling a minibatch from its local distribution $D_i$, updating its local model $x_i$ by taking a stochastic gradient descent (SGD) step according to the minibatch, and performing a weighted average of its local model with those of its neighbors. This weighted average is defined by a mixing matrix $W$, in which $W_{ij}$ corresponds to the weight of the outgoing connection from node $i$ to $j$ and $W_{ij} = 0$ for $\{i, j\} \notin E$ with $i \neq j$. To ensure that the local models converge on average to a stationary point of Problem (1), $W$ must be doubly stochastic ($\sum_{j \in N} W_{ij} = 1$ and $\sum_{j \in N} W_{ji} = 1$) and symmetric, i.e. $W_{ij} = W_{ji}$ [19].
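The iteration described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `dsgd_step`, the synchronous matrix formulation, and the toy quadratic losses are assumptions made for the example, and exact gradients stand in for minibatch gradients.

```python
import numpy as np

def dsgd_step(models, grads, W, lr):
    """One synchronous D-SGD iteration for all nodes at once (illustrative).

    models: (n, d) array, row i is node i's local model.
    grads:  (n, d) array, row i is node i's gradient on its own minibatch.
    W:      (n, n) symmetric, doubly stochastic mixing matrix.
    lr:     learning rate.
    """
    half_step = models - lr * grads  # local SGD step on every node
    return W @ half_step             # weighted averaging with neighbors

# Toy usage: 3 nodes on a line graph 0 - 1 - 2, with node i minimizing
# 0.5 * (x - t_i)^2. The targets are hypothetical and differ across nodes
# to mimic heterogeneous local objectives.
targets = np.array([[0.0], [3.0], [6.0]])
W = np.array([[2 / 3, 1 / 3, 0.0],
              [1 / 3, 1 / 3, 1 / 3],
              [0.0, 1 / 3, 2 / 3]])
models = np.zeros((3, 1))
for _ in range(200):
    grads = models - targets  # exact gradient of each local quadratic
    models = dsgd_step(models, grads, W, lr=0.1)

print(models.mean())  # the network-wide average approaches the global optimum 3.0
```

Because the mixing matrix is doubly stochastic, the averaging step leaves the network-wide mean of the models unchanged, so the mean follows plain gradient descent on the average loss. Individual models still differ from that mean when local objectives differ.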
2.2 Methodology
2.2.1 NonIID assumptions.
As demonstrated in Figure 1, lifting the assumption of IID data significantly challenges the learning algorithm. In this paper, we focus on an extreme case of local class bias: we consider that each node only has examples from a single class.
To isolate the effect of local class bias from other potentially compounding factors, we make the following simplifying assumptions: (1) All classes are equally represented in the global dataset; (2) All classes are represented on the same number of nodes; (3) All nodes have the same number of examples.
We believe that these assumptions are reasonable in the context of our study because: (1) Global class imbalance equally affects the optimization process on a single node and is therefore not specific to the decentralized setting; (2) Our results do not exploit specific positions in the topology; (3) Imbalanced dataset sizes across nodes can be addressed for instance by appropriately weighting the individual loss functions. Our results can be extended to support additional compounding factors in future work.
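The extreme partition assumed here (one class per node, the same number of nodes per class, the same number of examples per node) can be generated as in the sketch below. The helper name `single_class_partition` and the synthetic label vector are hypothetical; the paper does not specify how its partition is implemented.

```python
import numpy as np

def single_class_partition(labels, n_nodes, seed=0):
    """Assign example indices to nodes so that each node holds one class only.

    Assumes balanced classes and n_nodes divisible by the number of classes,
    mirroring simplifying assumptions (1) to (3) above.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    nodes_per_class = n_nodes // len(classes)
    if nodes_per_class * len(classes) != n_nodes:
        raise ValueError("n_nodes must be a multiple of the number of classes")
    partition = []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        # Split the examples of class c evenly among the nodes of that class.
        partition.extend(np.array_split(idx, nodes_per_class))
    return partition

# Toy usage: 10 balanced classes, 50 examples each, spread over 100 nodes.
labels = np.repeat(np.arange(10), 50)
parts = single_class_partition(labels, n_nodes=100)
```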
2.2.2 Experimental setup.
Our main goal is to provide a fair comparison of the convergence speed across different topologies and algorithmic variations, in order to show that our approach can remove much of the effect of local class bias.
We experiment with two datasets: MNIST [16] and CIFAR10 [14], which both have 10 classes. For MNIST, we use 45k and 10k examples from the original 60k training set for training and validation respectively. The remaining 5k training examples were randomly removed to ensure all 10 classes are balanced while ensuring that the dataset is evenly divisible across 100 and 1000 nodes. We use all 10k examples of the test set to measure prediction accuracy. For CIFAR10, classes are evenly balanced: we use 45k/50k images of the original training set for training, 5k/50k for validation, and all 10k examples of the test set for measuring prediction accuracy.
We use a logistic regression classifier for MNIST, which provides up to 92.5% accuracy in the centralized setting. For CIFAR10, we use a GroupNormalized variant of LeNet [8], a deep convolutional network which achieves an accuracy of in the centralized setting. These models are thus reasonably accurate (which is sufficient to study the effect of the topology) while being sufficiently fast to train in a fully decentralized setting and simple enough to configure and analyze. Regarding hyperparameters, we jointly optimize the learning rate and minibatch size on the validation set for 100 nodes, obtaining respectively and for MNIST and and for CIFAR10. For CIFAR10, we additionally use a momentum of .
We evaluate 100 and 1000node networks by creating multiple models in memory and simulating the exchange of messages between nodes. To ignore the impact of distributed execution strategies and system optimization techniques, we report the test accuracy of all nodes (min, max, average) as a function of the number of times each example of the dataset has been sampled by a node, i.e. an epoch. This is equivalent to the classic case of a single node sampling the full distribution. To further make results comparable across different numbers of nodes, we lower the batch size proportionally to the number of nodes added, and inversely, e.g. on MNIST, 128 with 100 nodes vs. 13 with 1000 nodes. This ensures the same number of model updates and averaging steps per epoch, which is important to have a fair comparison. [Footnote 1: Updating and averaging models after every example can eliminate the impact of local class bias. However, the resulting communication overhead is impractical.]
Finally, we compare our results against an ideal baseline: either a fullyconnected network topology with the same number of nodes or a single IID node. In both cases, the topology has no effect on the optimization. For a certain choice of number of nodes and minibatch size, both approaches are equivalent.
3 DCliques: Creating Locally Representative Cliques
In this section, we present the design of DCliques. To give an intuition of our approach, let us consider the neighborhood of a single node in a grid similar to that of Figure 1(b), represented in Figure 2. The colors of a node represent the different classes present in its local dataset. In the IID setting (Figure 2(a)), each node has examples of all classes in equal proportions. In the nonIID setting (Figure 2(b)), each node has examples of only a single class and nodes are distributed randomly in the grid.
A single training step, from the point of view of the center node, is equivalent to sampling a minibatch five times larger from the union of the local distributions of all illustrated nodes. In the IID case, since gradients are computed from examples of all classes, the resulting averaged gradient points in a direction that tends to reduce the loss across all classes. In contrast, in the nonIID case, only a subset of classes are represented in the immediate neighborhood of the node, thus the gradients will be biased towards these classes. Importantly, as the distributed averaging algorithm takes several steps to converge, this variance persists across iterations as the locally computed gradients are far from the global average.
[Footnote 2: It is possible, but very costly, to mitigate this by performing a sufficiently large number of averaging steps between each gradient step. This can significantly slow down convergence speed to the point of making decentralized optimization impractical.]
In DCliques, we address the issues of nonIIDness by carefully designing a network topology composed of cliques and interclique connections:
• DCliques recover a balanced representation of classes, similar to that of the IID case, by constructing a topology such that each node is part of a clique with neighbors representing all classes.
• To ensure a global consensus and convergence, interclique connections are introduced by connecting a small number of node pairs that are part of different cliques.
In the following, we introduce up to one interclique connection per node such that each clique has exactly one edge with every other clique; see Figure 3(a) for the corresponding DCliques network in the case of $n = 100$ nodes and $c = 10$ classes. We will explore sparser interclique topologies in Section 5.2.
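The construction just described can be sketched for the single-class-per-node setting as follows. This is a simplified illustration and not the paper's Algorithm 3: the function name `build_dcliques`, the deterministic grouping of nodes, and the rule of attaching each interclique edge to the least-loaded node are all assumptions made for the example.

```python
from itertools import combinations

def build_dcliques(node_classes, n_classes):
    """Group nodes into cliques with one node per class, then connect cliques.

    node_classes[i] is the single class held by node i. Every class is assumed
    to appear on the same number of nodes. Returns (cliques, edges), where
    edges is a set of (u, v) pairs with u < v.
    """
    by_class = {c: [] for c in range(n_classes)}
    for node, c in enumerate(node_classes):
        by_class[c].append(node)
    n_cliques = len(node_classes) // n_classes

    # Clique k takes the k-th node of every class, so each clique sees all classes.
    cliques = [[by_class[c][k] for c in range(n_classes)] for k in range(n_cliques)]

    edges = set()
    for clique in cliques:
        edges.update(combinations(sorted(clique), 2))  # full intra-clique connectivity

    # One edge per pair of cliques, attached to the nodes that currently carry
    # the fewest interclique edges so that the load stays even.
    inter_degree = {node: 0 for node in range(len(node_classes))}
    for a, b in combinations(range(n_cliques), 2):
        u = min(cliques[a], key=lambda v: inter_degree[v])
        w = min(cliques[b], key=lambda v: inter_degree[v])
        edges.add((min(u, w), max(u, w)))
        inter_degree[u] += 1
        inter_degree[w] += 1
    return cliques, edges

# Toy usage: 100 nodes, 10 classes, node i holds class i % 10.
node_classes = [i % 10 for i in range(100)]
cliques, edges = build_dcliques(node_classes, n_classes=10)
print(len(edges), 2 * len(edges) / 100)  # total edges and average edges per node
```

With 100 nodes and 10 classes this yields 10 cliques, 450 intra-clique edges and 45 interclique edges, i.e. 9.9 edges per node on average, and no node carries more than one interclique edge.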
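The mixing matrix $W$ required by DSGD is obtained from standard Metropolis-Hastings weights [32] computed from the above topology, namely:
$$W_{ij} = \begin{cases} \dfrac{1}{\max(\deg(i), \deg(j)) + 1} & \text{if } i \neq j \text{ and } \{i, j\} \in E, \\ 1 - \sum_{j \neq i} W_{ij} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$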
We refer to Algorithm 3 in the appendix for a formal account of DCliques construction. We note that it only requires the knowledge of the local class distribution at each node. For the sake of simplicity, we assume that DCliques is constructed from the global knowledge of these distributions, which can easily be obtained by decentralized averaging in a preprocessing step.
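As an illustration of how a mixing matrix can be derived from a topology, the sketch below computes Metropolis-Hastings weights from an edge list, assuming the usual rule in which an edge between two nodes gets weight one over one plus the larger of their degrees, and each node's self-weight absorbs the remainder. The function name and the ring example are hypothetical.

```python
import numpy as np

def metropolis_hastings_weights(n, edges):
    """Build a symmetric, doubly stochastic mixing matrix from an edge list."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + 1)
    # The diagonal is still zero here, so the row sums are the off-diagonal mass.
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)
    return W

# Toy usage: a ring of 5 nodes, where every node has degree 2.
ring_edges = [(i, (i + 1) % 5) for i in range(5)]
W_ring = metropolis_hastings_weights(5, ring_edges)
```

On the ring every edge weight and every self-weight equals 1/3, and the matrix is symmetric with rows and columns summing to 1, as required for DSGD.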
The key idea of DCliques is that because the cliquelevel distribution is representative of the global distribution, the local models of nodes across cliques remain rather close. Therefore, a sparse interclique topology can be used, significantly reducing the total number of edges without slowing down the convergence. Furthermore, the degree of each node in the network remains low and even, making the DCliques topology very wellsuited to decentralized federated learning.
Figure 3(b) illustrates the performance of DCliques on MNIST with 100 nodes. Observe that the convergence speed is very close to that of a fullyconnected topology, and significantly better than with a ring or a grid (see Figure 1). With 100 nodes, it offers a reduction of approximately 90% in the number of edges compared to a fullyconnected topology (9.9 edges per node on average instead of 99). Nonetheless, there is still significant variance in the accuracy across nodes, which is due to the bias introduced by interclique edges. We address this issue in the next section.
4 Optimizing with Clique Averaging and Momentum
In this section, we present Clique Averaging. This feature, when added to DSGD, removes the bias caused by the interclique edges of DCliques. We also show how it can be used to successfully implement momentum for nonIID data.
4.1 Clique Averaging: Debiasing Gradients from InterClique Edges
While limiting the number of interclique connections reduces the number of messages traveling on the network, it also introduces its own bias. Figure 4 illustrates the problem on the simple case of two cliques connected by one interclique edge (here, between the green node of the left clique and the purple node of the right clique). Let us focus on node A. With weights computed as in (2) and cliques of 10 nodes, node A’s selfweight is $12/110$, the weight between A and the green node connected to B is $10/110$, and all other neighbors of A have a weight of $11/110$. Therefore, the gradient at A is biased towards its own class (purple) and against the green class. A similar bias holds for all other nodes without interclique edges with respect to their respective classes. For node B, all its edge weights (including its selfweight) are equal to $1/11$. However, the green class is represented twice (once as a clique neighbor and once from the interclique edge), while all other classes are represented only once. This biases the gradient toward the green class. The combined effect of these two sources of bias is to increase the variance of the local models across nodes.
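The two sources of bias can be checked with exact arithmetic. The sketch below assumes two cliques of 10 nodes joined by a single interclique edge and the usual Metropolis-Hastings rule (edge weight one over one plus the larger endpoint degree); the node labels `A`, `green_left` and `B` are hypothetical stand-ins for the nodes discussed above.

```python
from fractions import Fraction
from itertools import combinations

c = 10                        # assumed clique size (one node per class)
left = list(range(c))
right = list(range(c, 2 * c))
A, green_left, B = 0, 1, c    # hypothetical labels for the nodes in the text

edges = list(combinations(left, 2)) + list(combinations(right, 2))
edges.append((green_left, B))  # the single interclique edge

neighbors = {v: set() for v in left + right}
for i, j in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

def weight(i, j):
    """Metropolis-Hastings weight between i and j (self-weight when i == j)."""
    if i == j:
        return 1 - sum(weight(i, k) for k in neighbors[i])
    if j in neighbors[i]:
        return Fraction(1, max(len(neighbors[i]), len(neighbors[j])) + 1)
    return Fraction(0)

print(weight(A, A), weight(A, green_left), weight(A, 2))      # 6/55 1/11 1/10
print(weight(B, B), weight(B, green_left), weight(B, c + 1))  # 1/11 1/11 1/11
```

Expressed over a common denominator, node A weighs itself 12/110, the green node 10/110 and its other neighbors 11/110, while node B weighs all eleven contributions equally even though two of them come from green nodes.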
We address this problem by adding Clique Averaging to DSGD (Algorithm 2), which essentially decouples gradient averaging from model averaging. The idea is to use only the gradients of neighbors within the same clique to compute the average gradient, providing an equal representation to all classes. In contrast, all neighbors’ models, including those across interclique edges, participate in the model averaging step as in the original version.
As illustrated in Figure 5, this significantly reduces the variance of models across nodes and accelerates convergence to reach the same level as the one obtained with a fullyconnected topology. Note that Clique Averaging induces a small additional cost, as gradients and models need to be sent in two separate rounds of messages. Nonetheless, compared to fully connecting all nodes, the total number of messages is reduced by approximately 80% with 100 nodes.
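The decoupling of gradient averaging from model averaging can be sketched as follows. This is an illustrative matrix-form version rather than the paper's Algorithm 2; the function name, the `clique_of` array and the toy inputs are assumptions.

```python
import numpy as np

def clique_averaging_step(models, grads, W, clique_of, lr):
    """One D-SGD iteration with Clique Averaging (illustrative).

    Gradients are averaged uniformly within each node's own clique, whereas
    models are averaged with all neighbors through the mixing matrix W.
    clique_of[i] is the index of the clique that node i belongs to.
    """
    clique_of = np.asarray(clique_of)
    avg_grads = np.empty_like(grads)
    for i in range(len(models)):
        members = np.flatnonzero(clique_of == clique_of[i])  # includes node i
        avg_grads[i] = grads[members].mean(axis=0)
    half_step = models - lr * avg_grads  # gradient step with the clique average
    return W @ half_step                 # model averaging over all edges

# Toy usage: 4 nodes in 2 cliques; the identity mixing matrix isolates the
# gradient-averaging part of the update.
models = np.zeros((4, 1))
grads = np.array([[1.0], [3.0], [5.0], [7.0]])
new_models = clique_averaging_step(models, grads, np.eye(4), [0, 0, 1, 1], lr=1.0)
```

In the toy example the two nodes of each clique end up with the same update, since both use the clique-level mean gradient (2 for the first clique, 6 for the second).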
4.2 Implementing Momentum with Clique Averaging
Efficiently training high capacity models usually requires additional optimization techniques. In particular, momentum [28] increases the magnitude of the components of the gradient that are shared between several consecutive steps, and is critical for deep convolutional networks like LeNet [15, 8] to converge quickly. However, a direct application of momentum in a nonIID setting can actually be very detrimental. As illustrated in Figure 6(a) for the case of LeNet on CIFAR10 with 100 nodes, DCliques with momentum even fails to converge. Not using momentum actually gives a faster convergence, but there is a significant gap compared to the case of a single IID node with momentum.
We show here that Clique Averaging (Section 4.1) allows us to compute an unbiased momentum from the unbiased average gradient $g_i^{(k)}$ of Algorithm 2:
$$v_i^{(k)} \leftarrow m \, v_i^{(k-1)} + g_i^{(k)} \qquad (3)$$
where $v_i^{(k)}$ is the momentum of node $i$ at step $k$ and $m$ is the momentum coefficient. It then suffices to modify the original gradient step to use momentum:
$$x_i^{(k-\frac{1}{2})} \leftarrow x_i^{(k-1)} - \gamma \, v_i^{(k)} \qquad (4)$$
where $\gamma$ is the learning rate and $x_i^{(k-\frac{1}{2})}$ is the local model of node $i$ before the model averaging step.
As shown in Figure 6(b), this simple modification restores the benefits of momentum and closes the gap with the centralized setting.
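A minimal sketch of the momentum update built on the clique-averaged gradient is given below. It assumes the standard heavy-ball form (velocity equals the momentum coefficient times the previous velocity plus the current gradient); the function name and the argument names `lr` and `m` are illustrative choices.

```python
import numpy as np

def clique_momentum_step(models, velocity, clique_avg_grads, W, lr, m):
    """One iteration with momentum computed from clique-averaged gradients.

    clique_avg_grads: (n, d) array whose row i is the average gradient over
    node i's clique, so that the accumulated velocity covers all classes.
    Returns the new models and the new velocity.
    """
    velocity = m * velocity + clique_avg_grads  # momentum accumulation
    half_step = models - lr * velocity          # gradient step using the momentum
    return W @ half_step, velocity              # model averaging with neighbors

# Toy usage with two nodes and an identity mixing matrix.
models = np.zeros((2, 1))
velocity = np.ones((2, 1))
g = np.full((2, 1), 2.0)
models, velocity = clique_momentum_step(models, velocity, g, np.eye(2), lr=0.5, m=0.9)
```

The only difference from plain momentum SGD is the gradient fed into the velocity: using the clique average instead of a node's own single-class gradient keeps the accumulated direction from drifting toward one class.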
5 Comparative Evaluation and Extensions
In this section, we first compare DCliques to alternative topologies to confirm the relevance of our main design choices. Then, we evaluate some extensions of DCliques to further reduce the number of interclique connections so as to gracefully scale with the number of nodes.
5.1 Comparing DCliques to Other Sparse Topologies
We demonstrate the advantages of DCliques over alternative sparse topologies that have a similar number of edges. First, we consider topologies in which the neighbors of each node are selected at random (hence without any clique structure). Specifically, for 100 nodes, we construct a random topology such that each node has exactly 10 edges, which is similar to the average 9.9 edges of our DCliques topology (Figure 3(a)). To better understand the role of the clique structure beyond merely ensuring class representativity among neighbors, we also compare to a random topology similar to the one described above except that edges are chosen such that each node has neighbors of all possible classes. Finally, we also implement an analog of Clique Averaging for these random topologies, where all nodes debias their gradient based on the class distribution of their neighbors. In the latter case, since nodes do not form a clique, each node obtains a different average gradient.
The results for MNIST and CIFAR10 are shown in Figure 7. For MNIST, a purely random topology has higher variance and lower convergence speed than DCliques (with or without Clique Averaging), while a random topology with class representativity performs similarly to DCliques without Clique Averaging. However, and perhaps surprisingly, a random topology with unbiased gradient performs slightly worse than without it. In any case, DCliques with Clique Averaging outperforms all random topologies, showing that the clique structure has a small but noticeable effect on the average accuracy and significantly reduces the variance across nodes in this setup.
On the harder CIFAR10 dataset with a deep convolutional network, the differences are much more dramatic: DCliques with Clique Averaging and momentum turns out to be critical for fast convergence. Crucially, all random topologies fail to converge to a good solution. This confirms that our clique structure is important to reduce variance across nodes and improve the convergence. The difference with the previous experiment seems to be due to both the use of a higher capacity model and to the intrinsic characteristics of the datasets.
While the previous experiments suggest that our clique structure is instrumental in obtaining good performance, one may wonder whether intraclique full connectivity is actually necessary. Figure 8 shows the convergence speed of a DCliques topology where cliques have been sparsified by randomly removing 1 or 5 edges per clique (out of 45). Strikingly, both for MNIST and CIFAR10, removing just a single edge from the cliques has a significant effect on the convergence speed. On CIFAR10, it even entirely negates the benefits of DCliques.
Overall, these results show that achieving fast convergence on nonIID data with sparse topologies requires a very careful design, as we have proposed with DCliques.
5.2 Scaling up DCliques with Sparser InterClique Topologies
So far, we have used a fullyconnected interclique topology for DCliques, which has the advantage of bounding the average shortest path to between any pair of nodes. This choice requires interclique edges, which scales quadratically in the number of nodes. This can become significant at larger scales when is large compared to .
In this last series of experiments, we evaluate the effect of choosing sparser interclique topologies on the convergence speed for a larger network of 1000 nodes. We compare the scalability and convergence speed of several DCliques variants, which all use edges to create cliques as a starting point.
The interclique topology with (almost) fewest possible edges is a ring, which uses interclique edges and therefore scales linearly in . We also consider another topology that scales linearly and achieves a logarithmic bound on the average shortest number of hops between two nodes. In this hierarchical scheme that we call fractal, cliques are assembled in larger groups of cliques that are connected internally with one edge per pair of cliques, but with only one edge between pairs of larger groups. The topology is built recursively such that groups will themselves form a larger group at the next level up. This results in at most edges per node if edges are evenly distributed, and therefore also scales linearly in the number of nodes.
Finally, we propose to connect cliques according to a smallworldlike topology [31] applied on top of a ring [27]. In this scheme, cliques are first arranged in a ring. Then each clique adds symmetric edges, both clockwise and counterclockwise on the ring, with the closest cliques in sets of cliques that are exponentially bigger the further they are on the ring (see Algorithm 4 in the appendix for details on the construction). This ensures a good connectivity with other cliques that are close on the ring, while still keeping the average shortest path small. This scheme uses interclique edges and therefore grows in the order of with the number of nodes.
Figure 9 shows the convergence speed of all the above schemes on MNIST and CIFAR10, compared to the ideal baseline of a single IID node performing the same number of updates per epoch (representing the fastest convergence speed achievable if topology had no impact). The ring topology converges but is much slower, while our fractal scheme helps significantly. The sweet spot appears to be the smallworld topology, as the convergence speed is almost the same as with a fullyconnected interclique topology but with 22% fewer edges (14.5 edges on average instead of 18.9). Note that we can expect bigger gains at larger scales. Nonetheless, we stress the fact that even the fullyconnected interclique topology offers significant benefits with 1000 nodes, as it represents a 98% reduction in the number of edges compared to fully connecting individual nodes (18.9 edges on average instead of 999) and a 96% reduction in the number of messages (37.8 messages per round per node on average instead of 999). We refer to Appendix B for additional results comparing the convergence speed across different numbers of nodes. Overall, these results show that DCliques can nicely scale with the number of nodes.
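The edge and message counts quoted above for a fullyconnected interclique topology can be reproduced with a few lines of arithmetic. The sketch assumes cliques of exactly one node per class and one edge per pair of cliques; the factor of two for messages is inferred from the reported 37.8 = 2 × 18.9 (gradients and models sent separately) and is an assumption about how messages are counted.

```python
def dcliques_edge_stats(n, c):
    """Average edges and messages per node for D-Cliques vs. a full mesh.

    n: number of nodes, c: number of classes (= clique size).
    Returns (edges_per_node, edge_reduction, messages_per_node, message_reduction).
    """
    n_cliques = n // c
    intra = n_cliques * c * (c - 1) // 2     # fully connected cliques
    inter = n_cliques * (n_cliques - 1) // 2  # one edge per pair of cliques
    edges_per_node = 2 * (intra + inter) / n
    full_mesh_per_node = n - 1
    messages_per_node = 2 * edges_per_node    # assumed: two rounds of messages
    return (edges_per_node,
            1 - edges_per_node / full_mesh_per_node,
            messages_per_node,
            1 - messages_per_node / full_mesh_per_node)

for n in (100, 1000):
    e, e_red, msg, msg_red = dcliques_edge_stats(n, c=10)
    print(f"n={n}: {e:.1f} edges/node ({e_red:.0%} fewer), "
          f"{msg:.1f} messages/node ({msg_red:.0%} fewer)")
```

The number of interclique edges grows quadratically with the number of cliques, which is why it dominates at 1000 nodes (4950 interclique edges against 4500 intra-clique edges) and motivates the sparser interclique topologies evaluated in this section.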
6 Related Work
In this section, we review some related work on dealing with nonIID data in federated learning, and on the role of topology in fully decentralized algorithms.
Dealing with nonIID data in serverbased FL.
NonIID data is not much of an issue in serverbased FL if clients send their parameters to the server after each gradient update. Problems arise when one seeks to reduce the number of communication rounds by allowing each participant to perform multiple local updates, as in the popular FedAvg algorithm [23]. Indeed, nonIID data can prevent such algorithms from converging to a good solution [8, 11]. This led to algorithms specifically designed to mitigate the impact of nonIID data while performing multiple local updates, using adaptive client sampling [8], update corrections [11] or regularization in the local objective [18]. Another direction is to embrace the nonIID scenario by learning personalized models for each client [26, 6, 5, 2]. We note that recent work explores rings of serverbased topologies [17], but the focus is not on dealing with nonIID data but on making serverbased FL more scalable to a large number of clients.
Dealing with nonIID data in fully decentralized FL.
NonIID data is known to negatively impact the convergence speed of fully decentralized FL algorithms in practice [7]. Aside from approaches that aim to learn personalized models [30, 33], this motivated the design of algorithms with modified updates based on variance reduction [29], momentum correction [21], crossgradient aggregation [4], or multiple averaging steps between updates (see [13] and references therein). These algorithms typically require significantly more communication and/or computation, and have only been evaluated on smallscale networks with a few tens of nodes. [Footnote 3: We also observed that [29] is subject to numerical instabilities when run on topologies other than rings. When the rows and columns of $W$ do not exactly sum to $1$ (due to finite precision), these small differences get amplified by the proposed updates and make the algorithm diverge.] In contrast, DCliques focuses on the design of a sparse topology which is able to compensate for the effect of nonIID data and scales to large networks. We do not modify the simple and efficient DSGD algorithm [19] beyond removing some neighbor contributions that otherwise bias the gradient direction.
Impact of topology in fully decentralized FL.
It is well known that the choice of network topology can affect the convergence of fully decentralized algorithms. In theoretical convergence rates, this is typically accounted for by a dependence on the spectral gap of the network, see for instance [3, 1, 19, 24]. However, for IID data, practice contradicts these classic results as fully decentralized algorithms have been observed to converge essentially as fast on sparse topologies like rings or grids as they do on a fully connected network [19, 20]. Recent work [25, 13] sheds light on this phenomenon with refined convergence analyses based on differences between gradients or parameters across nodes, which are typically smaller in the IID case. However, these results do not give any clear insight regarding the role of the topology in the nonIID case. We note that some work has gone into designing efficient topologies to optimize the use of network resources (see e.g., [22]), but the topology is chosen independently of how data is distributed across nodes. In summary, the role of topology in the nonIID data scenario is not well understood and we are not aware of prior work focusing on this question. Our work is the first to show that an appropriate choice of datadependent topology can effectively compensate for nonIID data.
7 Conclusion
We proposed DCliques, a sparse topology that recovers the convergence speed of a fullyconnected network in the presence of local class bias. DCliques is based on assembling subsets of nodes into cliques such that the cliquelevel class distribution is representative of the global distribution, thereby locally recovering IIDness. Cliques are joined in a sparse interclique topology so that they quickly converge to the same model. We proposed Clique Averaging to remove the nonIID bias in gradient computation by averaging gradients only with other nodes within the clique. Clique Averaging can in turn be used to implement unbiased momentum to recover the convergence speed usually only possible with IID minibatches. Through our experiments, we showed that the clique structure of DCliques is critical in obtaining these results and that a smallworld interclique topology with only edges achieves the best compromise between convergence speed and scalability with the number of nodes.
DCliques thus appears to be very promising to reduce bandwidth usage on FL servers and to implement fully decentralized alternatives in a wider range of applications where global coordination is impossible or costly. For instance, the presence and relative frequency of classes in each node could be computed using PushSum [12], and the topology could be constructed in a decentralized and adaptive way with PeerSampling [9]. This will be investigated in future work. We also believe that our ideas can be useful to deal with more general types of data nonIIDness beyond the important case of local class bias that we studied in this paper. An important example is covariate shift or feature distribution skew [10], for which local density estimates could be used as a basis to construct cliques that approximately recover the global distribution.
8 Acknowledgments
This research was partially supported by French grants ANR16CE230016 (Project PAMELA) and ANR20CE230015 (Project PRIDE), and by the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement No. 825081 COMPRISE.
References
 [1] (2016) Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions. In ICML, Cited by: §6.
 [2] (2020) Personalized Federated Learning with Moreau Envelopes. In NeurIPS, Cited by: §6.
 [3] (2012) Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606. Cited by: §6.
 [4] (2021) CrossGradient Aggregation for Decentralized Learning from NonIID data. Technical report arXiv:2103.02051. Cited by: §1, §6.
 [5] (2020) Personalized Federated Learning with Theoretical Guarantees: A ModelAgnostic MetaLearning Approach. In NeurIPS, Cited by: §6.
 [6] (2020) Lower Bounds and Optimal Algorithms for Personalized Federated Learning. In NeurIPS, Cited by: §6.
 [7] (2021) Decentralized learning works: An empirical comparison of gossip learning and federated learning. Journal of Parallel and Distributed Computing 148, pp. 109–124. Cited by: §6.
 [8] (2020) The NonIID Data Quagmire of Decentralized Machine Learning. In ICML, Cited by: §1, §1, §2.2.2, §4.2, §6.
 [9] (2007) Gossipbased peer sampling. ACM Transactions on Computer Systems (TOCS) 25 (3), pp. 8–es. Cited by: §7.
 [10] (2019) Advances and Open Problems in Federated Learning. Technical report arXiv:1912.04977. Cited by: §1, §1, §1, §7.
 [11] (2020) SCAFFOLD: Stochastic Controlled Averaging for OnDevice Federated Learning. In ICML, Cited by: §1, §1, §6.
 [12] (2003) Gossipbased Computation of Aggregate Information. Foundations of Computer Science. Cited by: §7.

 [13] (2021) Consensus Control for Decentralized Deep Learning. Technical report arXiv:2102.04828. Cited by: §1, §6, §6.
 [14] (2009) Learning Multiple Layers of Features from Tiny Images. Note: https://www.cs.toronto.edu/~kriz/learningfeatures2009TR.pdf Cited by: §1, §2.2.2.
 [15] (1998) Gradientbased Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.
 [16] (2020) The MNIST database of handwritten digits. Note: http://yann.lecun.com/exdb/mnist/ Cited by: §1, §2.2.2.
 [17] (2020) TornadoAggregate: accurate and scalable federated learning via the ringbased architecture. Technical report arXiv:2012.03214. Cited by: §6.
 [18] (2020) Federated Optimization in Heterogeneous Networks. In MLSys, Cited by: §1, §6.
 [19] (2017) Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS, Cited by: §1, §1, §2.1, §6, §6.
 [20] (2018) Asynchronous Decentralized Parallel Stochastic Gradient Descent. In ICML, Cited by: §1, §6.
 [21] (2021) QuasiGlobal Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data. Technical report arXiv:2102.04761. Cited by: §1, §6.
 [22] (2020) ThroughputOptimal Topology Design for CrossSilo Federated Learning. In NeurIPS, Cited by: §6.
 [23] (2017) Communicationefficient learning of deep networks from decentralized data. In AISTATS, Cited by: §1, §6.
 [24] (2018) Network Topology and CommunicationComputation Tradeoffs in Decentralized Optimization. Proceedings of the IEEE 106 (5), pp. 953–976. Cited by: §6.
 [25] (2020) Decentralized gradient methods: does topology matter?. In AISTATS, Cited by: §1, §1, §6.
 [26] (2017) Federated MultiTask Learning. In NIPS, Cited by: §6.
 [27] (2003) Chord: a scalable peertopeer lookup protocol for internet applications. IEEE/ACM Transactions on networking 11 (1), pp. 17–32. Cited by: §5.2.
 [28] (2013) On the importance of initialization and momentum in deep learning. In ICML, Cited by: §4.2.
 [29] (2018) D²: Decentralized Training over Decentralized Data. In ICML, Cited by: §1, §6, footnote 3.
 [30] (2017) Decentralized Collaborative Learning of Personalized Models over Networks. In AISTATS, Cited by: §6.
 [31] (2000) Small worlds: the dynamics of networks between order and randomness. Princeton University Press. Cited by: §5.2.
 [32] (2004) Fast linear iterations for distributed averaging. Systems & Control Letters 53 (1), pp. 65–78. Cited by: §3.
 [33] (2020) Fully Decentralized Joint Learning of Personalized Models and Collaboration Graphs. In AISTATS, Cited by: §6.
Appendix 0.A Detailed Algorithms
We present a more detailed and precise explanation of the two main algorithms of the paper: one to construct DCliques (Algorithm 3) and one to establish a smallworld interclique topology (Algorithm 4).
0.a.1 DCliques Construction
Algorithm 3 shows the overall approach for constructing a DCliques topology in the nonIID case. (Footnote 4: An IID version of DCliques, in which each node has an equal number of examples of all classes, can be implemented by picking nodes per clique at random.) It expects the following inputs: the set of all classes present in the global distribution; the set of all nodes; a function that, given a subset of nodes, returns the set of classes in their joint local distributions; a function that, given a set of cliques (a set of sets of nodes), creates a set of edges connecting all nodes within each clique to one another; a function interconnect that, given a set of cliques, creates a set of edges connecting nodes belonging to different cliques; and a function that, given a set of edges, returns the weighted matrix. Algorithm 3 returns both the weighted matrix, for use in DSGD (Algorithms 1 and 2), and the set of cliques, for use with Clique Averaging (Algorithm 2).
The implementation builds a single clique by adding nodes with different classes until all classes of the global distribution are represented. Cliques are built sequentially in this way until every node is part of a clique. Because all classes are represented on an equal number of nodes, every clique contains nodes of all classes. Furthermore, since each node has examples of a single class, a greedy assignment is guaranteed to be valid. After the cliques are created, edges are added and weights are assigned to them using the corresponding input functions.
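The greedy construction described above can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 3: the names `build_cliques`, `intraconnect`, and `node_class` are hypothetical, and the sketch assumes, as the text does, that every node holds a single class and that every class is held by the same number of nodes.

```python
import random
from itertools import combinations


def build_cliques(node_class, classes, seed=0):
    """Greedily group single-class nodes into cliques that cover every class.

    node_class: dict mapping a node id to the single class that node holds
                (assumption: one class per node, equal node count per class).
    classes:    iterable of all classes in the global distribution.
    """
    rng = random.Random(seed)
    # Bucket the nodes by class, shuffled so that the grouping is randomized.
    by_class = {c: [] for c in classes}
    for node, c in node_class.items():
        by_class[c].append(node)
    for nodes in by_class.values():
        rng.shuffle(nodes)

    cliques = []
    # Each pass takes one node from every class bucket. With equally sized
    # buckets, all buckets become empty after the same pass.
    while by_class and all(by_class.values()):
        cliques.append({by_class[c].pop() for c in by_class})
    return cliques


def intraconnect(cliques):
    """Return the undirected edges that fully connect each clique internally."""
    return {
        frozenset(pair)
        for clique in cliques
        for pair in combinations(sorted(clique), 2)
    }
```

With 100 nodes and 10 classes, for example, `build_cliques({i: i % 10 for i in range(100)}, range(10))` yields 10 cliques of 10 nodes each. The weight assignment is left out because the text only specifies it as an input function.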
0.a.2 Smallworld Interclique Topology
Algorithm 4 instantiates the function interconnect with a smallworld interclique topology as described in Section 5.2. It first arranges the cliques on a ring, which adds a linear number of interclique edges. It then adds, for each clique, a logarithmic number of “finger” edges to other cliques on the ring: the other cliques are grouped into sets that grow exponentially with their distance on the ring, and a constant number of edges is added per set. “Finger” edges are added symmetrically on both sides of the ring, to the closest cliques in each set.
Algorithm 4 expects a set of cliques, previously computed by Algorithm 3; a neighborhood size, which is the number of finger edges to add per set of cliques; and a function least_edges, which, given a set of nodes and an existing set of edges, returns one of the nodes in that set with the fewest edges. It returns a new set of edges containing all the edges added by the smallworld topology.
The implementation first arranges the cliques in a list, which represents the ring. Traversing the list with increasing indices is equivalent to traversing the ring in the clockwise direction, and traversing it with decreasing indices is equivalent to the counterclockwise direction. Then, for every clique on the ring, taken as the origin from which distances to the other cliques are computed, a number of edges are added. All other cliques are implicitly arranged in mutually exclusive sets whose size and offset grow exponentially (doubling at every step). For each of these sets, edges are added in both the clockwise and counterclockwise directions, always on the nodes with the fewest edges in each clique. The ring edges are added implicitly, as the edges to the cliques at the smallest offset in both directions.
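A rough Python sketch of this procedure is given below. It is not the paper's Algorithm 4: the exact offsets and set sizes are not recoverable from the text, so the sketch assumes sets at offsets 1, 2, 4, ... whose size equals their offset, and it links any pair of cliques at most once. The function name `smallworld_interconnect` and the parameter `c` (edges per set) are illustrative, and `least_edges` here takes a degree table instead of an edge set, counting only interclique edges, which is equivalent as long as all nodes of a clique have the same number of intraclique edges.

```python
def least_edges(clique, degree):
    """Return a node of the clique with the fewest interclique edges so far."""
    return min(sorted(clique), key=lambda n: degree.get(n, 0))


def smallworld_interconnect(cliques, c=2):
    """Add ring and "finger" edges between disjoint cliques arranged on a ring."""
    ring = list(cliques)   # list index = position on the ring
    m = len(ring)
    edges = set()          # undirected edges, stored as frozensets of two nodes
    degree = {}            # interclique edges per node
    linked = set()         # pairs of clique indices that are already connected

    def connect(i, j):
        pair = frozenset((i, j))
        if i == j or pair in linked:
            return
        linked.add(pair)
        a = least_edges(ring[i], degree)
        b = least_edges(ring[j], degree)
        edges.add(frozenset((a, b)))
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    for i in range(m):
        offset = 1
        while offset < m:
            # Assumed set layout: the set at this offset holds `offset` cliques,
            # and edges go to the (at most c) closest cliques in it.
            for k in range(min(c, offset)):
                connect(i, (i + offset + k) % m)   # clockwise
                connect(i, (i - offset - k) % m)   # counterclockwise
            offset *= 2                            # next set: twice as far
    return edges
```

Under these assumptions, the set at offset 1 contains only the adjacent clique in each direction, so the ring edges appear without a separate step. Each clique receives a number of edges proportional to `c` times the logarithm of the number of cliques.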
Appendix 0.B Additional Experiments on Scaling Behavior with Increasing Number of Nodes
Section 5.2 compares the convergence speed of various interclique topologies at a scale of 1000 nodes. In this section, we show the effect of scaling the number of nodes, by comparing the convergence speed with 1, 10, 100, and 1000 nodes, and adjusting the batch size to maintain a constant number of updates per epoch. We present results for Ring, Fractal, Smallworld, and FullyConnected interclique topologies.
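The batch size adjustment can be illustrated with a short calculation. The sketch assumes that every node processes one batch per step, so that the number of updates per epoch is the dataset size divided by the product of the number of nodes and the per-node batch size. The total batch size of 2000 and the helper names are hypothetical and are not the values used in the experiments; 60,000 is the size of the MNIST training set.

```python
def per_node_batch_size(total_batch_size, num_nodes):
    """Split a fixed total batch size evenly across the nodes."""
    if total_batch_size % num_nodes != 0:
        raise ValueError("total batch size must be divisible by the number of nodes")
    return total_batch_size // num_nodes


def updates_per_epoch(dataset_size, num_nodes, batch_size):
    """Number of steps needed for all nodes together to see the dataset once."""
    return dataset_size // (num_nodes * batch_size)


# Hypothetical total batch size of 2000 examples per step, on 60,000 examples.
for n in (1, 10, 100, 1000):
    b = per_node_batch_size(2000, n)
    print(n, b, updates_per_epoch(60_000, n, b))
```

Shrinking the per-node batch size in proportion to the number of nodes keeps the update count per epoch identical across scales, which makes the accuracy curves directly comparable.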
Figure 10 shows the results for MNIST. For all topologies, we notice a perfect scaling up to 100 nodes, i.e. the accuracy curves overlap, with low variance between nodes. Starting at 1000 nodes, there is a significant increase in variance between nodes and the convergence is slower, only marginally for FullyConnected but significantly so for Fractal and Ring. Smallworld has higher variance between nodes but maintains a convergence speed close to that of FullyConnected.
Figure 11 shows the results for CIFAR10. When increasing from 1 to 10 nodes (resulting in a single fullyconnected clique), there is actually a small increase both in final accuracy and convergence speed. We believe this increase is due to the gradient being computed with exactly the same number of examples from all classes with 10 fullyconnected nonIID nodes, while the gradient for a single nonIID node may have a slightly larger bias because the random sampling does not guarantee the representation of all classes perfectly in each batch. At a scale of 100 nodes, there is no difference between FullyConnected and Fractal, as the connections are the same; however, a Ring already shows a significantly slower convergence. At 1000 nodes, the convergence significantly slows down for Fractal and Ring, while remaining close, albeit with a larger variance, for FullyConnected. Similar to MNIST, Smallworld has higher variance and slightly lower convergence speed than FullyConnected but remains very close.
We therefore conclude that FullyConnected and Smallworld have good scaling properties in terms of convergence speed, and that the linearlogarithmic number of edges of Smallworld makes it the best compromise between convergence speed and connectivity, and thus the best choice for efficient largescale decentralized learning in practice.