Decentralized Federated Learning: A Segmented Gossip Approach

by   Chenghao Hu, et al.
Tsinghua University
Jilin University

The emerging concern about data privacy and security has motivated the proposal of federated learning, which allows nodes to synchronize only the locally-trained models instead of their own original data. Conventional federated learning architecture, inherited from the parameter server design, relies on highly centralized topologies and the assumption of large nodes-to-server bandwidths. However, in real-world federated learning scenarios the network capacities between nodes are highly uniformly distributed and smaller than that in a datacenter. It is very challenging for conventional federated learning approaches to efficiently utilize network capacities between nodes. In this paper, we propose a model segment level decentralized federated learning to tackle this problem. In particular, we propose a segmented gossip approach, which not only makes full utilization of node-to-node bandwidth, but also has good training convergence. The experimental results show that the training time can be highly reduced as compared to centralized federated learning.





1 Introduction

Recent years have witnessed a rapid growth of deep learning algorithms which achieve and even transcend human-level accuracy on natural language processing and computer vision [7, 9], thanks to the massive amount of data collected. To improve deep learning performance, there is great demand for different entities to contribute their own data and train models together. In such collaborative training, the concern about data leakage has motivated federated learning [14], which allows nodes to synchronize only the locally-trained models instead of their own original data.

A general federated learning system uses a central parameter server to coordinate the large federation of participating workers (workers and nodes are used interchangeably in this paper). The workers train a local model with their own dataset and periodically send the model updates (e.g., gradients or parameters) to a centralized server for synchronization. To reduce the risk of a single point of failure, a couple of decentralized synchronization methods have been proposed. All-reduce [16] adopts an all-to-all scheme, i.e., each worker sends the local model updates to all other workers. It achieves the same synchronization effect as the parameter server but consumes much bandwidth between workers. When the model updates from all nodes in the system are sent to all other nodes, the performance is highly degraded. To reduce the transmission cost, gossip-based model synchronization [6, 8] has been proposed: workers send local updates to only one or a group of selected nodes.

In real-world federated learning scenarios, the network capacities between nodes are highly uniformly distributed and smaller than that in a datacenter [17]. Thus, it is still extremely bandwidth costly when workers send the full model updates (e.g., the size can be up to MB in [7]). An intuitive question is then, is it possible for workers to synchronize the model partially, from/to only a part of the workers, and still achieve good training results?

Our answer to this question is a novel decentralized federated learning design, introducing a segmented gossip approach, which not only makes full utilization of node-to-node bandwidth by transmitting model segmentations in a peer-to-peer manner, but also has good training convergence by carefully forming dynamical synchronization gossiping groups. In particular, the details of the design and the contributions are summarized as follows.

First, we propose a model segmentation level synchronization mechanism. We “split” a model into a set of segmentations, i.e., non-overlapping subsets that contain the same number of model parameters. Workers perform segmentation level updates by aggregating a local segmentation with the corresponding segmentation from a number of other workers. Based on our analysis, this number can be much smaller than the number of all workers while still achieving good convergence of the training process.

Second, we propose a decentralized federated learning design, borrowing the idea from gossip protocol; each worker stochastically selects a few workers to transfer the model segment for each training iteration. Our objective is to maximize the utilization of bandwidth capacities between workers. To improve the convergence performance of our solution, we introduce “Model Replica” to guarantee that enough information from different workers is acquired during aggregation.

Third, we implement the model segmentation strategy and the gossiping strategy into a prototype called Combo, and design experiments to evaluate its performance. Our results show that our design significantly reduces the training time in practical network topology and bandwidth setup, with only slight accuracy degradation.

2 Related work

Distributed ML

Conventional distributed machine learning systems are centralized: workers periodically send the local updates to one or a set of parameter servers (PS), as in SparkNet [15], TensorFlow [1] and traditional federated learning systems [10, 11]. To avoid the bottleneck and the single point of failure, [13, 5] aim to scale PS for better network utilization. Although these scaling methods could increase the accumulative bandwidth at the server side, they still suffer from long convergence time when the network is poor.

An alternative solution is a decentralized architecture, in which the workers exchange updates directly using an all-reduce scheme, with a communication cost of $O(N^2)$ for $N$ workers. To reduce the huge communication costs, an intuitive approach is to take advantage of the topology. Baidu first introduced Ring-allreduce, which is a bandwidth-optimal way to do an allreduce. The workers involved are arranged in a ring; each worker sends gradients to the next clockwise worker and receives from the previous one. In this way, it reduces the communication complexity to linear growth in scale. Similarly, the tree [12] and graph [2] topologies have been proposed to reduce the communication cost. However, these approaches may need multiple hops between workers, resulting in slow convergence. Instead of a topology-based method, Ako [18] proposes a partial gradient update method. In each synchronization round, each worker sends a gradient partition to every other worker. Ako thus reduces the synchronization time, and its communication overhead depends on the partition size and the number of workers.

Although these existing approaches perform well in distributed ML, they aggregate gradients every epoch, which still incurs heavy communication cost and is not practical in federated learning with slow internet connections.

Communication efficient FL

A main research focus of federated learning is to reduce the communication cost. [11] propose structured updates and sketched updates to reduce the exchanged data size at the cost of accuracy loss. [14] propose the federated averaging algorithm (FedAvg) to reduce the parameter updates significantly. FedAvg aggregates parameters after several epochs. In each synchronization round, it selects a fraction of workers and computes the gradient of the loss over all the data held by these workers. These methods are based on the PS architecture, which faces network congestion when the updates arrive at the PS concurrently.

Gossip protocol in ML

The gossip protocol is widely used in distributed systems [3, 8]: each worker sends out a message to a set of other workers, and the message propagates through the whole network worker by worker. [4] first introduced the gossip protocol in deep learning. They propose GoSGD, using a sum-weight gossip protocol to share the updates with selected workers. The results show good consensus convergence properties. [6] propose GossipGraD, a gossip-based SGD algorithm for large scale deep learning systems that reduces the communication complexity to $O(1)$.

However, in federated learning, network connections between geo-distributed workers usually could not be fully utilized because of the bottleneck, which is ignored in these approaches.

3 Segmented Gossip Aggregation

Now consider the network topology with $N$ workers. An all-reduce worker pushes $N-1$ local model replicas to the other workers through $N-1$ links, while a gossip worker is expected to push one local model replica out through only one link. Within a datacenter where the workers are connected by the local area network, they can always communicate with each other at maximum bandwidth; thus the gossip worker can achieve a great speedup as the transmission size is drastically reduced.

However, in the federated learning context where the workers are geo-distributed, the real bandwidth between the workers is typically small due to the potential bottleneck of the WAN. Thus the traditional gossip-based schemes cannot make full use of the worker’s bandwidth because the transmissions are limited to one or a few links. We propose Segmented Gossip Aggregation to solve this problem by “splitting” the transmission task and feeding the parts into more links.

3.0.1 Segmented Pulling

Fig. 1(a) illustrates the transfer procedure of segmented gossip aggregation, which we name segmented pulling. In the aggregation phase, the worker needs to receive the model parameters from others. While the naive gossip-based synchronization schemes require the worker to collect the whole set of model parameters, segmented pulling allows the worker to pull different parts of the model parameters from different workers and rebuild a mixed model for aggregation.

Let $W$ denote the model parameters. The worker first breaks $W$ into $S$ non-overlapping segments such that

$$W = \big(W[1], W[2], \dots, W[S]\big). \tag{1}$$

For each segment $l$, the worker chooses a peer worker, which we denote as $j_l$, and then actively pulls the corresponding segment $W_{j_l}[l]$ from it. Note that this step is parallelized to make full use of the bandwidth. When the worker has fetched all the model segments, a new mixed model $\tilde{W}$ can be rebuilt from the segments such that

$$\tilde{W} = \big(W_{j_1}[1], W_{j_2}[2], \dots, W_{j_S}[S]\big). \tag{2}$$

The naive gossip-based scheme pulls all the segments from a single peer worker. However, with segmented pulling, if we choose a different peer for each segment, the total transmission size is still equal to one model, as in the naive gossip-based schemes, but the traffic is spread over not one but $S$ links.
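As a minimal sketch of segmented pulling (not the Combo implementation), the following Python code splits a flat parameter list into contiguous segments and rebuilds one mixed model by taking each segment from a different peer. The names `split_segments` and `pull_mixed_model`, the in-memory `peer_models` dictionary standing in for network transfers, and the handling of a parameter count that is not divisible by the number of segments are all assumptions made for illustration.

```python
import random


def split_segments(params, num_segments):
    """Split a flat parameter list into contiguous, non-overlapping segments."""
    base, extra = divmod(len(params), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # Assumption: leftover parameters are spread over the first segments.
        end = start + base + (1 if i < extra else 0)
        segments.append(params[start:end])
        start = end
    return segments


def pull_mixed_model(peer_models, num_segments, rng):
    """Rebuild one mixed model, pulling each segment from a distinct peer."""
    # One distinct peer per segment; needs at least num_segments peers.
    peers = rng.sample(sorted(peer_models), num_segments)
    mixed = []
    for seg_idx, peer in enumerate(peers):
        # In a real system this would be a network pull of a single segment.
        mixed.extend(split_segments(peer_models[peer], num_segments)[seg_idx])
    return mixed, peers


# Hypothetical peers w1..w4, each holding a 6-parameter model.
peer_models = {f"w{i}": [float(i)] * 6 for i in range(1, 5)}
mixed, peers = pull_mixed_model(peer_models, num_segments=3, rng=random.Random(0))
```

The mixed model has the same size as a single model, but its three segments come from three different peers, so the transfer can use three links in parallel.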

3.0.2 Model Replica

In the traditional distributed ML scenario within the datacenter, gossip-based solutions can choose only one other worker for aggregation but still achieve excellent convergence, because the workers “gossip” with each other frequently such that the updates of each worker are propagated through the whole network before they become too stale [6]. However, for communication efficient FL systems, the staleness of the model updates is hard to bound as the models are trained separately for up to a few epochs.

Thus, as a compromise, we set a hyper-parameter Model Replica $R$, which represents the number of mixed models gathered by segmented pulling. To rebuild $R$ mixed models, the worker will pull $S \times R$ segments from peers, where $S$ is the number of segments per model. Thus increasing the value of $R$ means more segments have to be transferred through the network, which may cause bandwidth overhead. But this is necessary to accelerate the propagation and ensure the model quality. Since there is no centralized server bottleneck, the model training speed could still be faster even with extra transmission.

(a) Segmented Pulling
(b) Segmented Aggregation
Figure 1: Segmented Gossip Aggregation

3.0.3 Segmented Aggregation

Typically the model aggregation uses weighted averaging of the received model parameters with the worker’s dataset size as weight. But in segmented gossip aggregation, the mixed models are patched together from different workers, so it is hard to set a reasonable weight for the mixed model as a whole. For such case, we use a segment-wise model aggregation.

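Assume the worker has fetched all the segments and rebuilt $R$ mixed models, which we represent as $\tilde{W}^1, \dots, \tilde{W}^R$. Then for each segment $l$, we have $R$ mixed models and one local model to aggregate. Let $P_l$ denote the set of workers which provide segment $l$ (the worker itself is included) and $|D_j|$ denote the dataset size of worker $j$; then we can aggregate segment $l$ by:

$$\bar{W}[l] = \frac{\sum_{j \in P_l} |D_j| \, W_j[l]}{\sum_{j \in P_l} |D_j|}. \tag{3}$$

Combining all the aggregated segments, we can rebuild the final aggregation result by

$$\bar{W} = \big(\bar{W}[1], \bar{W}[2], \dots, \bar{W}[S]\big). \tag{4}$$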
And then the worker can continue its training until next aggregation phase comes.
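To make the segment-wise aggregation concrete, here is a small Python sketch under the stated rule that each segment is averaged over its providers with dataset sizes as weights. The function names, the data layout (`pulled[seg_idx]` is a list of `(dataset_size, values)` pairs) and the numbers in the example are illustrative assumptions, not taken from the Combo prototype.

```python
def aggregate_segment(contributions):
    """Dataset-size-weighted average of one segment.

    contributions: list of (dataset_size, values) pairs, one per provider.
    """
    total = sum(size for size, _ in contributions)
    length = len(contributions[0][1])
    return [
        sum(size * values[k] for size, values in contributions) / total
        for k in range(length)
    ]


def segmented_aggregate(local_size, local_segments, pulled):
    """Aggregate every segment separately and concatenate the results."""
    result = []
    for seg_idx, local_values in enumerate(local_segments):
        # The local model always takes part in the aggregation of each segment.
        contributions = [(local_size, local_values)] + pulled[seg_idx]
        result.extend(aggregate_segment(contributions))
    return result


# Hypothetical example: 2 segments, 2 replicas per segment, local dataset of 100 samples.
local_segments = [[1.0, 1.0], [1.0, 1.0]]
pulled = [
    [(300, [4.0, 4.0]), (100, [2.0, 2.0])],  # providers of segment 0
    [(100, [3.0, 3.0]), (200, [0.0, 0.0])],  # providers of segment 1
]
model = segmented_aggregate(100, local_segments, pulled)
# model == [3.0, 3.0, 1.0, 1.0]
```

Because the weights are normalized per segment, segments provided by different sets of workers can be combined without assigning a single weight to a mixed model as a whole.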

4 Combo Design

In this section, we introduce Combo, a decentralized federated learning system based on segmented gossip aggregation. We firstly present the implementation details of Combo, then discuss how it handles the dynamic nature of FL workers, and finally, we give a brief analysis of the convergence of Combo.

4.1 Implementation Details

Since Combo is a decentralized FL system, we focus on the design of the workers, as the participation of the centralized server during training is trivial. However, it is important to note that before the training starts, the server has to initialize the model parameters of each worker with the same value; otherwise the training may fail to converge. The server has the information of all the workers, and while initializing the parameters, it also broadcasts the worker list.

A Combo worker follows a stateful training process as illustrated by the numbered steps in Fig. 2. At each iteration, the workers (1) update the model with the local dataset and meanwhile (2) send the segment pulling requests to other workers. Once the update is finished, they (3) send the segments to the requestors in response to the pulling requests. When all the pulling requests are satisfied, the workers (4) aggregate the model segments and start the next iteration. Next, we describe the implementation details of these steps.

(1) Local Update.

The learning process starts with the worker updating the model with the local dataset. The worker takes the aggregation result of the last iteration as the input model and updates it using stochastic gradient descent (SGD) with the local data. To reduce the communication cost, the local update may contain multiple SGD rounds before the communication with other workers. We refer to the number of SGD rounds between two communications as the communication interval, which, in typical FL systems, could be up to a few epochs.

(2) Segments Pulling. The workers first decide how to partition the model. They do not have to follow the same partition rule, but for simplicity, we assume they partition the model into segments in the same way. For each segment, the worker has to select $R$ peers (one per model replica) and send each of them a pulling request, which contains a segment description and a unique identifier of the worker to indicate which part of the model is to be sent and to whom it is supposed to be sent.

Each worker has to send $S \times R$ segment pulling requests to the other workers (one per segment and replica), and Combo tries to distribute these requests evenly among all the workers to engage more links and balance the transmission workload. Thus for each request, the target worker is randomly selected from all the other workers without replacement until there is no option left, which means that when $S \times R \le N - 1$ for $N$ workers, all the segments come from different workers. Note that for each iteration, the pulling requests can be sent even before the local update starts; in this way, the target workers can send the segments immediately when the local model is ready.
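The target selection described above can be sketched in Python as follows. This is an illustrative reading of "randomly selected without replacement until there is no option left"; the function name `choose_pull_targets`, the refill-and-reshuffle behavior once every peer has been used, and the worker identifiers are assumptions.

```python
import random


def choose_pull_targets(worker_id, all_workers, num_segments, num_replicas, rng):
    """Return targets[replica][segment]: the peer to pull that segment from."""
    others = [w for w in all_workers if w != worker_id]
    pool = []
    targets = []
    for _ in range(num_replicas):
        row = []
        for _ in range(num_segments):
            if not pool:
                # Every peer has been used once: start a fresh shuffled pass.
                pool = others[:]
                rng.shuffle(pool)
            row.append(pool.pop())
        targets.append(row)
    return targets


# Hypothetical federation of 30 workers, 10 segments and 2 replicas (20 requests).
workers = [f"w{i}" for i in range(30)]
targets = choose_pull_targets("w0", workers, 10, 2, random.Random(1))
flat = [peer for row in targets for peer in row]
```

With 20 requests and 29 candidate peers, every request goes to a different worker. When the number of requests exceeds the number of peers, this sketch reuses peers after a full pass and does not prevent the same peer from serving the same segment twice; a production design would need to handle that case.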

(3) Segments Sending. The sending procedure is a twin action of the segments pulling. When the worker finishes the local update, it is ready to send its update result to others. Rather than actively pushing the model, the worker only dispatches the model segments according to the received pulling requests.

(4) Model Aggregation. While the worker is providing the model segments to others, it is also receiving the segments it has requested previously. The model aggregation phase is blocked until all the pulling requests are satisfied; then the worker aggregates the external model segments with the local model using (3) and puts the aggregated segments together to rebuild the model. With the aggregation result, the worker goes back to the first step and starts the next training iteration.

Figure 2: The architecture of Combo workers

4.2 Dynamic Workers

In the context of federated learning, the participating workers are more likely to be mobile phones and embedded devices, which are often not connected to a power supply and stable network. Thus the workers in FL system are highly dynamic and unstable, and they can join and exit the federation at any time.

Traditional distributed systems adopt heartbeat packets and time thresholds to check the status of the workers. However, these methods are not applicable to FL systems for the following two reasons: 1) the server has to maintain a heartbeat connection with all the participating workers, which limits the scalability of the system; 2) the computation times of the workers vary significantly due to differences in the computing devices and network environments.

Fortunately, the design of Combo allows us to solve this problem decently. If the worker exits accidentally, the pulling requests it sends to other workers can be canceled immediately when the target workers find it unreachable. For those workers who have requested segments from the offline worker, they can monitor the status of the target workers, and once they see the connection with the target worker is lost, they can mark it as offline, resend the request to another worker and stop pulling from the offline worker. If it is a false report due to the network fluctuation or the offline worker comes back, the offline flag can be removed as long as the communication is reestablished.
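The failover behavior for offline workers can be sketched as below. This is only an illustration of the idea: the function names, the use of Python's `ConnectionError` to signal an unreachable peer, and the `fake_fetch` stand-in for a network call are assumptions rather than details of Combo.

```python
def pull_with_failover(segment_id, candidates, fetch, offline):
    """Pull one segment, skipping offline peers and marking unreachable ones."""
    for peer in candidates:
        if peer in offline:
            continue
        try:
            values = fetch(peer, segment_id)
        except ConnectionError:
            # Connection lost: flag the peer and resend the request elsewhere.
            offline.add(peer)
            continue
        return peer, values
    raise RuntimeError(f"no reachable peer for segment {segment_id}")


def mark_reachable(peer, offline):
    """Clear the offline flag once communication with a peer is reestablished."""
    offline.discard(peer)


def fake_fetch(peer, segment_id):
    """Hypothetical transport: worker 'w2' is unreachable in this example."""
    if peer == "w2":
        raise ConnectionError(peer)
    return [0.5, 0.5]


offline = set()
peer, values = pull_with_failover(3, ["w2", "w5"], fake_fetch, offline)
# peer == "w5", offline == {"w2"}
mark_reachable("w2", offline)
# offline == set()
```

No central heartbeat is involved here: each worker learns about failures only through its own pulling requests.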

The participation of a new worker is relatively easy to handle. When a new worker comes to the federation, it first registers itself on the server and requests a worker list. Then it pulls the segments and aggregates them as normal only without its local model. With the aggregation result, it can start the training with its local dataset. When it sends the pulling requests to the target workers, the target worker adds the newcomer to the worker list. Since the new worker sends the pulling requests to many workers in a single iteration, its existence will be quickly noticed by all other workers.

4.3 Convergence Analysis

Generally, deep learning uses gradient descent algorithms to find the model parameters that minimize a user-defined loss function. For the loss function, we make the following assumptions.

Assumption 1.

(Loss function) is a convex function with bounded second derivative such that


In a centralized learning system, the model parameters are updated with the gradient calculated from the whole dataset. However, with the federated settings, the worker updates the model with the gradient of a subset of data and we denote it as . To capture the divergence of these two gradients, we make the next definition.

Definition 1.

(Gradient Divergence) For any worker and model parameter , We define as the upper bound of the divergence between local and global gradients.


For a worker in our proposed system, at iteration , the local model parameter is an aggregation result of the local model and a few mixed models rebuilt from segments. As a contrast, we denote as the aggregation result of all the nodes, which is the output of algorithm. Like the gradient divergence, we define aggregation divergence to measure the aggregation result.

Definition 2.

(Aggregation Divergence) For any worker at iteration , we define as the upper bound of the divergence between partial and global aggregation.


With the above assumption and definitions, we can present the convergence result of Combo.

Theorem 1.

Let denote the global optimum and denote the initial model parameters, worker performs gradient descent on local dataset for times with learning rate and then pull the segments to aggregate, the aggregation result is , the convergence upper bound of Combo is given by


where .

Due to space limitations, we will provide the detailed proof in the extended version. Note that this bound is characteristic of stochastic gradient descent bounds in that it converges to within a noise ball around the optimum rather than approaching it. The gap between the output and the optimum comes from two parts: the gradient divergence and the aggregation divergence. The gradient divergence is related to the data distribution of each worker, which is an inherent drawback of FL systems.

According to the above inequality, the influence of the gradient divergence is exacerbated when the communication interval increases. The aggregation divergence can be ameliorated by aggregating more models from other workers. This explains why we set the Model Replica hyper-parameter to control the number of model replicas received from others. If the worker pulls a replica from every other worker, it aggregates all the external models and the aggregation divergence decreases to zero. In this situation, Combo degrades to the all-reduce scheme and has the same training result as the centralized way. However, we argue that the number of model replicas could be much smaller while still maintaining the training efficiency, which is validated in the evaluation.

5 Performance Evaluation

5.1 Setup

We conduct simulation experiments to evaluate our design. The evaluation can be divided into two parts. First, the stateful and synchronous nature of Combo allows us to simulate the training process sequentially, while logically the training result is the same as in the parallelized way. The training traces of each worker are then recorded, which contain the validation accuracies, training iterations, and corresponding synchronization partners. Second, we simulate the network topology and feed it with the training traces to estimate the training time. The specific settings are listed as follows:

Training settings. We train a CNN model on CIFAR-10 dataset to evaluate the training ability of Combo. The dataset consists of 50,000 images for training and 10,000 for validation. The training data are randomly distributed among the workers without overlapping, and the validation data are shared among every worker. The CNN model is adopted from [14], which is considered to be suitable for CIFAR-10 dataset.

The models are trained on each worker using the SGD algorithm with the same hyper-parameters, namely a learning rate of 0.1 and a batch size of 128. Notice that we adopt a large learning rate simply to accelerate the training, and it does not affect the comparison results. The synchronization interval is set to 40, which means every worker performs 40 SGD updates on the local model before it communicates with others.

Network settings. We simulate a fully connected network topology among the workers. The maximum bandwidth limit of each worker is set to be 100Mbps. Moreover, to simulate the bottleneck of WAN, we set 10Mbps as the available bandwidth between two workers.

Comparison settings. We compare Combo with (1) traditional federated learning system with FedAvg algorithm in which all the workers participate and the server is randomly selected from them, and (2) naive gossip approach without segmentation. To make them comparable, they are all simulated within the same network topology.

The communication behavior of Combo is controlled by two parameters: the number of model segments $S$ and the number of model replicas $R$. In the following experiments, we set $S = 10$ and $R = 2$ by default; that is, in the synchronization phase, the model parameters are flattened and then divided equally into ten segments, and for each segment the worker requests two replicas from other workers. The gossip approach is the special case of Combo with $S = 1$, and it shares the same value of $R$ with Combo.

Performance Metrics. The learning performance is measured by the convergence speed. We record the top-1 validation accuracies of the aggregated models at each round and then align the accuracies to the corresponding times. The time is acquired from our network simulator where the local update time is referenced from the real machine time of training with a GTX 1080 Ti graphics card, and the communication time is calculated according to the bandwidth limitations.

5.2 Experiment Result

We first evaluate the convergence speed and scalability of Combo in comparison to the other approaches, then we explore the advantages and disadvantages of the design of model segments and replicas and how they affect the training performance.

5.2.1 Convergence Speed

We present the whole training process over time. As illustrated in Fig. 3(a), Combo exhibits an apparent speedup in convergence without affecting the final validation accuracy. We also explore the scalability of these three methods by comparing the training time needed to reach a predefined accuracy goal with the number of workers varying among 20, 30, and 40. According to Fig. 3(a), the model converges at around 82% validation accuracy. Since the aim is not to achieve the best accuracy, and practically speaking it is not worth spending too much time for only a 1% or 2% accuracy gain, we set 80% as the accuracy goal.

As illustrated in Fig.3(b), Combo requires the least training time to reach the given accuracy within all three cases and compared with FedAvg algorithm, the speedup of Combo increases from to with the expansion of scale. This phenomenon indicates that the decentralized federated learning is more scalable than the centralized way within a peer-to-peer network.

(a) Convergence with 30 workers (b) Time to reach 80% accuracy
Figure 3: Convergence speed
(a) Convergence with model segments (b) Sync time comparison
Figure 4: Benefit of model segments
(a) Convergence with model replicas (b) Time to reach 80% accuracy
Figure 5: Impact of model replicas

5.2.2 Benefit of Model Segments

The speedup of decentralized approaches comes from the removal of the bottleneck of the centralized server, and the advantage of Combo comes from the benefit of model segments. We train the model with 30 workers, fix the number of model replicas and vary the number of model segments from 1 to 10 to investigate how model segments affect the training performance.

Compared with the naive gossip solution, Combo aggregates mixed model parameters made up of multiple segments instead of the complete model. A potential concern is that the result may suffer degradation as the aggregation target is mottled and loses integrity. However, Fig. 4(a) shows that the accuracy of the aggregated results at each synchronization iteration is not affected by the model segments at all. Partitioning the model into ten segments has the same convergence trend as training without partition.

While the model segments do not affect the accuracy at each iteration, the synchronization time is significantly reduced. As illustrated in Fig. 4(b), simply splitting the model parameters into two segments reduces the synchronization time by half. This is because, with two segments, the original transmission quantity is divided into two parts and fed into more links. When the bandwidth is not exhausted, the sending and receiving time can be reduced almost proportionally. However, once the bandwidth is fully exploited, increasing the number of segments no longer improves the time consumption.
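The halving and the later saturation can be reproduced with a deliberately simplified bandwidth model, sketched below in Python. It assumes that every pulled segment uses its own link, that each link delivers the 10 Mbps stated in the network settings, and that the receiving worker is capped at 100 Mbps; the model size of 100 Mbit and the choice of two replicas in the example are hypothetical, and sender-side contention and latency are ignored.

```python
def sync_time_seconds(model_mbits, num_segments, num_replicas,
                      link_mbps=10.0, worker_cap_mbps=100.0):
    """Estimated time to pull num_replicas mixed models over parallel links."""
    # One link per pulled segment (assumes all requests go to distinct peers).
    links = num_segments * num_replicas
    # Aggregate throughput grows with the link count until the worker cap is hit.
    throughput = min(links * link_mbps, worker_cap_mbps)
    # Each replica has the size of one full model, however it is segmented.
    total_mbits = model_mbits * num_replicas
    return total_mbits / throughput


# Hypothetical 100 Mbit model pulled with two replicas.
times = {s: sync_time_seconds(100.0, s, 2) for s in (1, 2, 5, 10)}
# times == {1: 10.0, 2: 5.0, 5: 2.0, 10: 2.0}
```

Under these assumptions, going from one to two segments halves the estimated time, while going from five to ten segments changes nothing because the worker's own bandwidth limit has already been reached.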

5.2.3 Impact of Model Replicas

Next, we evaluate the impact of the model replica, which controls the overall information quantity that the workers send and receive at each synchronization iteration. Similar to the previous settings, we fix the number of model segments and vary the number of model replicas from 1 to 16.

As we discussed in the convergence analysis of Combo, the more information a worker receives, the better aggregation result it will get. When the worker receives all the model replicas from other peers, Combo becomes the All-reduce structure and achieves the same training result as the centralized approach. The analysis is validated by Fig.5(a) that with the increase of the number of model replicas, the accuracy of each iteration becomes better. However, the improvement is not unlimited. We can see that there is no significant gap between and in the convergence trend and result. This reflects the redundancy of All-reduce structure that the worker doesn’t have to collect all the external models to train a high-quality model.

However, as the bandwidth of worker is fully utilized with model segments, increasing leads to the proportional growth of the transmission workload. Thus there exists a tradeoff, a larger increases the convergence rate on synchronization iterations but also the synchronization time. We compare the training time needed to reach 80% validation accuracy with different as shown in Fig.5(b). Increasing to leads to a rapid reduction of the required training time as it drastically reduces the iterations needed to achieve target accuracy goal, which is also illustrated in Fig.5(a). However, if we continue to increase , the growth of the training time exceeds the reduction of the iterations and slows down the convergence speed.

6 Conclusion

One of the most challenging problems of federated learning is the poor network connection, as the workers are geo-distributed and connected over slow WAN links. To avoid the high risk of network congestion in the centralized parameter server architecture adopted in today’s FL systems, we explore a decentralized FL solution called Combo. Building on the insight that the peer-to-peer bandwidth is much smaller than the worker’s maximum network capacity, Combo fully utilizes the bandwidth by saturating the network with segmented gossip aggregation. The experiments show that Combo significantly reduces the training time while retaining good convergence performance.


This work is supported in part by NSFC under Grant No. 61872215 and 61531006, SZSTI under Grant No. JCYJ20180306174057899, and Shenzhen Nanshan District Ling-Hang Team Grant under No. LHTD20170005.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. operating systems design and implementation, pp. 265–283. Cited by: §2.
  • [2] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford (2014) A reliable effective terascale linear learning system. Journal of Machine Learning Research 15 (1), pp. 1111–1133. Cited by: §2.
  • [3] R. Baraglia, P. Dazzi, M. Mordacchini, and L. Ricci (2013) A peer-to-peer recommender system for self-emerging user communities based on gossip overlays. Journal of Computer and System Sciences 79 (2), pp. 291–308. Cited by: §2.
  • [4] M. Blot, D. Picard, M. Cord, and N. Thome (2016) Gossip training for deep learning. arXiv: Computer Vision and Pattern Recognition. Cited by: §2.
  • [5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. (2019) Towards federated learning at scale: system design. arXiv preprint arXiv:1902.01046. Cited by: §2.
  • [6] J. A. Daily, A. Vishnu, C. Siegel, T. Warfel, and V. C. Amatya (2018) GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent.. arXiv: Distributed, Parallel, and Cluster Computing. Cited by: §1, §2, §3.0.2.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv: Computation and Language. Cited by: §1, §1.
  • [8] Z. J. Haas, J. Y. Halpern, and L. Li (2002) Gossip-based ad hoc routing. international conference on computer communications 3, pp. 1707–1716. Cited by: §1, §2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [10] J. Konecny, H. B. Mcmahan, and D. Ramage (2015) Federated optimization: distributed optimization beyond the datacenter. arXiv: Learning. Cited by: §2.
  • [11] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §2, §2.
  • [12] H. Li, A. Kadav, E. Kruus, and C. Ungureanu (2015) MALT: distributed data-parallelism for existing ml applications. pp. 3. Cited by: §2.
  • [13] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. OSDI’14 Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation, pp. 583–598. Cited by: §2.
  • [14] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas (2017) Communication-efficient learning of deep networks from decentralized data. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §1, §2, §5.1.
  • [15] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan (2016) SparkNet: training deep networks in spark. international conference on learning representations. Cited by: §2.
  • [16] P. Patarasuk and X. Yuan (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69 (2), pp. 117–124. Cited by: §1.
  • [17] A. Vulimiri, C. Curino, B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese (2015) Global analytics in the face of bandwidth and regulatory constraints. pp. 323–336. Cited by: §1.
  • [18] P. Watcharapichat, V. L. Morales, R. Fernandez, and P. R. Pietzuch (2016) Ako: decentralised deep learning with partial gradient exchange. pp. 84–97. Cited by: §2.