Performance Analysis and Comparison of Distributed Machine Learning Systems

09/04/2019 ∙ by Salem Alqahtani, et al. ∙ University at Buffalo

Deep learning has permeated many aspects of computing and processing systems in recent years. While distributed training architectures and frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on computation cycle time and scalability. To analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures via experiments performed with the TensorFlow and Horovod frameworks. We found that the system architecture has a very significant effect on training performance. RA-based systems achieve scalable performance because they decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server. While P2P systems fare better than 1PS systems, they still suffer from a significant network bottleneck. Finally, RA systems also excel by overlapping computation time and communication time, which the PS and P2P architectures fail to achieve.




I Introduction

Deep Neural Networks (DNNs) have dramatically improved the state of the art for many problems that the machine learning (ML) and artificial intelligence (AI) communities have dealt with for decades, including speech recognition, machine translation, object identification, self-driving cars, and healthcare record analytics and diagnostics. DNN training is hungry for big data and high computation power, and even high-end machines are inadequate to respond to this demand [34]. Thus, distributed DNN training architectures and frameworks that utilize a cluster of computers have quickly become popular in recent years [1, 28]. As these distributed training frameworks need to coordinate the nodes in the cluster efficiently for sharing states, parameters, and gradients, they are confronted with many challenges in terms of consistency, fault tolerance, communication overhead, and resource management [35].

Three distributed training architectures have emerged for DNN training. The parameter server (PS) architecture uses a number of parameter servers that coordinate and synchronize the model updates made by a number of workers. The workers pull the model from the parameter servers, compute on the DNN, and then send the computed gradients to the parameter servers. In the peer-to-peer (P2P) model, worker and server processes coexist on the same machine. The worker process pulls the model locally from the server process on the same machine, computes on the DNN, and sends the computed gradients to every other machine. Finally, in the Ring allreduce (RA) model, there is only a server process on every machine. The server reads the model from its buffer, computes on the DNN, and sends the computed gradients to its neighbor in the ring.

Developers need to choose among these architectures and configure the framework with the number of workers and servers, depending on the workload and the available network and computing infrastructure. While there is a lot of anecdotal evidence that different architectures and different configurations lead to drastically different performance [4], there has not been a systematic study of the communication bottlenecks of these architectures and their effects on training performance.

In this paper, we investigate this problem. Since synchronous Stochastic Gradient Descent (SGD) works best for DNN training [6], we make it our focus. We take a two-pronged approach and investigate the three architectures both analytically and empirically.

For the analytical assessment of the PS, P2P, and RA architectures, we develop models for latency (the total time for training one epoch), which includes computing time and communication time. The computing time is the time spent computing the DNN, and the communication time is the time spent sending the training result to a server or servers. Knowing both times is essential to understand the behavior of these systems and to optimize the overlap between them. Our model analysis shows that communication, rather than computation, is often the dominant part of the training process, and the models are able to rank the network use and congestion of the three architectures.

To complement and corroborate our analytical models with quantitative results, we evaluate the computation and communication performance of these systems via experiments performed with the TensorFlow and Horovod frameworks. To measure the convergence speed, we provide a quantitative evaluation. We perform experiments with the PS, P2P, and RA architectures and compare the measurements to our model results. More specifically, we measure the throughput (the number of training samples processed per second) and latency for large-scale ML systems with TensorFlow [1] and Horovod [28]. We choose TensorFlow to take advantage of its high usability and high abstraction level for operations and devices. We use the Horovod library to take advantage of MPI features and its integration with TensorFlow. The dataset we feed to our models is the MNIST handwritten digits dataset, which is widely used in research.

Our results show that RA achieves high throughput and low latency compared to PS and P2P systems. This is because in RA the available network bandwidth between worker nodes is constant, whereas in the PS and P2P systems the bandwidth is a resource shared among all worker nodes. We also find that the RA system achieves a higher overlap between computation time and communication time than the PS and P2P systems. Finally, we find that P2P and PS systems have load imbalance among peers because the tensor size differs across DNN layers.

Outline of the rest of the paper. We give background on DNNs in Section II. In Section III, we develop performance models for distributed training with the PS, P2P, and RA architectures. We evaluate the performance of these three architectures in Section IV. In Section V, we summarize related work, and we conclude the paper in Section VI.

II Background

In this section, we explain the neural network training process on a single computer node, and then describe distributed neural network training on multiple nodes.

II-A Artificial Neural Network

Artificial neural networks are computing systems for processing complex data input for many ML algorithms [31]. Here, our focus is on multi-layer neural networks, shown in Figure 1, which are sets of connected input/output computation units where each connection has a weight associated with it. During the training phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples. The basic neural network architecture consists of an input layer, hidden layers, and an output layer. The input layer reads the input data instances, while the output layer holds and displays the results of the neural network. Each set of neurons is grouped in a single layer. A single neuron is a computational unit for the input and weight values that it receives from the previous layer. Modern deep neural network architectures aim to train on very large datasets with a huge number of parameters in order to improve performance for many real-world applications.

To compute a DNN, the first step is to feed the network, with its weighted edges, a dataset that consists of data instances and their labels. This step is formally called the feedforward pass, in which the data move from one layer to the next without forming a cycle. The data instances form the input at the input layer. Each neuron then sums the products of its inputs and their weights and adds a bias term.


Every neuron has an activation function. The total value passes through this non-linear activation function: if the total value is above the threshold, the neuron fires; otherwise it does not.
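As a minimal sketch of the neuron computation described above, the following Python snippet sums the products of inputs and weights, adds a bias, and applies an activation. The sigmoid activation, the function names, and the numeric values are illustrative assumptions and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Logistic activation; chosen here only as an illustrative non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_forward(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def layer_forward(inputs, weight_rows, biases):
    """One fully connected layer: one neuron per (weight row, bias) pair."""
    return [neuron_forward(inputs, row, b) for row, b in zip(weight_rows, biases)]


# Hypothetical values for illustration only.
activations = layer_forward([1.0, 2.0], [[0.5, -0.25], [0.0, 0.0]], [0.0, 0.0])
```

Stacking several calls to `layer_forward`, each fed with the previous layer's activations, gives the feedforward pass of a multi-layer network.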


Then, we compare the predicted output value with the ground truth in the training data and measure the error using a loss function [15]. The loss function is an objective function that should be minimized until the model converges.


After computing the feedforward pass and the loss function, the neural network uses the back-propagation algorithm [26] to train the model parameters. Back-propagation computes the gradients by propagating the error (the difference between the target and actual output values) back to every individual neuron.

Fig. 1: Deep Neural Networks

II-B Distributed Neural Network Training

In recent years, advances in hardware, training methods, and network architectures have enabled distributed training, which reduces the training time of DNNs. Instead of being restricted to a single machine, we can now scale to as many resources as required. In this paper, we choose data-parallel distributed training for our experiments, but we also explain model-parallel distributed training. Many practitioners avoid model parallelism because of the network overhead created by distributing layers across many machines. Finally, data-parallel models that learn over large amounts of training data are more common than models with billions of parameters [21].

II-B1 Model Parallelism

The model parallelism scheme is shown in Figure 2. The parallelization mechanism here is to split the model parameters among many nodes. Each node is responsible for the computation tasks of a different part of the network. After finishing its local computation, a node communicates its neuron activations to other machines. The limitations of model-parallel training are that models are difficult to partition, because each model has its own characteristics, and that communication latency between devices is high. It is rarely used in real-world applications because good performance is hard to achieve, but it is preferable when a single node is not sufficient to store all the model parameters and the model is computationally expensive.

Fig. 2: Model Parallelism.
Fig. 3: Data Parallelism.

II-B2 Data Parallelism

In the data parallelism scheme, shown in Figure 3, each worker machine creates a complete computation graph and typically communicates gradients with a model parameter holder such as a PS. The data parallelism scheme is used extensively in many applications due to its simplicity. The data samples are partitioned and assigned across all computation nodes (e.g., GPUs or TPUs). This contrasts with model parallelism, which uses the same data for every worker machine but partitions the model among the worker machines. Each node holds a subset of the dataset, computes independently of every other node, and synchronizes its computation results. Mini-batch SGD is a common approach and shows great performance in many models. Mini-batch SGD makes each update on a subset of the dataset rather than on the entire dataset at each iteration. Each worker trains on different data samples and exchanges its outputs over the network with the other replicas in the system to update the model until it reaches consensus. Data parallelism adjusts the weight values using the widely used gradient descent algorithm [5], combining results and synchronizing the model parameters between workers.
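The data-parallel update described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: a one-parameter linear model, a squared-error loss, equally sized shards, and hypothetical data and function names; none of these come from the paper's experiments.

```python
def gradient(w, samples):
    """Gradient of the mean squared error for the 1-D linear model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def split(batch, num_workers):
    """Partition a mini-batch into equal shards (assumes an even division)."""
    shard = len(batch) // num_workers
    return [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]


def data_parallel_step(w, batch, num_workers, lr):
    """One synchronous step: local gradients are averaged, then applied once."""
    local_grads = [gradient(w, shard) for shard in split(batch, num_workers)]
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


# Hypothetical toy data generated by y = 3x.
batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, num_workers=2, lr=0.01)
```

With equally sized shards, averaging the per-worker gradients yields the same update as computing the gradient over the whole mini-batch on one machine, which is why every replica stays consistent after each step.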


The merits of data parallelism are that it increases system throughput through distributed parallel computing and that it handles high data volumes. However, this approach is limited by the available optimization algorithms and hardware.

II-B3 Bulk Synchronous Parallel

In distributed computing systems, computing nodes differ in computing power because of the real-world environment. For this reason, distributed ML training uses the iteration as the unit for coordinating synchronization between all compute nodes [36]. In the synchronous update known as bulk synchronous parallel (BSP) [30], the replicas submit their gradients to the global model parameters or to other replicas after local training at every iteration or mini-batch. Then, a synchronization barrier stops each node from training the next iteration until the global model has received the results of all other active workers. The downside of this approach is that the training time is dominated by the stragglers and each iteration requires a lot of communication. Also, the workers must enter the synchronization barrier, and exiting it takes a non-trivial amount of time [33]. However, a synchronous algorithm converges relatively faster than asynchronous training. The reason is that there are no stale gradients: in each iteration the gradients are collected from all replicas, and the updated model is used in the next iteration.
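The effect of stragglers under a barrier can be illustrated with a small cost sketch. It assumes that every iteration costs the compute time of the slowest worker plus a fixed, non-overlapped communication time; the function names and the timings are hypothetical.

```python
def bsp_iteration_time(worker_compute_times, comm_time):
    """With a barrier, one iteration lasts as long as the slowest worker."""
    return max(worker_compute_times) + comm_time


def bsp_epoch_time(worker_compute_times, comm_time, iterations):
    """Epoch time when every iteration ends at a synchronization barrier."""
    return iterations * bsp_iteration_time(worker_compute_times, comm_time)


# Hypothetical per-iteration compute times in seconds; the third worker straggles.
balanced = bsp_epoch_time([0.10, 0.10, 0.10], comm_time=0.05, iterations=100)
straggler = bsp_epoch_time([0.10, 0.10, 0.30], comm_time=0.05, iterations=100)
```

In this toy setting a single slow worker more than doubles the epoch time, even though the other two workers are unchanged.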

II-B4 Stale Synchronous Parallel

In stale synchronous parallel (SSP) [18], the replicas execute their local iterations and proceed to the next iteration without a synchronization barrier. When the faster machines are ahead of the slower machines by a threshold of S iterations, all nodes enter a synchronization barrier, which allows the other machines to catch up. All gradients in a given mini-batch are computed and sent to the global parameter model. Then, replicas pull new model parameters, which may be stale because other replicas have not yet sent their updates from previous iterations. For example, the gradients of some replicas are calculated from stale parameter copies that belong to previous iterations. The staleness of the global model parameters is bounded by a fixed number of iterations, which reduces the synchronization overhead. Interleaving computation with communication is the greatest benefit of SSP. The downside is that the algorithm has a slower convergence rate.
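The staleness threshold can be sketched as a simple check on per-replica iteration counters. The code below is a simplified simulation and not an implementation from [18]: the function names are hypothetical, replicas advance by a fixed number of iterations per tick, and a barrier is modeled by letting every replica catch up to the fastest one.

```python
def needs_barrier(clocks, staleness):
    """True when the fastest replica leads the slowest by at least `staleness`."""
    return max(clocks) - min(clocks) >= staleness


def run_ssp(speeds, staleness, ticks):
    """Advance replicas by their per-tick speeds; synchronize when the bound is hit."""
    clocks = [0] * len(speeds)
    barriers = 0
    for _ in range(ticks):
        if needs_barrier(clocks, staleness):
            # Simplification: at the barrier, slower replicas catch up to the fastest.
            clocks = [max(clocks)] * len(clocks)
            barriers += 1
        else:
            clocks = [c + s for c, s in zip(clocks, speeds)]
    return clocks, barriers


# Hypothetical setup: one replica runs twice as fast as the other, S = 3.
final_clocks, barrier_count = run_ssp([2, 1], staleness=3, ticks=5)
```

A larger threshold means fewer barriers but staler parameters, which is the trade-off between synchronization overhead and convergence rate noted above.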

III Performance Modeling

To model the behavior of the system and to estimate its performance, we present a performance model that captures the computation time and communication latency under varying system configurations. We deal with the systems at a high level of abstraction because of their complexity. Large-scale platforms such as TensorFlow and Spark have different underlying designs [37], which makes it difficult to design a performance model at a low level. Our model approximates both the computation and the communication runtime for a single epoch (a single pass through the full training set) of training a DNN with mini-batch SGD. In this work, we present a performance model that is simple yet accurate enough to calculate computation time and communication latency without intensive log data collection.

Our results have two network indicators: latency (the time it takes to send a message from point A to point B) and throughput (the amount of data processed per time unit). These indicators differ from one system design to another. Modeling network latency involves two factors: the download time, in which workers receive data from the server, and the upload time, in which workers send gradients to the servers. The parameters of the performance models for PS, P2P, and RA are listed in Table I.

Notation Meaning
Machine Learning Notation
weight variables at layer L
bias variables i at layer L
forward Pass
activation function
ground truth
loss function
Number of training examples
learning rate
batch size
iteration number i
m mini-batch size
W model size
Distributed Systems Notation
PS Parameter Servers
w number of workers
B total bandwidth
TABLE I: Performance model notation table

III-A Distributed Training with PS System

The PS architecture, shown in Figure 4, was introduced in [29] and was followed by second and third generations [12, 22]. The PS system was built to solve the elastic scalability, communication, and flexible consistency problems of distributed ML [23]. It is a key-value store dedicated to storing variables and does not conduct any computation task. A PS setup might consist of one PS node (1PS) or many PS nodes, each of which maintains a subset of the ML parameters (weights and biases). The PS adopts one-to-all and all-to-one collective communication for exchanging the gradients and the model between servers and workers, using mechanisms such as Google's gRPC protocol, the default TCP-based communication protocol of the TensorFlow framework [13], as illustrated in Figure 4.

Often, the PS system uses the parallelism technique called data parallelism, described in Section II-B2, where the training dataset is split into small batches called mini-batches that are used to calculate the model error and update the model parameters.

The dataset and workload are divided equally among all active worker nodes in the system. The PS starts by broadcasting the model to the workers. Each worker performs the neural network computations by reading its own split of the mini-batch and computing its own gradients. Workers communicate with all PSs to send their training results. The PS incorporates the gradients from all nodes and updates the stored model parameters [19]. Many systems similar to the PS have adopted key-value store interfaces [11]. The worker nodes use the key-value store API to pull the most recent parameters from the PS (e.g., pull()) and to push the gradients to the PS (e.g., push()). The PS design extends the single server to more than one server to balance load and to reduce communication bottlenecks.
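The pull/push interaction can be sketched with a toy, in-memory parameter server. This is an illustration under stated assumptions and not the API of TensorFlow or of any PS system cited here: the class name, the synchronous averaging rule, and the learning rate are hypothetical, and there is no networking.

```python
class ParameterServer:
    """Toy in-memory key-value store for model parameters (no networking)."""

    def __init__(self, params, num_workers, lr):
        self.params = dict(params)
        self.num_workers = num_workers
        self.lr = lr
        self._pending = []  # gradients received in the current iteration

    def pull(self):
        """Return a copy of the most recent parameters."""
        return dict(self.params)

    def push(self, grads):
        """Buffer one worker's gradients; update once all workers have pushed."""
        self._pending.append(grads)
        if len(self._pending) == self.num_workers:
            for key in self.params:
                avg = sum(g[key] for g in self._pending) / self.num_workers
                self.params[key] -= self.lr * avg
            self._pending = []


# Hypothetical parameters and gradients for two workers.
ps = ParameterServer({"w": 1.0, "b": 0.0}, num_workers=2, lr=0.5)
before = ps.pull()
ps.push({"w": 0.2, "b": 0.0})
ps.push({"w": 0.6, "b": 0.4})
after = ps.pull()
```

Sharding `params` by key across several such objects would correspond to the multi-PS setups (2PS, 4PS) evaluated later, with each server holding a subset of the parameters.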


The above formula shows the computation time for training DNNs, including feedforward time, loss function time, and back-propagation time. The most well-known ML frameworks that adopted this architecture are TensorFlow from Google and MXNet from Amazon.

With a synchronous model, PS training proceeds at the speed of the slowest machine in any iteration. An asynchronous model overcomes the straggler nodes that degrade the training speed, but it may affect the overall model accuracy. To choose between these communication models, developers and researchers trade off accuracy and speed depending on their applications [10]. The downside of the PS is that it becomes a communication bottleneck, which slows down the training process and limits system scalability at very large scale. Removing the central server bottleneck in asynchronous distributed learning systems while maintaining the best possible convergence rate is the optimal design solution [10, 9, 8, 2]. Formula 8 shows how much bandwidth is dedicated to each worker node for communicating with the PS.

Fig. 4: PS Architecture.

Notice that the formula relates three quantities: the available bandwidth between every client and the PS, the total bandwidth B available for all clients, and the number w of active workers that communicate with the PS. The number of workers should not be less than the number of PS nodes, so that the bandwidth is divided evenly among workers.
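Read together with Table I, one plausible form of this relation for a single PS is an even split of the total bandwidth among the active workers. The symbol B_w for the per-worker bandwidth is an assumed name; B and w are the total bandwidth and the number of workers from Table I.

```latex
% Assumed reconstruction for the single-PS case; B_w is a hypothetical symbol.
B_w = \frac{B}{w}
```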


Here, the pull time is defined as a delay function for pulling the weight values through the communication link, W is the model size, and the remaining term is the number of times that workers pull the model from the PSs in a single epoch. The workers compute the gradients and push the results to the PS, which aggregates the gradients after the majority of the nodes have communicated their gradients; the workers then pull the new result for the next iteration.


Most frameworks, such as TensorFlow and MXNet, parallelize the gradient aggregation of the current layer of the neural network with the gradient computation of the previous layer. This optimization hides the gradient communication overhead. In the above formula, we calculate the time that workers take to push the gradients to the PS; we compute the gradient size as the model size divided by the number of workers.
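The pull and push delays described in this section can be sketched as simple size-over-bandwidth estimates. This is a reading of the prose and not the paper's exact formulas: it assumes that the delay is the transferred size divided by the per-worker bandwidth, that the pull moves the full model, and that the pushed gradient size is the model size divided by the number of workers, as stated above. The function names and all numbers are hypothetical.

```python
def pull_time(model_size, bandwidth_per_worker, pulls_per_epoch):
    """Time one worker spends pulling the full model during one epoch."""
    return pulls_per_epoch * model_size / bandwidth_per_worker


def push_time(model_size, num_workers, bandwidth_per_worker, pushes_per_epoch):
    """Time to push gradients, sized as model_size / num_workers per the text."""
    gradient_size = model_size / num_workers
    return pushes_per_epoch * gradient_size / bandwidth_per_worker


# Hypothetical numbers: 4 MB model, 1000 MB/s total bandwidth shared by 4 workers.
bandwidth = 1000.0 / 4
t_pull = pull_time(4.0, bandwidth, pulls_per_epoch=150)
t_push = push_time(4.0, 4, bandwidth, pushes_per_epoch=150)
```

Under these assumptions the per-worker bandwidth shrinks as workers are added, so the pull time grows with the number of workers, which is consistent with the congestion at the PS discussed in this section.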


The above formulas approximate the computation time of the neural network, which is a variable learned and extracted from experiments, because each dataset has a different number of features and therefore a different computation cost. The update time is the time that the PS takes to update the model with new results from the workers. This time is treated as a constant in our analysis because the number of features varies across datasets. The communication runtime in this model has linear complexity and is often overlapped with computation time. Basically, we have three building blocks: computing time, communication time, and synchronization time. Our performance model for a single PS is compared to the actual runtime in Figure 5. If there is more than one PS, each one maintains a portion of the globally shared parameters, and the PSs communicate with each other to replicate and migrate parameters for reliability and scaling. The expected runtime for 2PS is shown in Figure 8. We show the system throughput for one and two PSs in Figures 6 and 9. The ideal throughput in distributed training increases linearly with the number of worker nodes.

Fig. 5: Estimated epoch time for 1PS.
Fig. 6: Measured training throughput of 1PS.
Fig. 7: Epoch time for 1PS.
Fig. 8: Estimated epoch time for 2PS.
Fig. 9: Measured training throughput of 2PS.
Fig. 10: Epoch time for 2PS.

The formula below calculates the ideal number of samples per second with respect to the number of workers.


In Figure 11, we compare the ideal samples per second with the actual system throughput measured in our experiments. The time term in the formula denotes the single training processing time.
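The linear ideal can be sketched as follows. The sketch assumes that each worker processes one batch per training step and that the ideal throughput is the single-worker rate multiplied by the number of workers; the function names, the step time, and the measured value are hypothetical, and only the batch size of 100 matches the experiments.

```python
def ideal_throughput(num_workers, batch_size, step_time):
    """Ideal samples per second if throughput scaled linearly with workers."""
    return num_workers * batch_size / step_time


def scaling_efficiency(measured, ideal):
    """Fraction of the ideal throughput that a system actually achieves."""
    return measured / ideal


# Hypothetical step time of 0.04 s per batch and a hypothetical measurement.
ideal = ideal_throughput(num_workers=5, batch_size=100, step_time=0.04)
efficiency = scaling_efficiency(measured=10000.0, ideal=ideal)
```

An efficiency below 1 indicates the gap between the ideal and the actual system throughput that Figure 11 visualizes.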

Fig. 11: Comparison of Ideal Throughput with Actual System Throughput for RA, P2P, and PS.

III-B Distributed Training with P2P System

In this system design, shown in Figure 12, every node that joins the system is a peer. Peers connect to one another and provide the functionality of both saving model parameters and training the neural network. The first P2P data-parallel system library for large-scale ML problems was introduced in [21]. At a high level, both client and server reside on the same machine, which allows replicas to send model updates to one another instead of to a central PS. Initially, all nodes obtain the same model and a subset of the dataset. Each client node calculates the feedforward and back-propagation passes with mini-batch SGD. At the end of each iteration, the workers push a subset of model parameter updates to the parallel model replicas to ensure that each model receives the most recent updates from the other nodes. The total epoch time, which consists of the communication delay of sending gradients to all other peers, the time to receive the model from the same machine, and the computation time, is shown in Figure 14. In Figure 15, we show the total system throughput, that is, the amount of data the nodes process per time unit. In Figure 13, we compare the performance model with the actual running time. One advantage of the P2P model is software simplicity: developers write a single program and distribute it to all active machines. However, this approach is limited by the optimization algorithms and the available hardware. The sizes of reading model parameters and writing gradients differ because workers read the whole model from the same machine, while on writes they update a subset of the model over the network. Recently, most DNN frameworks overlap computation time with gradient updates.

Fig. 12: P2P based Architecture.
Fig. 13: Estimated epoch time for P2P system.
Fig. 14: Epoch time of P2P system.
Fig. 15: Measured training throughput of P2P system.

Here, we are interested not in the time of one iteration but in the epoch time. The bandwidth term is the bandwidth between a peer and the other peers. In this model, we noticed that gradient communication does not perfectly overlap with computation time and has high overhead. In every iteration, each server sends and receives messages, and the size of each message corresponds to the subset of the model that the worker updates over the network.
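The all-to-all exchange between peers can be sketched with a small simulation. It is deliberately simplified and rests on assumptions: each peer holds a single scalar gradient and sends it in full to every other peer, whereas the system described here sends subsets of the model; the function name and the values are hypothetical.

```python
def p2p_exchange(local_grads):
    """Every peer receives every other peer's gradient and averages all of them."""
    n = len(local_grads)
    inboxes = [[] for _ in range(n)]
    messages = 0
    for src in range(n):
        for dst in range(n):
            if src != dst:
                inboxes[dst].append(local_grads[src])
                messages += 1
    averaged = [(local_grads[i] + sum(inboxes[i])) / n for i in range(n)]
    return averaged, messages


# Hypothetical scalar gradients held by three peers.
averaged, messages = p2p_exchange([1.0, 2.0, 3.0])
```

In this sketch the number of messages per iteration is n(n - 1), so it grows quadratically with the number of peers, which illustrates why the network remains a bottleneck even without a central server.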

III-C Distributed Training with Ring-allreduce

This system architecture, shown in Figure 16, was first introduced in [25]. Uber adopted Baidu's RA algorithm [3] in Horovod [28], a distributed training framework for TensorFlow. In this architecture, there is no central server that holds the model and aggregates gradients from workers as in the PS architecture. Instead, each worker reads its own subset of the data, calculates its gradients, sends them to its successor node on the ring, and receives gradients from its predecessor node on the ring, until all workers have the same values. Based on the collected log information, there are many types of communication operations: negotiate broadcast, broadcast, MPI broadcast, allreduce, MPI allreduce, and negotiate allreduce. In addition, MEMCPY IN FUSION BUFFER and MEMCPY OUT FUSION BUFFER copy data into and out of the fusion buffer. Each tensor broadcast or reduction in Horovod involves two major phases. The first is the negotiation phase, in which all workers signal to rank 0 that they are ready to broadcast or reduce the given tensor; rank 0 then signals the other workers to start broadcasting or reducing the tensor. The second is the processing phase, in which the gradients are computed after data loading and preprocessing. Both allreduce and MPI allreduce are used to average the gradients to a single value. Inter-GPU and inter-CPU communication, whether on a single network node or across network nodes, is built on top of MPI in implementations of parallel and distributed deep learning training, and it benefits from MPI-related optimizations such as those in Open MPI [16]. The RA algorithm allows worker nodes to average gradients and send them to all nodes without the need for a PS. The aim of the ring reduction operation is to reduce the communication overhead caused by all-to-one or one-to-all collective communication [27].

Horovod also requires fewer lines of code for distributed training and offers better scalability than the well-known PS systems. RA uses the distributed data parallelism scheme (see Section II-B2): every node in the system has a subset of the data. The RA algorithm [25] is bandwidth-optimal. Technically, each of the N nodes communicates with two of its peers 2(N-1) times. During this communication, a node sends and receives chunks of the data buffer. Every node starts with a local value and ends up with an aggregate global result. In the first N-1 iterations, each node in the ring sends gradients to its successor and receives gradients from its predecessor, followed by a reduction operation that adds the received values to the values in the node's buffer. In the second N-1 iterations, every node starts with a fully reduced block of the data, and the all-gather operation transmits these final aggregated blocks to every other node. The bandwidth is optimal given a buffer large enough to store the received messages [25]. RA scales independently of the number of nodes, as we find in our experiments (Figure 18). However, RA is limited by the slowest directed communication link between neighbors and by the available network bandwidth. The latency in the experiments shows near-steady bandwidth usage, as illustrated in Figure 17. RA overlaps the computation of gradients at lower layers of a deep neural network with the transmission of gradients at higher layers, which reduces training time.
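The two phases of the ring algorithm can be sketched as a single-process simulation. This is an illustrative sketch and not Horovod's or Baidu's implementation: the nodes are plain Python lists, the function name is hypothetical, and the buffer length is assumed to be divisible by the number of nodes.

```python
def ring_allreduce(buffers):
    """Sum equal-length buffers across n simulated nodes arranged in a ring."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("sketch assumes the buffer splits into n equal chunks")
    chunk = size // n
    data = [list(b) for b in buffers]

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (scatter-reduce): n - 1 steps. Node i sends chunk (i - step) mod n
    # to its successor, which adds the received values to its own buffer.
    for step in range(n - 1):
        sends = [((i - step) % n, i) for i in range(n)]
        payloads = [[data[i][k] for k in span(c)] for c, i in sends]
        for (c, i), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for k, value in zip(span(c), payload):
                data[dst][k] += value

    # Phase 2 (all-gather): n - 1 steps. Node i forwards a fully reduced chunk
    # to its successor, which overwrites its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, i) for i in range(n)]
        payloads = [[data[i][k] for k in span(c)] for c, i in sends]
        for (c, i), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for k, value in zip(span(c), payload):
                data[dst][k] = value
    return data


# Hypothetical gradients held by three nodes.
result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each node sends only one chunk per step for 2(n - 1) steps, so the amount of data a node transmits stays roughly constant as nodes are added. Dividing every entry of the result by n would turn the sum into the averaged gradient.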

Fig. 16: RA Architecture.
Fig. 17: Epoch time of RA.
Fig. 18: Measured training throughput of RA.
Fig. 19: Estimated epoch time for RA.

IV Evaluation

IV-A Experimental Environment

Here, we run a set of experiments on the distributed ML systems introduced in Section II. To provide a quantitative evaluation of the 1PS, 2PS, 4PS, RA (Horovod), and P2P systems, we evaluated the performance of these system architectures on the same basic ML classification task. System performance has two dimensions: a latency metric and a throughput metric. All of our experiments were conducted on the Amazon EC2 cloud computing platform using m4.xlarge instances. Each instance has 4 vCPUs powered by an Intel Xeon E5-2676 v3 processor and 16 GiB of RAM. We use the MNIST database of handwritten digits [20] as our dataset. The MNIST dataset contains 60,000 training samples and 10,000 test samples of handwritten digits (0 to 9). Each digit is normalized and centered in a gray-scale (0 to 255) image of size 28 × 28. Each image consists of 784 pixels that represent the features of the digit. We deployed from one to seven worker machines to evaluate and quantify the throughput and latency of each system. All our ML classification tasks are written on top of TensorFlow version 1.11.0, an open-source dataflow software library originally released by Google.

Fig. 20: Latency Comparison.
Fig. 21: Throughput Comparison.

IV-B Experimental Evaluation

We implemented a multilayer neural network with two hidden layers and chose the Softmax activation function for the output layer on all three system architectures. We did not include neural network convergence in our study because we believe that it depends on the neural network architecture and hyper-parameters and does not depend on the distributed computing framework. For all experiments, we fixed the batch size at 100.

Parameter Server. We have three setups of the PS system. The first setup consists of 1PS and from one to seven workers, each located on a separate machine. The PS latency in Figure 7 decreases as machines are added, but after some point (the fifth machine in our experiment) it increases because of all-to-one communication and data overloading, which leads to a CPU bottleneck. The nature of a distributed system causes scalability to degrade after five machines and the epoch time for finishing a training cycle to increase. The maximum system throughput, shown in Figure 6, was roughly 10150 images per second on five machines. We notice that adding more machines does not increase the throughput, due to communication bandwidth saturation and synchronization barriers. The second and third setups consist of two and four PSs, respectively, with from one to seven workers located on separate machines. These setups suffered from network overload because more machines communicate with more machines than in the 1PS setup. When we optimize the system with 2PS, the maximum throughput was roughly 13570 images per second on three machines, as shown in Figure 9. The latency in Figure 10, on the other hand, increases after three machines.

Peer-to-Peer. The number of servers equals the number of clients, as shown in Figure 12. In our experiment, we have seven servers and seven clients co-located on seven machines. We noticed some improvement in latency and throughput compared to 1PS, 2PS, and 4PS, as shown in Figures 20 and 21. The reason is that part of the model is located on the same machine as the server. Also, in this setup we do not need to pull the model from remote machines because it is already updated on the same machine.

Ring Allreduce. In Figure 17, we noticed that the epoch time decreases sub-linearly with the number of machines, due to factors such as the independence of bandwidth from the number of nodes and the overlap of computation and communication.

Ease of Development. TensorFlow has low-level and high-level APIs in Python and C++. The TensorFlow back end is written in C++, while the front end supports languages such as Python and C++. In distributed training, developers have to write and deploy the code on each machine or set up the cloud manager. Both ways need expert knowledge to set up and run the network training. TensorFlow offers many functions, but frequent updates and new releases that add features and deprecate others can confuse developers. TensorFlow provides more APIs and primitives than any other ML framework. Debugging is hard, but fortunately the computational graph visualization tool (TensorBoard) offers a visualization suite for tracking performance and network topology. TensorFlow has large community support and is widely used in many businesses and labs. The Horovod API differs from TensorFlow in many ways, such as the simplicity of running distributed training: developers write the code on one machine, and that machine communicates it to every other machine in the system. The Horovod APIs and primitives are not as rich as those of TensorFlow. Horovod has no dedicated debugging tool and relies on TensorBoard. Horovod lacks broad support, and only a few businesses and people are familiar with the library. In distributed training, we noticed that there is no single right answer for which architecture should be used. Using a PS is a good choice when developers have less powerful and less reliable machines, such as a cluster of CPUs. In TensorFlow, the PS architecture is well supported, and developers have a large community for help with debugging and suggestions. On the other hand, Horovod is preferable if the developers' environment has fast devices, such as GPUs, with strong communication links.

V Related Work

Distributed implementations of deep learning algorithms have received much attention in recent years because of their effectiveness in various applications. At present, the usefulness of distributed ML systems such as TensorFlow [1], MXNet [7], and Petuum [32] is recognized in both academia and industry. These open-source deep learning frameworks are built on the PS architecture. Other systems are built on RA, such as Horovod [28], or on a P2P design, such as MALT [21]. Improving the performance of these kinds of frameworks has a large impact on computation resources and training time.

Several works analyze performance from the perspective of a single system design, but to our knowledge, there has not been an in-depth comparative study of communication performance across different system architectures. Previously, we published a comparison study of design approaches used in distributed ML platforms such as TensorFlow, MXNet, and Spark [37]. That study focused on system scalability, graph computation speed, fault tolerance, ease of programming, and resource consumption to identify the differences between these framework designs and their bottlenecks.

Dünner et al. [14] studied the performance limits of Apache Spark for distributed ML applications, particularly its scalability, and compared it with a high-performance computing MPI framework. With several optimization techniques, the performance of the Spark implementation on a learning task improved substantially relative to an MPI implementation of an equivalent task. These optimization techniques reduced some of the training overhead caused by language dependency. However, the largest performance gains and overhead reductions come from tuning the distributed algorithm and the properties of the distributed system framework. A recent work [17] focused on analyzing DNN performance using the CNTK framework. Its performance model captures the scalability of the system as the number of computation nodes increases in small and large clusters. The paper concluded that CNTK suffers from poor I/O, which degrades computation time. Mai et al. [24] designed MLNet, a novel communication layer that addresses network bottlenecks in distributed ML by using tree-based overlays to implement distributed aggregation and multicast and thereby reduce network traffic.

VI Conclusion

In this work, we presented a comparative analysis of communication performance (latency and throughput) for three distributed system architectures. We found that in the 1PS, 2PS, and 4PS configurations, throughput fails to increase linearly due to network congestion. We also found that RA achieves better performance due to its efficient use of network bandwidth and its overlapping of computation and communication.

We hope our study will help practitioners select the system architecture and deployment parameters for their training systems. Our study can also pave the way for future work on estimating the scalability of distributed DNN training. A promising direction for future work is to study the trade-offs between network congestion and extra computation. The research question here, from the distributed systems perspective, is to identify which architectures and design elements can facilitate exploring and exploiting these trade-offs.

VII Acknowledgments

This project is sponsored in part by the National Science Foundation (NSF) under award numbers CNS-1527629 and XPS-1533870.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §I, §I, §V.
  • [2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola (2012) Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pp. 123–132. Cited by: §III-A.
  • [3] baidu-research (2017-08) Baidu-research/tensorflow-allreduce. External Links: Link Cited by: §III-C.
  • [4] T. Ben-Nun and T. Hoefler (2018) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. arXiv preprint arXiv:1802.09941. Cited by: §I.
  • [5] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Cited by: §II-B2.
  • [6] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz (2016) Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981. Cited by: §I.
  • [7] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §V.
  • [8] D. W. Cheung, S. D. Lee, and Y. Xiao (2002) Effect of data skewness and workload balance in parallel data mining. IEEE Transactions on Knowledge and Data Engineering 14 (3), pp. 498–514. Cited by: §III-A.
  • [9] B. Cui, H. Mei, and B. C. Ooi (2014) Big data: the driver for innovation in databases. National Science Review 1 (1), pp. 27–30. Cited by: §III-A.
  • [10] H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons, et al. (2014) Exploiting bounded staleness to speed up big data analytics.. In USENIX Annual Technical Conference, pp. 37–48. Cited by: §III-A.
  • [11] W. Dai, J. Wei, J. K. Kim, S. Lee, J. Yin, Q. Ho, and E. P. Xing (2013) Petuum: a framework for iterative-convergent distributed ml. Cited by: §III-A.
  • [12] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. (2012) Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231. Cited by: §III-A.
  • [13] W. C. I. Dropbox Inc(Website) External Links: Link Cited by: §III-A.
  • [14] C. Dünner, T. Parnell, K. Atasu, M. Sifalakis, and H. Pozidis (2016) Understanding and optimizing the performance of distributed machine learning applications on apache spark. arXiv preprint arXiv:1612.01437. Cited by: §V.
  • [15] C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §II-A.
  • [16] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall (2004) Open mpi: goals, concept, and design of a next generation mpi implementation. In In Proceedings, 11th European PVM/MPI Users’ Group Meeting, pp. 97–104. Cited by: §III-C.
  • [17] S. H. Hashemi, S. A. Noghabi, W. Gropp, and R. H. Campbell (2016) Performance modeling of distributed deep neural networks. arXiv preprint arXiv:1612.00521. Cited by: §V.
  • [18] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing (2013) More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pp. 1223–1231. Cited by: §II-B4.
  • [19] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer (2016) Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600. Cited by: §III-A.
  • [20] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: External Links: Link Cited by: §IV-A.
  • [21] H. Li, A. Kadav, E. Kruus, and C. Ungureanu (2015) Malt: distributed data-parallelism for existing ml applications. In Proceedings of the Tenth European Conference on Computer Systems, pp. 3. Cited by: §II-B, §III-B, §V.
  • [22] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server.. In OSDI, Vol. 14, pp. 583–598. Cited by: §III-A.
  • [23] M. Li, D. G. Andersen, A. J. Smola, and K. Yu (2014) Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27. Cited by: §III-A.
  • [24] L. Mai, C. Hong, and P. Costa (2015) Optimizing network performance in distributed machine learning.. In HotCloud, Cited by: §V.
  • [25] P. Patarasuk and X. Yuan (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69 (2), pp. 117–124. Cited by: §III-C.
  • [26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, D. E. Rumelhart and J. L. Mcclelland (Eds.), pp. 318–362. Cited by: §II-A.
  • [27] A. Sergeev (2015)(Website) Youtube. External Links: Link Cited by: §III-C.
  • [28] A. Sergeev and M. Del Balso (2018) Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799. Cited by: §I, §I, §III-C, §V.
  • [29] A. Smola and S. Narayanamurthy (2010) An architecture for parallel topic models. Proceedings of the VLDB Endowment 3 (1-2), pp. 703–710. Cited by: §III-A.
  • [30] L. G. Valiant (1990) A bridging model for parallel computation. Communications of the ACM 33 (8), pp. 103–111. Cited by: §II-B3.
  • [31] Wikipedia contributors (2018) Artificial neural network — Wikipedia, the free encyclopedia. Note: [Online; accessed 01-Dec-2018] External Links: Link Cited by: §II-A.
  • [32] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu (2015) Petuum: a new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1 (2), pp. 49–67. Cited by: §V.
  • [33] E. P. Xing, Q. Ho, P. Xie, and D. Wei (2016) Strategies and principles of distributed machine learning on big data. Engineering 2 (2), pp. 179–195. Cited by: §II-B3.
  • [34] B. Zhang(Website) External Links: Link Cited by: §I.
  • [35] H. Zhang Intro to distributed deep learning systems. External Links: Link Cited by: §I.
  • [36] J. Zhang, H. Tu, Y. Ren, J. Wan, L. Zhou, M. Li, J. Wang, L. Yu, C. Zhao, and L. Zhang (2017) A parameter communication optimization strategy for distributed machine learning in sensors. Sensors 17 (10), pp. 2172. Cited by: §II-B3.
  • [37] K. Zhang, S. Alqahtani, and M. Demirbas (2017) A comparison of distributed machine learning platforms. In Computer Communication and Networks (ICCCN), 2017 26th International Conference on, pp. 1–9. Cited by: §III, §V.