Deep learning (DL), which refers to a class of neural network models with deep architectures, forms an important and expressive family of machine learning (ML) models. Modern deep learning models, such as convolutional neural networks (CNNs), have achieved notable successes in a wide spectrum of machine learning tasks, including speech recognition, visual recognition, and language understanding. The rapid adoption of CNNs by the research community is largely attributed to high-performance computing hardware, such as GPUs, as well as a wide range of easy-to-use open-source frameworks based on GPUs, including Caffe, Torch, and Theano. As of writing, the current official versions of these toolkits can harness multiple GPUs on the same machine, but cannot use GPUs that are distributed across multiple machines, which limits their practical use to smaller datasets.
On the other hand, several CPU-based distributed systems for deep learning have been implemented. Zou et al. report the Tencent deep learning platform named Mariana, which distributes neural network training onto CPU clusters. Google’s DistBelief framework allows training deep networks on CPU-only clusters with up to 1,000 machines, while Le et al. later scale up to a cluster of 16,000 CPU cores by exploiting model parallelism and asynchronous SGD. Recently, Microsoft’s Adam achieved state-of-the-art results on the ImageNet22K classification task by leveraging distributed systems techniques such as a global parameter server, cache locality, and staleness control between workers. These frameworks demonstrate that there is excellent potential to scale up deep learning using distributed clusters, though they require large clusters with thousands of CPU cores to produce the reported results.
Compared to CPU-based distributed deep learning, parallelization of deep networks on GPU-equipped clusters is more readily available to researchers, since satisfactory speedups could potentially be achieved with a smaller number of GPU cards. However, unlike the setting of a single machine with multiple GPUs, where near-linear speedups can be realized easily, scaling up deep learning on multiple GPU-equipped machines faces two major challenges. First, Infiniband networking, which has been responsible for past successes in distributed DL, is not available on most cloud computing platforms and lab clusters, where only commodity hardware with limited network bandwidth is installed. Since GPUs are often orders of magnitude faster than CPUs in dense matrix computations, GPU-based distributed training generates gigabytes of parameters per second on each device, all waiting to be synchronized across multiple machines. Given the limited bandwidth of commodity Ethernet, such a high communication load makes network communication the main bottleneck. Second, managing the computation and communication in a distributed GPU cluster often complicates the algorithm design. Consequently, more algorithm-specific strategies and dedicated communication protocols are necessary to attain maximum performance when designing GPU-based distributed DL.
In this paper, we investigate how existing software frameworks can be adapted to efficiently support distributed GPUs, given that only commodity Ethernet is available. On one hand, instead of building a new DL framework from scratch, our goal is to develop an efficient system engine for distributed deep learning, and thus enhance existing popular single-machine platforms with distributed GPU capability. Transforming an existing framework rather than designing a completely new one has the following merits. First, it better preserves the ecosystem and saves users the effort of making an expensive switch. Second, it enables us to focus solely on designing fast and efficient distribution strategies, while still enjoying any algorithmic advantages brought by the third-party DL frameworks themselves. On the other hand, in contrast to systems that require specialized hardware, we want our solution to effectively harness distributed GPUs installed on commodity servers and connected via Ethernet, so that our software is as accessible as possible to researchers. To this end, we propose an open-source system architecture, Poseidon, which can be deployed on a variety of cluster configurations (such as CPU-only clusters, GPU-equipped clusters, or clusters with multiple GPUs per machine). (Footnote 1: Poseidon was initially released in January 2015 along with Petuum v1.0 as an application under the Bösen parameter server, with GPU support added in July 2015. All source code is available at github.com/petuum/poseidon/.) Poseidon makes use of any existing single-machine DL framework, and implements a distributed system layer underneath it, in order to harness distributed CPU and GPU clusters with commodity hardware. In our current implementation, we chose Caffe because of its popularity, while noting that Poseidon’s design is compatible with other CNN libraries such as Torch and Theano.
In order to efficiently distribute DL on GPU clusters, we make three key contributions. First, we design Poseidon as a hybrid three-level architecture, which allows Poseidon to work on both CPU-only and GPU-equipped clusters. Second, we propose distributed wait-free backpropagation (DWBP), which leverages the chain rule in backpropagation (BP) and the structure of modern CNNs; DWBP improves GPU utilization and balances communication load by overlapping computation with communication during BP. Third, we develop a structure-aware communication protocol (SACP), which combines centralized parameter storage with decentralized peer-to-peer broadcasting to minimize communication overheads. Together, these three components allow Poseidon to address the communication bottleneck in GPU-based DL on commodity clusters: specifically, how to efficiently synchronize parameters across Ethernet networks when each GPU can generate gigabytes of gradients per second. We implemented Poseidon’s distributed layer upon the Petuum distributed ML framework, which provides a bounded stale synchronous parallel (SSP) parameter server that preserves data-parallel convergence guarantees, as well as prioritized network bandwidth allocation.
Poseidon significantly reduces the training time required by state-of-the-art CNN models, while still achieving the same quality of convergence and accuracy. Using a cluster of 8 GPU-equipped, Ethernet-connected commodity machines, Poseidon significantly alleviates the bottleneck caused by limited bandwidth: it attains almost the same classification accuracy as a single GPU, but is roughly faster when training AlexNet, and faster when training GoogLeNet. These results hold across benchmark datasets of different sizes: CIFAR-10, ILSVRC2012, and ImageNet 22K. For example, on a small task such as the CIFAR-10 quick solver (where distributed training might not be expected to perform well), 8-node Poseidon can achieve better accuracy than a single machine in one quarter of the time. To demonstrate the scalability of Poseidon, we train CNN classifiers on the ImageNet22K dataset, consisting of 14.2M images in 21,841 categories, and achieve accuracy competitive with state-of-the-art results, in less training time and using fewer machines (e.g. 30% of the training time and 13% of the cluster nodes compared to Adam).
We summarize our main contributions as follows: (1) We propose Poseidon, a scalable system architecture that serves as a general-purpose solution for efficiently distributing any single-machine DL framework on GPU clusters with commodity Ethernet, by leveraging the Petuum framework as well as three components: a three-level architecture, distributed wait-free backpropagation, and a structure-aware communication protocol. (2) We empirically show that Poseidon, running on a GPU-equipped cluster with commodity hardware and Ethernet, achieves high-quality convergence comparable to a single machine, as well as state-of-the-art training speedups on benchmark CNN classification models (e.g. on AlexNet, on GoogLeNet, on CIFAR-10, over a single machine). Even for larger datasets such as ImageNet 22K, Poseidon achieves accuracy competitive with state-of-the-art results, while using only 30% of the training time and 13% of the cluster nodes (compared to Adam).
The rest of the paper is organized as follows. In section 2, we review existing work on GPU-based distributed DL. Section 3 covers the basics of neural network models, and briefly introduces some fundamentals of the Petuum PS and data-parallel distributed machine learning. In section 4, we present the architecture and key features of Poseidon. Section 5 evaluates Poseidon on multiple standard datasets with regard to efficiency, scalability and accuracy. Section 6 concludes the paper.
2 Related Work
Because of the demand for faster training of neural networks on ever-larger datasets, several frameworks have been proposed that use multiple GPUs on a single machine. For example, Yadan et al. show that mixed parallelism yields better speedups than model-only or data-only parallelism in ImageNet classification with 4 GPUs. Similarly, Krizhevsky implements mixed parallelism for AlexNet with 8 GPUs, relying on data parallelism in the convolutional layers and on model parallelism in the fully-connected layers. Facebook’s fbcunn [8, 25] implements both model and data parallelism on multiple GPUs. However, the aforementioned frameworks focus on parallelization within a single machine with multiple GPUs, and cannot take advantage of distributed computing environments where GPUs are spread out across a cluster.
Distributed, multi-node GPU-based CNN training is an active area of research. Coates et al. demonstrated that they could train an 11-billion parameter network on a cluster of 16 GPU nodes using model parallelism, but their implementation required specialized hardware, such as Infiniband networking. MXNet is an open-source framework for distributed deep learning that addresses both the algorithmic code for DL, which is the role that Caffe plays in this paper, and distributed execution, which is the technical focus of this paper. No peer-reviewed results for MXNet are available as of writing.
Our position is to identify reusable systems techniques that can be applied to existing single-machine DL frameworks in order to add value to their mature userbase and software ecosystem. We choose Caffe as our example, but note that our techniques could be used for other single-machine deep learning software such as Torch and Theano. Moreover, we make use of commodity hardware (e.g. machines with 1-2 GPUs and Ethernet networking) instead of specialized hardware that is not readily available from cloud providers or most academic clusters (e.g. Infiniband or machines with GPUs). Through our work, we hope to enable existing popular frameworks to be scaled up to distributed clusters of GPU machines.
Recently, Google released their TensorFlow software for deep learning, which does not currently support distributed GPU training, and does not have peer-reviewed results. As with Caffe, we believe the techniques presented herein could be used to produce a distributed version of TensorFlow. Also of note are several efforts to port Caffe onto the Spark platform, such as SparkNet, which reports a 4-5 times speedup with 10 machines (and hence less scalability than our results herein), as well as a recent, non-peer-reviewed effort by Yahoo which exclusively uses Infiniband RDMA. In contrast, our focus is on commodity Ethernet that is readily available in most clusters and cloud providers. We see SparkNet in particular as closest to the spirit and intent of this paper; namely, to scale up existing deep learning frameworks with generic, reusable distributed techniques, and thus add value to their mature ecosystems.
Poseidon builds upon an existing general-purpose system for distributed machine learning algorithms, Petuum, and extends it with new contributions that specifically improve the performance of GPU-based deep learning. In order to clearly delineate our contributions, we begin with a brief overview of the Petuum features that we build upon, and establish some mathematical notations that will be useful in characterizing Poseidon.
3.1 Petuum for Iterative-Convergent ML
Poseidon builds upon Petuum, a distributed large-scale machine learning framework that provides a generic interface to a broad spectrum of ML programs. Its design philosophy is rooted in iterative-convergent solutions to loss function minimization. A number of ML algorithms are formulated in this manner, which involves repeatedly executing update equations that decrease some error function. Notable examples include stochastic gradient descent in optimization programs, MCMC and variational methods for graphical models, and proximal optimization for structured sparsity problems, among others.
In mathematical form, an iterative-convergent algorithm can be represented as follows. Given data $D$ and a loss function $\mathcal{L}$, a typical ML problem can be solved by iteratively executing the update equation until the model parameters $\theta$ reach some stopping criterion:
$$\theta^{(t)} = F\left(\theta^{(t-1)},\ \Delta_{\mathcal{L}}\left(\theta^{(t-1)}, D\right)\right) \tag{1}$$
where $t$ denotes the iteration. The update function $\Delta_{\mathcal{L}}$ performs computation on data $D$ with model parameters $\theta$ to improve the loss $\mathcal{L}$. The intermediate results are aggregated by the function $F$.
In large-scale machine learning, both data and model can be very large. In data-parallelism, the data is partitioned and assigned to $P$ computational worker machines (indexed by $p$), whereas in model-parallelism, the model is partitioned and assigned to workers. Since we are interested in data-parallelism, we partition the data $D$ into a set $\{D_p\}_{p=1}^{P}$, with $D_p$ denoting the $p$-th data partition (often called a mini-batch), as shown in Figure 1. Then, the update equation becomes
$$\theta^{(t)} = F\left(\theta^{(t-1)},\ \sum_{p=1}^{P} \Delta_{\mathcal{L}}\left(\theta^{(t-1)}, D_p\right)\right) \tag{2}$$
In each iteration, the parameter updates produced by each partition of data are computed locally on each worker, and are then communicated among the workers.
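As a minimal sketch of the data-parallel update described above, the following NumPy example uses a least-squares loss. The function names `delta` and `aggregate`, the step size, and the toy data are illustrative assumptions, not part of Petuum or Poseidon. It shows that summing the per-partition updates gives the same result as computing the update on the whole dataset.

```python
import numpy as np

def delta(theta, X, y):
    """Hypothetical update function: negative gradient of 0.5 * ||X theta - y||^2."""
    return -X.T @ (X @ theta - y)

def aggregate(theta, update, step=0.01):
    """Hypothetical aggregation function F: apply a scaled additive update."""
    return theta + step * update

def data_parallel_step(theta, partitions, step=0.01):
    # Each worker computes its update locally on its own data partition ...
    updates = [delta(theta, X_p, y_p) for X_p, y_p in partitions]
    # ... then the per-partition updates are summed and applied once.
    return aggregate(theta, sum(updates), step)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
theta0 = np.zeros(3)

partitions = [(X[:4], y[:4]), (X[4:], y[4:])]      # two workers, four samples each
theta_parallel = data_parallel_step(theta0, partitions)
theta_single = aggregate(theta0, delta(theta0, X, y))
print(np.allclose(theta_parallel, theta_single))
```

The equivalence holds here because the least-squares gradient is a sum over samples; this is the property that lets the update be split across partitions.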
3.2 Stale Synchronous Parallel PS
A parameter server (PS) is a distributed shared memory system that provides a systematic abstraction for iterative-convergent algorithms in data-parallel distributed machine learning. Typically, a PS enables each worker to access the global model parameters via network communication following the client-server scheme. In particular, the training data are partitioned and distributed to a large number of clients (i.e. workers). Data-parallel distributed training can be implemented easily on the PS architecture: the update is computed on each worker over its own data subset, the updates are applied to the model parameters on the server, and a consistency scheme coordinates the synchronization between server and clients.
In data-parallel ML, iterative-convergent algorithms often enjoy the property of error tolerance, i.e. they still execute and converge correctly even when their model parameters experience synchronization delays, provided that those delays are strictly bounded [9, 5, 15]. The stale synchronous parallel (SSP) consistency model exploits this error-tolerance property, and tries to substantially reduce network communication and synchronization overheads by allowing stale parameter updates, as long as the staleness is bounded by a threshold $s$. Integrated with a PS, the SSP consistency model ensures that if a worker reads from the server at iteration $c$, it is guaranteed to receive all updates from all workers computed at and before iteration $c-s-1$. If this is impossible because some straggling worker is more than $s$ iterations behind, the reader will block until the straggler catches up and sends its updates. For stochastic gradient descent algorithms, SSP has very attractive theoretical properties.
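The blocking rule of SSP can be sketched in a few lines of Python. This is a hedged illustration rather than Bösen's implementation: the function name `can_read` and the clock bookkeeping are assumptions. Here `worker_clocks[i]` is taken to be the number of iterations whose updates worker `i` has already pushed, so a reader at iteration `c` may proceed only if every worker has pushed updates for all iterations up to `c - s - 1`.

```python
def can_read(reader_clock, worker_clocks, staleness):
    """Return True if a worker at iteration `reader_clock` may read without blocking.

    worker_clocks[i] counts the iterations worker i has completed and pushed,
    so worker i has delivered updates for iterations 0 .. worker_clocks[i] - 1.
    The reader needs updates up to iteration reader_clock - staleness - 1.
    """
    return min(worker_clocks) >= reader_clock - staleness

# With staleness 2, a reader at iteration 5 tolerates a worker that has pushed 3 iterations ...
print(can_read(5, [5, 4, 3], staleness=2))
# ... but must block if the slowest worker has pushed only 2.
print(can_read(5, [5, 4, 2], staleness=2))
```

Setting `staleness=0` in this sketch recovers fully synchronous execution, where no worker may run ahead of the slowest one.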
Poseidon’s distributed layer is derived from Bösen, a parameter server implementation that supports the SSP consistency model. It allows computations to use stale model parameters (to reduce synchronization overheads), but strictly upper-bounds the number of missing iterations, restoring formal convergence guarantees. Besides Bösen and SSP, Poseidon provides many advanced features that are beneficial for GPU-based distributed deep learning, as covered in section 4.4.
3.3 Data-parallel Distributed Training of Convolutional Neural Networks
A neural network consists of multiple stacked layers, each of which contains computing units of a particular type; adjacent layers are interconnected by real-valued or boolean weight matrices, which are the trainable parameters. The basic computing unit in each layer is called a neuron, which is usually composed of a vector of weights corresponding to a row in the weight matrix, and a nonlinear function that gives the model its expressiveness. Each neuron takes the outputs (activations) of its preceding layer as input and applies both linear and nonlinear transformations to produce its own activation, which is then passed to the following layers as their input. At the bottom of a neural network is an input layer, which reads and vectorizes different types of data as network inputs, while at the top of the network is usually a loss layer, which is specified by an optimization objective (e.g. a classifier or a regressor).
Convolutional neural networks (CNNs) have both convolutional layers and fully-connected layers as building blocks. A neuron in a convolutional layer is also called a filter; it is connected to a local spatial region of its previous layer’s output (feature maps), and shares the same weights across all possible regions. This weight-sharing pattern significantly reduces the number of trainable parameters, making CNNs much easier to train and less prone to overfitting. A convolutional layer is usually followed by a nonlinear down-sampling layer, such as a max-pooling layer, which partitions the output feature map into a set of rectangles and outputs the maximum value for each such sub-region.
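To make the max-pooling step concrete, here is a small NumPy sketch. It assumes non-overlapping square windows and a feature map whose sides are divisible by the window size; the function name `max_pool2d` is illustrative and is not Caffe's API.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows (illustrative sketch)."""
    h, w = feature_map.shape
    assert h % size == 0 and w % size == 0, "sketch assumes evenly divisible dimensions"
    # Split rows and columns into (block index, offset within block) pairs,
    # then take the maximum over the two within-block axes.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 8, 1, 0],
                 [7, 6, 3, 2]])
pooled = max_pool2d(fmap)
print(pooled)  # [[4 8]
               #  [9 3]]
```

Each output entry is the maximum of one 2x2 rectangle of the input, so the 4x4 feature map is reduced to 2x2.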
CNNs are trained using stochastic gradient descent (SGD), which falls into the family of iterative-convergent algorithms. Specifically, training is performed by an iterative algorithm, where each iteration consists of a feedforward pass and a backpropagation pass. In the feedforward pass, the network takes a batch of training samples as input, forwards it from the bottom to the top layers, and outputs a prediction for each sample at the last layer. The predictions are then compared to the ground truth of the training samples at the loss layer to compute the error value. In the backpropagation pass, the error is propagated through the network in reverse order, during which the weights in each layer are updated in the direction that decreases the loss. After a sufficient number of training iterations, the network will usually converge to a state where the loss is close to an optimum, and training is then terminated.
Accordingly, learning CNNs is another typical distributed ML problem to which Petuum’s iterative-convergent strategy applies. In CNN training, the update equation Eq. 1 reduces to
$$\theta^{(t)} = \theta^{(t-1)} - \varepsilon \cdot \nabla_{\theta}\mathcal{L}\left(\theta^{(t-1)}, D^{(t)}\right) \tag{3}$$
where the parameter updates are calculated from the gradients $\nabla_{\theta}\mathcal{L}$ of the loss over the current data batch $D^{(t)}$, scaled by a stepsize $\varepsilon$, and the updating function $F$ of Eq. 1 reduces to addition, as in SGD. In practice, the update often also includes regularization and momentum terms on the model parameters $\theta$.
Similarly, in the data-parallel distributed setting, every node holds a replica of the network parameters $\theta$. At each iteration, every node $p$ (of $P$ nodes) takes a batch of data $D_p^{(t)}$, performs a feedforward and backpropagation pass, and produces a copy of the gradients. The gradients are then communicated, aggregated, and applied to update the model parameters as
$$\theta^{(t)} = \theta^{(t-1)} - \varepsilon \cdot \sum_{p=1}^{P} \nabla_{\theta}\mathcal{L}\left(\theta^{(t-1)}, D_p^{(t)}\right) \tag{4}$$
4 Poseidon Architecture
|Ethernet||Rate(GBit/s)||Rate (Mb/s)||Rate (# floats/s)|
|Model||Batch size (# images)||# parameters (# floats)||Time (s/iter)||Gradients (# floats/s)|
Because GPUs are faster at matrix computations than CPUs, gradient updates are produced on GPUs faster than they can be naively synchronized over the network; as a result, the computations during neural network training are usually bottlenecked by communication, as evidenced by Table 1 and Table 2. In particular, Table 1 lists the standards of commonly used Ethernet and Table 2 shows some statistics of modern CNN training. (Footnote 2: The performance is quoted from the official site of Caffe, caffe.berkeleyvision.org/performance_hardware.html, and from github.com/BVLC/caffe/issues/1317.) Take AlexNet training as an example: given a standard solver setting with batch size 256, 61.3 million gradients are generated per second on each device. If we distribute the training onto a commodity cluster with 8 nodes, each equipped with 1 GPU, the master node ideally needs to receive at least 490M float parameters, and then send out another 490M, within one second to guarantee that the next iteration of computation on the workers is not blocked. Though adjusting the network configuration (e.g. increasing the batch size) may decrease the communication load, the demanded throughput is still far above the maximum throughput that commodity Ethernet (i.e. 1 GbE and 10 GbE) can afford. (Footnote 3: Also note that, due to issues related to network protocols and software implementations, the actual performance achievable in practice is usually lower than the reported standard values.) Therefore, when distributing DL on GPU clusters, the major challenges are how to quickly collect and aggregate the gradients, and how to efficiently synchronize updated parameters across all workers.
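The bandwidth gap in the AlexNet example can be checked with a short back-of-the-envelope calculation. The sketch below assumes 32-bit floats and ignores protocol overhead; the function name is illustrative.

```python
BYTES_PER_FLOAT = 4  # assumption: gradients are sent as 32-bit floats

def required_gbit_per_s(floats_per_s_per_worker, num_workers):
    """One-way traffic at the master node, in Gbit/s, ignoring protocol overhead."""
    floats_per_s = floats_per_s_per_worker * num_workers
    return floats_per_s * BYTES_PER_FLOAT * 8 / 1e9

alexnet_rate = 61.3e6   # gradients (floats) per second per device, from the text
inbound = required_gbit_per_s(alexnet_rate, num_workers=8)
print(f"{alexnet_rate * 8 / 1e6:.0f}M floats/s, i.e. {inbound:.1f} Gbit/s in each direction")
```

Under these assumptions the master would need roughly 15.7 Gbit/s inbound and the same again outbound, which exceeds the nominal rate of both 1 GbE and 10 GbE links.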
Poseidon presents three key contributions to address these challenges: a three-level hybrid architecture that supports both CPU and GPU computation, a distributed wait-free backpropagation (DWBP) scheme to interleave computation with inter-machine communication, and a structure-aware communication protocol (SACP) that reduces the size of network messages. The three-level architecture improves Poseidon’s generality, by allowing it to work with both CPU- and GPU-based DL software, while DWBP and SACP enable the DL software to communicate quickly and efficiently across the network.
4.1 Overview: A Three-level Structure
Existing systems for distributed deep learning usually exhibit a traditional client-server structure. For example, in previous CPU-based distributed DL systems [2, 6], a two-level parameter server architecture was built, where the first level has server machines collecting gradients and distributing newly updated model parameters to workers, and the second level has worker nodes (threads) taking batches of training data and generating gradient updates. When deploying them onto GPU clusters, one may need to heavily adjust the implementation to support more sophisticated cluster configurations (e.g. a cluster of GPU nodes where each node has multiple GPUs), as well as to avoid unnecessary memory accesses between different types of devices. Moreover, existing architectures only allow connections between server and clients, so communication can only happen between master and slave nodes.
In order to provide a general solution for both CPU-only and GPU-based distributed deep learning, as well as to enable more strategic communication approaches, we design Poseidon as a three-level structure, as Fig.2 illustrates. First, we add an additional hierarchy within each worker node, thus allowing multiple client threads to coexist on a single worker machine. By binding each worker thread to a specific device (CPU core or GPU), this design enables Poseidon to support both CPU and GPU users as well as any system configuration, such as a cluster of nodes where each node has multiple GPUs or CPU cores. Second, and more importantly, instead of the traditional client-server structure, where each client only connects to the server machine, we design a hybrid topology in which both peer-to-peer (P2P) connections between pairs of workers and server-client connections between the server and workers are established. This enables more dedicated communication strategies for parameter synchronization among multiple GPU nodes, which we elaborate on in section 4.3.
Algorithm 1 presents an overview of the distributed training process of Poseidon. At the beginning of training, every worker thread starts its Caffe engine to perform a feedforward pass followed by a backpropagation pass for some number of iterations, via the distributed wait-free backpropagation (DWBP) algorithm (see section 4.2). During this process, the workers communicate asynchronously following the bounded stale synchronous consistency model that we briefly introduced in section 3.2. The DWBP algorithm enables communication to be overlapped with the error propagation computations. The structure-aware communication protocol (SACP) minimizes the communication load by exploiting the layer properties of neural nets, sending or receiving the parameter updates through whichever of the client-server or P2P pipelines is cheaper (see section 4.3). At the lower level, the communications are further monitored and managed by a bandwidth manager provided by Petuum Bösen, as we explain in section 4.3.3.
4.2 Distributed Wait-free Backpropagation
Backpropagation (BP) is the principal algorithm for training neural networks. Specifically, the BP algorithm proceeds as a chain, with many feedforward and backpropagation passes. During the backward pass, an error message is propagated from the top to the bottom of the network, thus forming a message-passing chain.
Figure.3.(a) shows the process of the original BP in distributed settings on a neural net with $L$ layers and layer parameters $\theta = \{\theta_1, \dots, \theta_L\}$. At each iteration $t$, every worker performs the BP computation separately. Only when the propagation reaches the bottom layer (i.e. all gradients $\nabla\theta^{(t)}$ are generated) is each worker ready to start communication. The worker sends out its local parameter updates $\nabla\theta^{(t)}$, waits for the remote master node to collect, aggregate and apply the parameter updates from all workers, and then synchronizes with the master node via the network to fetch a new copy of the updated parameters $\theta^{(t)}$ for the next iteration $t+1$. Therefore, each worker cannot proceed to iteration $t+1$ until it receives all updated layer parameters $\theta^{(t)}$; the computation and communication occur sequentially, as shown in Figure.3.(a).
Distributed wait-free backpropagation is designed to reduce the time spent waiting for parameter synchronization when backpropagation executes concurrently on multiple machines, so as to improve GPU utilization. Specifically, leveraging the chain structure of BP, once layer $l_i$ finishes its computations and propagates its error message $e_i$ to the preceding layer $l_{i-1}$, its gradients $\nabla\theta_i$ are ready to be sent out, and its parameters $\theta_i$ are ready to be updated. This is because each layer $l_i$ in the network occupies an independent set of parameters $\theta_i$, and the subsequent computations of lower layers no longer affect upper layers. Correspondingly, the parameter updating at upper layers does not affect that of lower layers either, because the computations of a lower layer $l_{i-1}$ only depend on the error message $e_i$, which has already been passed.
|Parameters||CONV Layers (#/% )||FC Layers (#/% )|
|AlexNet||2.3M / 3.75||59M / 96.25|
|VGG-16||7.15M / 5.58||121.1M / 94.42|
|FLOPs||CONV Layers (#/% )||FC Layers (#/% )|
|AlexNet||1,352M / 92.0||117M / 8.0|
|VGG-16||10,937M / 91.3||121.1M / 8.7|
Algorithm 2, illustrated in Fig.3.(b), summarizes the DWBP algorithm, whose intuition is to concurrently schedule the computations of lower layers and the communications of upper layers during BP. It exploits the chain structure of the network and overlaps the communication at upper layers with the computation at lower layers. Unlike the original BP, DWBP makes each layer start its communication as soon as its gradients are generated, and allows the parameters of that layer to be updated independently of the other layers. Ideally, when the propagation reaches the bottom of the network, both communication and computation are finished, so the worker can immediately start the next iteration.
DWBP is even more effective in GPU clusters with state-of-the-art CNN architectures, such as AlexNet and VGG-16, which stack convolutional (CONV) layers at the bottom, followed by fully-connected (FC) layers at the top. Table 3 shows the sizes of the parameters and the computation in FLOPs for the CONV layers and FC layers of AlexNet and VGG-16. FC layers account for more than 90% of the model parameters, indicating that communication costs are mostly incurred at the top FC layers, while the CONV layers hold less than 10% of the model parameters but account for more than 90% of the FLOPs, meaning that computation costs are mostly spent at the CONV layers. Since DWBP overlaps the communication of the top layers with the computation of the bottom layers, such structures benefit greatly from it: most of the computation and communication workload is overlapped, so the waiting time on GPUs drops significantly and GPU utilization rises. We implement DWBP by creating a separate thread for each independent layer, thereby enabling concurrent communication and computation for different layers. The effectiveness of DWBP is empirically evaluated in section 5.2.1.
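The scheduling idea behind DWBP can be sketched with Python threads. This is a toy illustration under stated assumptions, not Poseidon's implementation: `compute_grad` and `sync_layer` are hypothetical stand-ins for a layer's backward computation and its network synchronization, and `time.sleep` merely simulates their cost. One thread per layer is launched as soon as that layer's gradient exists, so its communication overlaps with the backward computation of the layers below.

```python
import threading
import time

def backward_with_dwbp(layers, compute_grad, sync_layer):
    """Backpropagate from the top layer down, syncing each layer as soon as its gradient exists."""
    threads = []
    for layer in reversed(layers):                 # top -> bottom
        grad = compute_grad(layer)                 # backward computation for this layer
        t = threading.Thread(target=sync_layer, args=(layer, grad))
        t.start()                                  # communication overlaps with lower layers
        threads.append(t)
    for t in threads:                              # the iteration ends once every layer is synced
        t.join()

# Hypothetical stand-ins that only record what happened.
events = []
lock = threading.Lock()

def compute_grad(layer):
    time.sleep(0.01)                               # pretend to compute gradients
    with lock:
        events.append(("computed", layer))
    return f"grad_{layer}"

def sync_layer(layer, grad):
    time.sleep(0.03)                               # pretend to send gradients and fetch parameters
    with lock:
        events.append(("synced", layer))

layers = ["conv1", "conv2", "fc6", "fc7"]          # bottom -> top
backward_with_dwbp(layers, compute_grad, sync_layer)
print(events)
```

In this sketch the large FC layers at the top start communicating first, while the compute-heavy CONV layers below are still being processed, which mirrors the overlap described above.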
4.3 Structure-Aware Message Passing Protocol
Most ML models, such as neural networks, fall into the family of matrix-parameterized models (MPMs), which represent their parameters as a set of matrices. In data-parallel distributed settings, learning MPMs using iterative-convergent algorithms, as in [2, 6], usually requires repeatedly pushing out and pulling in whole parameter matrices. Take AlexNet as an example: the weights between the two FC layers fc6 and fc7 are represented as a $4096 \times 4096$ matrix $W$, as are its gradients $\nabla W$. At each iteration, every worker sends out $\nabla W$ and then synchronizes the updated $W$, which involves communicating two $4096 \times 4096$ float matrices via the network, as Fig.4.(a) shows. However, commodity Ethernet can transmit at most a few gigabits of data per second (see Table 1). Meanwhile, in practice, the size of the parameters to be communicated grows rapidly with the model size, the problem complexity, and the number of nodes in the cluster, and GPU-based computing further decreases the per-iteration computation time. Consequently, the parameters to be transferred per second easily exceed the bandwidth of the network, which in turn bottlenecks the computation. To address this challenge, in Poseidon, besides client-server connections between servers and workers, we also allow P2P connections between every pair of workers, based on which we design a new communication protocol that minimizes the number of parameters to be communicated by exploiting a structural property of neural networks.
In this section, we first introduce a novel communication approach of Petuum for distributed machine learning, namely sufficient factor broadcasting (SFB), which exchanges parameters following a P2P scheme. Then we discuss the proposed structure-aware communication protocol (SACP), which is essentially a hybrid communication approach between the centralized parameter server (PS) and decentralized SFB. SACP significantly reduces the communication cost by directly reducing the number of parameters that need to be communicated during neural network training, so as to alleviate the bottleneck caused by the limited bandwidth of commodity Ethernet. We conduct internal comparisons and demonstrate the effectiveness of SACP in section 5.2.1.
4.3.1 Sufficient Factor-based Communication
Some MPMs, including neural networks, enjoy the following structural property: when training using SGD, their gradient over a batch of training samples is a low-rank matrix, in which each sample contributes a term that can be cast as the outer product of two vectors $u$ and $v$: $\nabla W = u v^{\top}$, where $u$ and $v$ are called sufficient factors (SFs). Consider the training of CNNs, where $W$ is an $M \times N$ weight matrix between two FC layers $l_i$ and $l_{i+1}$. In the forward pass, one data sample is fed into the network and the activation of layer $l_i$ is produced as $a_i$. During BP, the loss is propagated, and an error message $E_{i+1}$, which is an $M$-dimensional vector, is passed back from $l_{i+1}$ to $l_i$. The gradients $\nabla W$ can thus be exactly reconstructed from the two vectors $E_{i+1}$ and $a_i$:
$$\nabla W = E_{i+1}\, a_i^{\top} \tag{5}$$
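The sufficient-factor property can be verified numerically on a toy fully-connected layer. The sketch below assumes a squared-error loss and small layer sizes purely for illustration; it compares the outer product of the error message and the activation against a finite-difference estimate of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 7                              # toy layer sizes (assumption)
W = rng.normal(size=(M, N))              # weights between two FC layers
a = rng.normal(size=N)                   # activation of the lower layer
target = rng.normal(size=M)

def loss(weights):
    """Illustrative loss: 0.5 * ||weights @ a - target||^2."""
    return 0.5 * np.sum((weights @ a - target) ** 2)

# Error message passed back from the upper layer (gradient w.r.t. its input W @ a).
error = W @ a - target
grad_from_factors = np.outer(error, a)   # rank-one gradient built from the two factors

# Central finite differences as an independent check.
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(M):
    for j in range(N):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        grad_numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(grad_from_factors, grad_numeric, atol=1e-5))
print(np.linalg.matrix_rank(grad_from_factors))
```

The two factors together hold M + N numbers, whereas the full gradient matrix holds M * N; this gap is what sufficient-factor communication exploits.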
Sufficient factor broadcasting (SFB) is designed to minimize the number of parameters to be communicated by leveraging the above property. In a distributed setting with $P$ workers, on worker $p$, instead of directly communicating the two full matrices $\Delta W_p$ and $W$ with the master node, we recast the exchange as three steps: (1) decouple $\Delta W_p$ into two vectors $u_p$ and $v_p$; (2) broadcast $u_p$ and $v_p$ to all other peer workers and receive their sufficient factors in return, as Fig.4.(b) shows; (3) reconstruct the gradients from the received sufficient factors as in Eq.(5), and apply the updates locally on every worker.
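The three steps can be simulated in a single process. The following is a sketch under simplifying assumptions: one sufficient-factor pair per worker per round, plain SGD with a fixed learning rate, and in-memory lists standing in for the network. The function names are hypothetical and do not correspond to Poseidon or Petuum APIs.

```python
def outer(u, v):
    """Return the outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def apply_update(W, u, v, lr):
    """In-place SGD step W <- W - lr * u v^T."""
    for i, row in enumerate(outer(u, v)):
        for j, g in enumerate(row):
            W[i][j] -= lr * g

def sfb_round(replicas, local_sfs, lr):
    """Run one SFB round over P simulated workers.

    replicas:  list of P weight matrices, one local copy per worker.
    local_sfs: list of P (u, v) pairs, already decoupled locally (step 1).
    """
    num_workers = len(replicas)
    # Step 2: every worker broadcasts its pair; each inbox ends up holding
    # the worker's own pair plus the pairs received from its P - 1 peers.
    inbox = [[] for _ in range(num_workers)]
    for u, v in local_sfs:
        for dst in range(num_workers):
            inbox[dst].append((u, v))
    # Step 3: every worker rebuilds each gradient and updates its own copy.
    for W, pairs in zip(replicas, inbox):
        for u, v in pairs:
            apply_update(W, u, v, lr)

# Three workers sharing a 2 x 3 weight matrix initialised to zero.
replicas = [[[0.0] * 3 for _ in range(2)] for _ in range(3)]
local_sfs = [
    ([1.0, 0.0], [1.0, 2.0, 3.0]),
    ([0.0, 1.0], [1.0, 1.0, 1.0]),
    ([1.0, 1.0], [0.0, 0.0, 2.0]),
]
sfb_round(replicas, local_sfs, lr=0.5)
```

Because every worker applies the same set of reconstructed gradients, all replicas end the round with identical weights even though no full matrix was ever sent.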
Compared to the traditional client-server pipeline, SFB can significantly reduce the communication cost in many popular settings. Consider training a CNN with a batch size of $K$. In each batch, every worker needs to broadcast $K$ sets of $M$- and $N$-dimensional vectors to, and receive them from, $P-1$ workers, so in total $(P-1)^2 K (M+N)$ floats need to be transmitted. In a traditional parameter server, where the full matrices are sent, the size is $2PMN$ ($M, N \gg K$ in modern CNN structures). For instance, when training AlexNet on 4 GPU nodes with $K = 256$ and $M = N = 4096$ for fc6 and fc7, SFB communicates only 18.9M parameters in each iteration, which is about 7.1 times less than the 134.2M needed to communicate full matrices.
Microsoft Adam employs a different SF-based strategy. The SFs from all workers are first sent to the master node following the client-server scheme, then transformed into matrices and aggregated to update model parameters. Then, full parameter matrices are sent back to each worker, as Fig.4.(c) shows. Its communication cost is thus $PK(M+N) + PMN$. With the previous example, 75.5M parameters need to be communicated, which is 4 times the cost of SFB.
Fig.5 compares the aforementioned three strategies in terms of the number of parameters to be communicated between layers fc6 and fc7 when training AlexNet with different numbers of nodes and batch sizes. SFB usually outperforms the other two strategies when the batch size is small. One potential drawback of SFB is that its communication cost increases quadratically with the number of nodes, since it employs the peer-to-peer communication scheme.
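The quoted figures can be reproduced with a few lines of arithmetic. The cost expressions below are assumptions: they are the forms that match the numbers stated in the text (18.9M, 134.2M and 75.5M) for AlexNet's fc6-fc7 weights with $M = N = 4096$, a batch size of 256 and 4 workers. The function names are illustrative.

```python
def full_matrix_cost(M, N, P):
    """Parameter server with full matrices: each of P workers pushes an
    M x N gradient and pulls an M x N weight matrix."""
    return 2 * P * M * N

def sfb_cost(M, N, P, K):
    """Sufficient factor broadcasting: K pairs of M- and N-dimensional
    vectors exchanged among peers (assumed form, see lead-in)."""
    return (P - 1) ** 2 * K * (M + N)

def adam_style_cost(M, N, P, K):
    """SFs pushed to the server by P workers, full matrices pulled back."""
    return P * K * (M + N) + P * M * N

M = N = 4096   # fc6 and fc7 widths in AlexNet
K = 256        # batch size
P = 4          # number of GPU nodes

full = full_matrix_cost(M, N, P)     # 134,217,728 floats, about 134.2M
sfb = sfb_cost(M, N, P, K)           # 18,874,368 floats, about 18.9M
adam = adam_style_cost(M, N, P, K)   # 75,497,472 floats, about 75.5M

print(f"full matrices: {full / 1e6:.1f}M")
print(f"SFB:           {sfb / 1e6:.1f}M ({full / sfb:.1f}x fewer than full)")
print(f"Adam-style:    {adam / 1e6:.1f}M ({adam / sfb:.1f}x the SFB cost)")
```

Under these expressions the SFB term grows with $(P-1)^2$ while the other two grow linearly in $P$, so SFB loses its advantage as the cluster grows or the batch size increases.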
4.3.2 Structure-Aware Communication Protocol
We propose the structure-aware communication protocol (SACP), which hybridizes the client-server PS scheme with the P2P SFB scheme, for GPU-based distributed deep learning. SACP is structure-aware in that it determines the best communication method before communicating the parameters, according to the layer being processed, the SGD batch size, and the number of workers. In particular, for CONV layers, where layer parameters are sparse, SACP uses the centralized client-server PS scheme to communicate the parameters directly via the parameter server. On the other hand, for FC layers, where layer parameters are dense and enjoy the low-rank property of MPMs, SACP chooses between the two SF-based communication schemes (i.e. centralized PS and SFB) according to the batch size and the number of workers. Algorithm 3 summarizes how SACP controls the communication.
Complementary to Algorithm 2, SACP can be incorporated into DWBP to significantly reduce communication costs as well as improve GPU utilization. Although SF-based communication may incur extra computation cost due to the reconstruction of gradients from SFs, in GPU-based distributed deep learning such computations are often negligible compared to communication and SF computation.
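The selection logic can be sketched as a small decision function. This is a hedged illustration and not a transcription of Algorithm 3: it assumes that the choice for FC layers is made by comparing estimated communication costs, with cost expressions chosen to match the AlexNet figures quoted in section 4.3.1. The scheme labels and the function name are hypothetical.

```python
def choose_scheme(layer_type, M, N, batch_size, num_workers):
    """Pick a communication scheme for one layer.

    Returns "PS" (full matrices via the parameter server), "SFB"
    (peer-to-peer sufficient factor broadcasting) or "SF_VIA_PS"
    (sufficient factors pushed to the server, full matrices pulled back).
    """
    if layer_type == "CONV":
        # CONV layers: communicate parameters directly via the server.
        return "PS"
    if layer_type != "FC":
        raise ValueError(f"unknown layer type: {layer_type}")
    K, P = batch_size, num_workers
    # Assumed cost models, in number of floats per iteration.
    sfb_cost = (P - 1) ** 2 * K * (M + N)
    sf_via_ps_cost = P * K * (M + N) + P * M * N
    return "SFB" if sfb_cost <= sf_via_ps_cost else "SF_VIA_PS"

# fc6-fc7 of AlexNet (4096 x 4096 weights) with a batch size of 256.
small_cluster = choose_scheme("FC", 4096, 4096, 256, 4)
large_cluster = choose_scheme("FC", 4096, 4096, 256, 16)
conv_layer = choose_scheme("CONV", 96, 363, 256, 4)
```

With 4 workers the peer-to-peer scheme is cheaper for this layer, while with 16 workers the quadratic term makes the server-based SF scheme the better choice under the assumed cost models.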
4.3.3 Bandwidth Management
Poseidon also exploits the Bösen-based communication strategy, a key component of Petuum that maximizes network efficiency under a given network bandwidth budget (especially on commodity Ethernet) while minimizing parallel errors. Cooperating with DWBP and SACP, which are aware of the model and cluster structures, the bandwidth manager further incorporates prior knowledge of the low-level network bandwidth, and maximizes communication efficiency by prioritizing network bandwidth for the messages most significant for algorithm progress. Specifically, it communicates model updates and dirty model parameters as quickly as possible without overusing the network bandwidth budget (full network utilization), and allocates network bandwidth according to the messages’ contribution to convergence. In Poseidon, the bandwidth manager sits beneath DWBP and SACP (as shown in Figure 2), and manages the message passing among server and clients regardless of the message types (matrices or SFs).
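One simple way to picture this prioritization is a greedy scheduler over pending messages. The sketch below is illustrative only and does not reproduce the Bösen implementation: it assumes that the largest absolute update value serves as a proxy for a message's contribution to convergence, that the budget is counted in floats per round, and that every pending message is non-empty. All names are hypothetical.

```python
def schedule_messages(pending, budget_floats):
    """Greedily pick messages to send under a per-round bandwidth budget.

    pending:       dict mapping a message name to its list of float updates.
    budget_floats: number of floats that may be sent in this round.
    Returns (sent, deferred) as lists of message names.
    """
    # Assumption: larger updates matter more for algorithm progress.
    by_priority = sorted(
        pending,
        key=lambda name: max(abs(x) for x in pending[name]),
        reverse=True,
    )
    sent, deferred, used = [], [], 0
    for name in by_priority:
        size = len(pending[name])
        if used + size <= budget_floats:
            sent.append(name)
            used += size
        else:
            deferred.append(name)  # retried in a later round
    return sent, deferred

pending = {
    "fc7": [0.5, -0.1],
    "conv1": [0.01, 0.02, 0.0],
    "fc6": [-0.9],
}
sent, deferred = schedule_messages(pending, budget_floats=3)
```

Here the two messages with the largest updates fit within the budget and are sent first, while the small `conv1` update waits for the next round.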
4.4 Other Features
Poseidon includes features to enhance the usability of the deep learning software system, by addressing issues such as distributed storage and fault tolerance. While not crucial to the performance of distributed GPU-based training, they help to improve the user experience.
Distributed Storage. Poseidon supports both shared and private file systems across cluster nodes, so that the training data can be stored either in a shared file system that all cluster nodes access simultaneously, or in separate file systems where each node holds its own data partition, to avoid I/O overload.
Fault Tolerance. Poseidon provides fault tolerance by checkpointing all clients’ model states. Either in the event of failure or as the user specifies, the entire distributed CNN system can be restarted from the last checkpoint exactly, keeping all model/solver states and database pointers unchanged as before.
We first evaluate Poseidon on image classification tasks with the benchmark datasets CIFAR-10 and ILSVRC2012, and show that Poseidon significantly accelerates the training of modern CNN structures while guaranteeing correct convergence, which is important for distributed deep learning. Moreover, we deploy Poseidon on ImageNet 22K classification, and compare its performance with previously published results such as Adam. Finally, we conduct some internal comparisons to justify the effectiveness of DWBP and SACP.
Cluster Configuration. We conduct all experiments on the PRObE Susitna cluster, where each node has -core 2.1GHz AMD Opteron 6272 CPUs, 128GB of RAM, and an NVIDIA Tesla K20C GPU with 4799MB memory. All cluster nodes have shared access to an NFS with 1x Hitachi 1.0 TB HDD and 2x Hitachi 3.0 TB HDD. We use the 40GbE network both for connecting to NFS and for communication among workers. For software, we use the October 2014 version of Caffe with CUDA 6.5, cuDNN R2, and NVIDIA driver version 340.29.
5.1 Image Classification
We demonstrate Poseidon’s performance on three benchmark datasets ranging from small to large: CIFAR-10, ILSVRC2012 and ImageNet22K. The statistics of the datasets are briefly summarized in Table 4.
5.1.1 Classification on CIFAR-10
We first evaluate Poseidon on the CIFAR-10 dataset, which contains $32 \times 32$ images of 10 classes, with 6K images per class. An official train/test split is provided, in which 50K images are used for training and 10K for testing. Although CIFAR-10 is a relatively small dataset, we use it to show that Poseidon can achieve better accuracy than a single machine while also accelerating the training of small CNNs.
Settings. We employ the built-in cifar10_quick_solver and cifar10_quick_train_test network structure in Caffe (github.com/BVLC/caffe/tree/master/examples/cifar10), consisting of 3 CONV layers and 1 FC layer followed by a 10-way softmax classifier, in total parameters. It converges to a test accuracy with 4 epochs of training on a single machine without decreasing the learning rate. We deploy Poseidon onto 8 Susitna nodes. As a larger batch size usually hurts SGD performance, for both settings we reduce the batch size from 100 to 50 and also slightly decrease the base learning rate from 0.01 to 0.007, while keeping other solver settings unchanged. All CIFAR-10 images are stored in a single LMDB on NFS with shared access from the 8 nodes. For better comparison, in the distributed setting, we set the staleness to zero (i.e. we use the BSP consistency model during training).
Performance. Similar to the single machine setting, we train the network to convergence without adjusting the learning rate. The test accuracy reaches nearly . Figure.6(a)-(b) plot how the test error decreases with training time and iterations for Poseidon on 8 nodes and Caffe on a single node. Under the same setting, single-machine Caffe takes more than times the training time to converge to accuracy, while Poseidon quickly converges to in 19 seconds and attains a higher accuracy in seconds with GPU nodes.
|Dataset||# of Images||Size of images||# of categories|
5.1.2 Classification on ILSVRC 2012
We then experiment on ImageNet ILSVRC 2012, consisting of 1.28 million training images and 50K validation images over 1,000 categories. Following standard practice, we downsample all images to before feeding them into the networks, and report the top-1 accuracy on the validation set. These experiments show that Poseidon significantly accelerates the training of modern state-of-the-art CNN architectures while guaranteeing correct convergence in a distributed GPU cluster.
Settings. AlexNet is a de facto standard CNN architecture with 5 CONV layers, 2 FC layers and a 1000-class softmax classifier, with 61.3 million parameters in total. GoogLeNet is a more structured and deeper (22-layer) CNN that stacks inception modules and has only 5 million parameters. For fair comparison, we employ the open implementations of AlexNet (github.com/BVLC/caffe/tree/master/models/bvlc_alexnet) and GoogLeNet (github.com/BVLC/caffe/tree/master/models/bvlc_googlenet) provided in Caffe. Specifically, the bvlc_alexnet achieves top-1 accuracy after convergence, and the bvlc_googlenet converges to top-1 accuracy, both using just the center crop for testing. In single machine training, for AlexNet, we use the standard solver in Caffe, which trains with a batch size of 256 for nearly 70 epochs, during which the learning rate is divided by 10 three times. For GoogLeNet, we employ the quick_solver, which uses the polynomial learning rate policy, and train for 60 epochs with the batch size set to . In the distributed setting, we deploy both AlexNet and GoogLeNet onto 8 GPU nodes with fully data-parallel training, and keep the network structure and the batch size exactly the same, but change to a more suitable solver setting. Specifically, for AlexNet, we train on 8 nodes for about 60K iterations, with the base learning rate set to 0.005 and decreased 5 times during the whole training. For GoogLeNet, we use a standard step policy, setting the base learning rate to and decreasing it times during training. Because a single LMDB on NFS bottlenecks training when it is read simultaneously by 8 nodes, we split it into 8 parts and let every node access a separate part to avoid I/O overload.
Performance. Figure.6(c)-(d) and Figure.6(e)-(f) show the performance of training AlexNet and GoogLeNet, respectively, using Poseidon with a GPU cluster of 8 nodes, compared to single machine Caffe. For AlexNet, Poseidon attains top-1 accuracy on the validation set after 27 hours of training, with a speedup compared to single machine Caffe, which needs 5 days. For GoogLeNet, Poseidon converges to top-1 accuracy after 130 hours of training. By comparison, Caffe on a single Susitna node only achieves top-1 accuracy after 250 hours of training and after nearly 350 hours (Poseidon needs less than 48 hours to achieve and 75 hours to achieve , a near speedup), and it struggles to converge even with more than 500 hours of training. Finally, we summarize the convergence speedups of Poseidon in Table.5.
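The two learning rate policies mentioned in these settings follow simple closed forms in Caffe. The sketch below restates them in Python for illustration. The base learning rate of 0.005 comes from the AlexNet setting above, while `gamma`, `stepsize`, `power` and `max_iter` are hypothetical values chosen here, not the ones used in the experiments.

```python
def step_lr(base_lr, gamma, stepsize, iteration):
    """Caffe "step" policy: multiply by gamma every `stepsize` iterations."""
    return base_lr * gamma ** (iteration // stepsize)

def poly_lr(base_lr, power, max_iter, iteration):
    """Caffe "poly" policy: polynomial decay that reaches zero at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Step policy: base_lr from the text, gamma and stepsize are assumptions.
lr_start = step_lr(0.005, gamma=0.1, stepsize=10000, iteration=0)
lr_after_two_drops = step_lr(0.005, gamma=0.1, stepsize=10000, iteration=25000)

# Polynomial policy with hypothetical base_lr, power and max_iter.
poly_start = poly_lr(0.01, power=0.5, max_iter=100, iteration=0)
poly_end = poly_lr(0.01, power=0.5, max_iter=100, iteration=100)
```

With `gamma=0.1`, each drop of the step policy divides the learning rate by 10, which corresponds to the "divided by 10" schedule described for the single-machine AlexNet solver.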
5.1.3 Classification on ImageNet 22K
ImageNet 22K is the largest public dataset for image classification, including 14,197,087 labeled images from 21,841 categories, and is rarely used by the research community due to its massive size and complexity. We experiment on ImageNet 22K to demonstrate the scalability of Poseidon. As no official test data exists for evaluation, following previous settings in [2, 6, 16], we randomly split the whole set into two parts, and use the first 7.1 million images for training and the remainder for testing. Similar to ILSVRC 2012, we resize all images to and report the top-1 test accuracy.
Settings. We design an AlexNet-like CNN architecture; specifically, the CNN takes a random crop from the original image, and forwards it through 5 CONV layers and 2 FC layers before making a prediction. The CNN has convolution filters with sizes and . Similar to AlexNet, the first, second and fifth CONV layers are followed by max pooling layers with size and stride 2. Two FC layers with 3,000 neurons each are put at the top of the network, followed by a softmax layer acting as a 21,841-way classifier, with 120M parameters and 1.8 billion connections overall. We train the CNN with full data parallelism by equally partitioning and distributing the training data across 8 GPU nodes. The batch size and staleness are fixed at 256 and 0, respectively. The network is trained using the step learning rate policy, with the base learning rate set to 0.005 and decreased 6 times.
Performance. Table 6 compares our result to those of previous work on ImageNet 22K: Adam, MXNet, and Le et al. Note that at this point a completely fair comparison between different frameworks is not possible, because the experimental protocol of ImageNet 22K is not standardized, not all of the source code is available yet, and large variations exist in system configurations, models, and implementation details. However, it is clear that Poseidon achieves accuracy competitive with the state of the art with shorter training time and fewer machine resources. Compared to Adam, we only use training time and machines to achieve accuracy with a similarly sized model. Promisingly, we achieve a higher training accuracy with 3 days of training using a well-established CNN model. This compares favorably to MXNet, which uses the whole set of 14.1 million images to train an inception-BN structure using 4 GPUs in a single machine without network communication, and reports train accuracy after 8.5 days of training.
|Framework||Data||# machines/cores||Time||Train accuracy||Test accuracy|
|Poseidon||7.1M ImageNet22K for training, 7.1M for test||8 / 8 GPUs||3 days|
|Adam||7.1M ImageNet22K for training, 7.1M for test||62 machines/?||10 days||N/A|
|MXNet||All ImageNet22K images for training, no test||1/4 GPUs||8.5 days||N/A|
|Le et al. w/ pretrain||7.1M ImageNet 22K, 10M unlabeled images for training, 7.1M for test||1,000/16,000 CPU cores||3 days||N/A|
5.2 Internal Comparisons
In this section, we conduct internal comparisons to study the effectiveness of DWBP and SACP in improving GPU utilization and reducing communication cost for GPU-based distributed deep learning. In addition, we report the speedups on throughput (i.e. number of images trained per second) in Fig.8 when training AlexNet and GoogLeNet using Poseidon on 8 GPU nodes with different staleness settings, compared to single machine Caffe.
5.2.1 DWBP and SACP
Since DWBP executes asynchronously in a multi-thread and multi-machine setting, it’s difficult to directly monitor how the communication and computation are overlapped. To measure the improvement by DWBP and SACP, we instead evaluate the speedups on throughput, which is defined as the number of images processed per second given a model and a batch size, compared to the single machine Caffe.
Fig.7 compares the speedups for training AlexNet and GoogLeNet under the following three settings with different numbers of nodes: (1) w/o DWBP: parallel training with traditional BP and full matrix communication; (2) w/ DWBP: parallel training with DWBP enabled; (3) w/ DWBP + SACP: parallel training with both DWBP and SACP enabled. We follow the standard setting, i.e. we set the staleness to 0 (BSP), and the batch size to for AlexNet and for GoogLeNet (different batch sizes lead to slightly different speedups on throughput). With DWBP overlapping communication with computation, the waiting time between two iterations is greatly reduced, so the throughput and, in turn, the GPU utilization improve significantly. Specifically, as Fig.7.(a) shows, for AlexNet, when training using 8 nodes, DWBP improves the speedup from to , with nearly more speedup. For GoogLeNet, which has fewer parameters, DWBP also brings more speedup.
With SACP enabled, the speedup on throughput is further improved. Particularly, when training on 8 nodes, although SACP may bring extra computation costs due to parameter matrix reconstructions, it still greatly increases the speedups of AlexNet training from to , with a improvement. For GoogLeNet with fewer FC layers, SACP provides approximately improvement on the speedup.
Clearly, the loss in throughput grows as the number of nodes increases. Specifically, when directly parallelizing AlexNet on an 8-node GPU cluster without any system or algorithm optimization, we suffer a loss in throughput compared to the ideal linear speedup. However, with DWBP and SACP enabled, we suffer less than loss, which brings Poseidon much closer to linear speedup.
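The quantities used in this comparison are simple ratios. The sketch below spells them out, assuming that the loss relative to linear speedup is measured as one minus the ratio of achieved to ideal speedup. The throughput values are hypothetical placeholders and are not measurements from the paper.

```python
def speedup(distributed_throughput, single_throughput):
    """Throughput speedup over single-machine training (images/second)."""
    return distributed_throughput / single_throughput

def loss_vs_linear(achieved_speedup, num_nodes):
    """Fraction of the ideal linear speedup that is lost on num_nodes nodes."""
    return 1.0 - achieved_speedup / num_nodes

# Hypothetical numbers: 100 images/s on one node, 600 images/s on 8 nodes.
s = speedup(600.0, 100.0)
lost = loss_vs_linear(s, num_nodes=8)
print(f"speedup: {s:.1f}x, loss vs. linear: {lost:.0%}")
```

In this made-up case an 8-node cluster reaches a 6x speedup, which corresponds to losing a quarter of the ideal linear speedup.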
5.2.2 SSP Consistency Model
In this section, we study the efficacy of the stale synchronous parallel (SSP) consistency model, which is a unique feature provided by Petuum, for scaling up distributed deep learning. Specifically, we compare the speedup on throughput of training AlexNet and GoogLeNet using Poseidon by varying the value of the staleness threshold , while keeping all other settings fixed. Setting the staleness value to zero (i.e. ) reduces the consistency model to bulk synchronous parallel (BSP), where computation uses local model copies that are synchronized only at the end of each iteration, and the next iteration may not start until all machines have received up-to-date model parameters. Therefore, with BSP the learning speed is limited by the slowest machine. Compared to BSP, a positive staleness value provides a short grace period for parameter synchronization between every two iterations, which enables us to manage the bandwidth for parameter exchanges according to the current bandwidth budget and the dirtiness of the updates.
As seen in Fig.8.(a), for AlexNet training, where the communication load is quite heavy, setting a positive value of greatly improves the throughput; with 4 nodes, the speedup of the fully BSP () is improved from to (). For GoogLeNet training on Poseidon in Fig.8.(b), a positive value of makes Poseidon agnostic to communication cost, i.e. we can enjoy near linear speedups of throughput.
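The difference between BSP and a positive staleness can be illustrated with a small clock tracker. This sketch uses one common formulation of the SSP condition, in which a worker may start its next iteration only if it is at most `staleness` iterations ahead of the slowest worker. The class and method names are hypothetical and do not reflect the Petuum API.

```python
class SSPClock:
    """Track per-worker iteration counts under a staleness bound."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers  # completed iterations per worker
        self.staleness = staleness

    def can_advance(self, worker):
        """True if `worker` may start its next iteration right now."""
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        """Record one completed iteration for `worker`."""
        if not self.can_advance(worker):
            raise RuntimeError("worker must wait for slower peers")
        self.clocks[worker] += 1

# Staleness 0 behaves like BSP: after one iteration, worker 0 must wait.
bsp = SSPClock(num_workers=2, staleness=0)
bsp.tick(0)
bsp_blocked = not bsp.can_advance(0)

# Staleness 2: worker 0 can run three iterations before it has to wait.
ssp = SSPClock(num_workers=2, staleness=2)
for _ in range(3):
    ssp.tick(0)
ssp_blocked = not ssp.can_advance(0)
```

The grace period lets a fast worker keep computing while updates from slower peers are still in flight, which is what allows communication to be spread over time.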
We present Poseidon, a highly scalable and efficient system architecture for large-scale deep learning on GPU clusters. Poseidon is built upon Petuum, and thus inherits many of Petuum's functionalities and benefits. Its design focuses on efficiently harnessing multiple, distributed GPUs on commodity hardware and Ethernet, in order to maximally scale up existing single-machine DL frameworks with a fully data-parallel scheme for distributed deep learning. We empirically evaluate Poseidon in terms of throughput, convergence and accuracy on image classification tasks with multiple standard datasets, and show that Poseidon is able to achieve state-of-the-art speedups in accelerating the training of modern CNN structures while guaranteeing correct convergence.
-  Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I. J., Bergeron, A., and Bengio, Y. Theano: Deep Learning on GPUs with Python. In NIPSW (2011).
-  Chilimbi, T., Apacible, Y. S. J., and Kalyanaraman, K. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI (2014).
-  Coates, A., Huval, B., Wang, T., Wu, D. J., Ng, A. Y., and Catanzaro, B. Deep Learning with COTS HPC Systems. In ICML (2013).
-  Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A Matlab-like Environment for Machine Learning. In NIPSW (2011).
-  Dai, W., Kumar, A., Wei, J., Ho, Q., Gibson, G., and Xing, E. P. Analysis of high-performance distributed ml at scale through parameter server consistency models. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (2015).
-  Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large Scale Distributed Deep Networks. In NIPS (2012).
-  Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M. L., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. Recent Advances in Deep Learning for Speech Research at Microsoft. In ICASSP (2013).
-  Facebook AI Research. https://github.com/facebook/fbcunn.
-  Ho, Q., Cipar, J., Cui, H., Kim, J. K., Lee, S., Gibbons, P. B., Gibson, G. A., Ganger, G. R., and Xing, E. P. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In NIPS (2013).
-  Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In MM (2014).
-  Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, University of Toronto, 2009.
-  Krizhevsky, A. One Weird Trick for Parallelizing Convolutional Neural Networks. In arXiv:1404.5997 (2014).
-  Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS (2012).
-  Kumar, A., Beutel, A., Ho, Q., and Xing, E. P. Fugue: Slow-worker-agnostic distributed learning for big models on big data.
-  Le, Q. V., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and Ng, A. Y. Building High-level Features Using Large Scale Unsupervised Learning. In ICML (2012).
-  Lloyd, W., Freedman, M. J., Kaminsky, M., and Andersen, D. G. Stronger Semantics for Low-Latency Geo-Replicated Storage. In NSDI (2013).
-  Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In ICLRW (2013).
-  Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. Sparknet: Training deep networks in spark. arXiv preprint arXiv:1511.06051 (2015).
-  MXNet. http://mxnet.readthedocs.org/.
-  Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. Tech. rep., DTIC Document, 1985.
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. IJCV (2015), 1–42.
-  Simonyan, K., and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR (2015).
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR (2015).
-  Vasilache, N., Johnson, J., Chintala, S., Piantino, S., and LeCun, Y. Fast Convolutional Nets With fbfft: A GPU Performance Evaluation. In ICLR (2015).
-  Wei, J., Dai, W., Qiao, A., Ho, Q., Cui, H., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Managed Communication and Consistency for Fast Data-parallel Iterative Analytics. In SoCC (2015).
-  Xie, P., Kim, J. K., Zhou, Y., Ho, Q., Kumar, A., Yu, Y., and Xing, E. Distributed Machine Learning via Sufficient Factor Broadcasting. In arXiv (2015).
-  Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. Petuum: A New Platform for Distributed Machine Learning on Big Data. In KDD (2015).
-  Yadan, O., Adams, K., Taigman, Y., and Ranzato, M. Multi-GPU Training of ConvNets. In ICLRW (2014).
-  Zou, Y., Jin, X., Li, Y., Guo, Z., Wang, E., and Xiao, B. Mariana: Tencent Deep Learning Platform and its Applications. In VLDB Endowment (2014).