When deployed on the cloud, deep learning (DL) training-as-a-service (TaaS) faces unique challenges: different customers upload their own training data and expect a model of high prediction accuracy to be returned shortly. Unlike academic researchers, customers have neither the expertise nor the resources to conduct time-consuming hyper-parameter (e.g., learning rate, mini-batch size) tuning. Hyper-parameter tuning itself is an unsolved and challenging research topic, and the tuning process is usually prohibitively expensive [9, 45]. The goal of TaaS is not to provide each user a dedicated hyper-parameter setup and the accompanying model, but to provide all users a unified DL model and a common hyper-parameter setup that still delivers a cutting-edge model to customers. As a result, industrial practitioners adopt a conservative hyper-parameter setup (e.g., small mini-batch size and small learning rate). On the other hand, a training system that can support such a conservative setup can easily support a less restrictive setup, which makes the hyper-parameter tuning turn-around time much shorter.
In a parameter server (PS) based DL system, such a conservative setup implies high-frequency communication with the PS. We provide a detailed analysis of the communication bandwidth requirement for real-world commercial workloads in Section 2.2 and Section 3. The analysis shows that the bandwidth requirement is beyond the capacity of advanced communication hardware (e.g., RDMA). Furthermore, with faster GPU devices and more efficient software library support, such as cuDNN, the communication cost of exchanging models between parameter servers and learners starts to dominate the training cost, which renders scale-out training unattractive.
In addition, the staleness issue inherent in distributed deep learning systems makes scaling out deep learning less cost-effective when training with more than a dozen GPU learners [36, 51]. As a result, one of the largest publicly known commercial deep learning training systems uses 8 GPUs. Scale-up deep learning training becomes the favored approach in the TaaS scenario.
Although powerful scale-up servers, such as NVIDIA DevBox, provide hardware platforms to improve the training performance, our evaluation (detailed in Section 6.3) reveals that the state-of-the-art scale-up software solutions are unable to make the best use of the underlying hardware. Two representative open-source scale-up deep learning frameworks are mpiT, an MPI-based ASGD framework (Asynchronous Stochastic Gradient Descent, defined in Section 2.1), and DataParallelTable (DPT), an nccl-based SSGD framework (Synchronous Stochastic Gradient Descent, defined in Section 2.1; nccl is a high-performance in-node GPU collectives implementation). The former solution incurs unnecessary memory copies between GPUs and the MPI runtime and is unable to implement the lock-free update proposed in HogWild!. The latter solution incurs synchronization barriers by forcing all GPUs to operate in lock-step, which leads to the straggler problem. Furthermore, neither solution provides a fault-tolerance mechanism, which makes both undesirable for commercial adoption.
To solve these issues, we introduce GaDei, a high-performance scale-up parameter server design that aims to efficiently coordinate model synchronization among GPUs located on the same machine. GaDei strives to pipeline the entire model synchronization, overlapping all the model training and data movement phases to eliminate GPU stalls. Specifically, GaDei implements three system optimizations: (1) communication via minimum memory copies; (2) a lock-free HogWild!-style weights update rule; and (3) on-device double buffering, along with GPU multi-streaming, to pipeline model training and parameter movement. GaDei enables training with a small mini-batch size, which mitigates the staleness issue to guarantee model convergence. By evaluating GaDei on a diverse set of real-world deep learning workloads, we demonstrate that GaDei is able to efficiently exploit the bandwidth offered by commodity scale-up servers, providing faster convergence with significantly higher training speedups compared to existing open-source solutions, such as mpiT and DPT.
Overall, this work has made the following contributions:
We have identified the key challenges in designing a training-as-a-service system: Hyper-parameters must be set conservatively (e.g., small mini-batch size and high model communication frequency) to guarantee model accuracy.
We have designed and implemented GaDei, a highly-optimized parameter server system, to deliver scale-up and resilient training for TaaS workloads on multi-GPU servers. GaDei enables efficient multi-learner training for arbitrary types of neural networks (e.g., CNN, RNN). The design principle of GaDei is independent of the underlying gradient-calculation building blocks and can complement any open-source DL framework (e.g., Torch and TensorFlow).
We have proved that GaDei’s system design guarantees both model convergence and deadlock freedom. To the best of our knowledge, GaDei is the only scale-up parameter server design that provides both fault-tolerance and a deadlock-free guarantee. We have systematically evaluated GaDei’s performance by using 6 deep learning workloads with 3 state-of-the-art deep learning models. Evaluation results demonstrate that GaDei often outperforms state-of-the-art solutions by an order of magnitude.
2 Background and Motivation
In this section, we introduce the background and define the terminologies (in bold font) used in this paper. Then we describe the characteristics of TaaS workloads. Finally, we theoretically justify why our design choice guarantees the acceptable model accuracy.
2.1 Terminology Definition
In essence, deep learning solves the following generic optimization problem

$$\min_{\theta} F(\theta) := \frac{1}{N}\sum_{n=1}^{N} f_n(\theta),$$

where $\theta$ is the parameter (or weights) vector we are seeking, $N$ is the number of samples, and $f_n(\theta)$ is the loss function for the $n$-th sample. $f(\theta)$ is typically in the form of a multi-layered neural network. Stochastic Gradient Descent (SGD) is the de facto algorithm to solve the deep learning optimization problem. SGD iterates over each training sample and applies Equation 1 to update weights:

$$\theta^{(k)}(i+1) = \theta^{(k)}(i) - \alpha \nabla\theta^{(k)}(i). \qquad (1)$$

In Equation 1, $i$ is the iteration number, $\theta^{(k)}$ is the $k$-th parameter, $\nabla$ is the differential operator, and $\alpha$ is the learning rate. Using a large learning rate may converge faster but it may also overshoot so that it does not converge at all, thus using a smaller learning rate is a safer choice in the production run. SGD passes through the entire training dataset several times until the model converges. Each pass is called an epoch (the number of epochs is denoted as $E$). To improve computation efficiency, one can group a number of samples (i.e., a mini-batch, the size of one mini-batch is denoted as $\mu$) and apply Equation 1 to update weights $\theta^{(k)}(i+1)$ for the $(i+1)$-th mini-batch.
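To accelerate deep learning training, practitioners usually adopt the Parameter Server (PS) architecture as illustrated in Figure 1. Each learner retrieves a mini-batch of training samples from data storage, calculates gradients and sends the gradients to the PS. PS then updates its weights using received gradients. Before calculating the next gradients, each learner pulls weights from the PS. We use $\lambda$ to represent the number of learners. The two most widely adopted parameter server communication protocols are Synchronous SGD (SSGD) and Asynchronous SGD (ASGD). In SSGD, the PS collects gradients from each learner and then updates weights following the rule defined in Equation 2, where $\nabla\theta^{(k)}_l$ is the gradient computed by learner $l$ and $\alpha$ is the learning rate:

$$\nabla\theta^{(k)}(i) = \frac{1}{\lambda}\sum_{l=1}^{\lambda}\nabla\theta^{(k)}_l, \qquad \theta^{(k)}(i+1) = \theta^{(k)}(i) - \alpha \nabla\theta^{(k)}(i). \qquad (2)$$

SSGD is mathematically equivalent to SGD when $\lambda \times \mu$ is equal to the mini-batch size used in SGD, where $\mu$ is the mini-batch size used by each learner.
SSGD is not computationally efficient because the PS stalls the learners until it finishes collecting the gradients from all learners and updating the weights. ASGD relaxes such constraints by applying the gradient update rule defined in Equation 3:

$$\nabla\theta^{(k)}(i) = \nabla\theta^{(k)}_l,\ l \in \{1, \dots, \lambda\}, \qquad \theta^{(k)}(i+1) = \theta^{(k)}(i) - \alpha \nabla\theta^{(k)}(i). \qquad (3)$$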
In ASGD, whenever PS receives a gradient from any learner, PS starts updating its weights. ASGD has an obvious runtime performance advantage over SSGD because learners do not wait for each other to start communicating with the PS. On the other hand, PS and the learners see different weights. PS always has the most up-to-date weights, and the discrepancy between the weights used in a learner and the weights stored on PS is measured by staleness. When PS updates the weights, it increments the weights' (scalar) timestamp by 1; staleness is defined as the difference between the timestamp of the learner's weights and the timestamp of PS's weights. A large staleness can cause learners to mis-calculate the gradients, which leads to a slower convergence rate [17, 19, 51, 46]. We explain why training with a small mini-batch size can effectively reduce staleness in Section 2.3. Additionally, several recent works [50, 22, 36, 51] demonstrate that ASGD can converge to a similar model accuracy as SGD after training with the same number of epochs, when the staleness is bounded in the system (typically up to a dozen learners in the system). Assuming $T_1$ is the time for a single-learner SGD algorithm to train $E$ epochs and $T_2$ is the time for the ASGD algorithm to train $E$ epochs, the ASGD speedup is defined as $T_1 / T_2$. For a fair comparison, one must also verify that the accuracy of the model trained by ASGD is similar to that of SGD after $E$ epochs.
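The difference between the two protocols, and the way staleness arises, can be illustrated with a toy parameter server. This is only a minimal single-process sketch: the class `ParameterServer` and its method names are hypothetical and not part of GaDei, gradients are plain Python lists, and the SSGD step assumes the collected gradients are averaged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParameterServer:
    """Toy parameter server: a weight vector plus a scalar timestamp."""
    weights: List[float]
    lr: float
    timestamp: int = 0  # incremented by 1 on every weight update

    def _apply(self, grad: List[float]) -> None:
        # weights <- weights - lr * grad, then bump the timestamp
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]
        self.timestamp += 1

    def ssgd_step(self, grads: List[List[float]]) -> None:
        # Synchronous: wait for every learner, average (assumed), update once.
        n = len(grads)
        avg = [sum(col) / n for col in zip(*grads)]
        self._apply(avg)

    def asgd_step(self, grad: List[float], learner_timestamp: int) -> int:
        # Asynchronous: update as soon as a single gradient arrives.
        # Staleness is the gap between the PS timestamp and the timestamp
        # of the weights the learner used to compute this gradient.
        staleness = self.timestamp - learner_timestamp
        self._apply(grad)
        return staleness


# Two learners pull the weights at timestamp 0 and push one gradient each.
ps = ParameterServer(weights=[1.0, -2.0], lr=0.1)
first = ps.asgd_step([1.0, 1.0], learner_timestamp=0)
second = ps.asgd_step([1.0, 1.0], learner_timestamp=0)
print("staleness seen by the two pushes:", first, second)
```

In this sketch the second learner's gradient was computed against weights that are one update behind, which is exactly the discrepancy the timestamp is meant to capture.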
2.2 Characteristics of the TaaS Workloads
In this section, we detail the characteristics of the training-as-a-service workloads by studying IBM Watson’s natural language classification (NLC) service, which is the most popular service on IBM Watson’s cognitive computing cloud and used by thousands of enterprise-level customers globally.
The NLC task is to classify input sentences into a target category in a predefined label set. NLC has been extensively used in practical applications, including sentiment analysis, topic classification, question classification, author profiling, intent classification, and even bug detection. The state-of-the-art method for NLC is based on deep learning [33, 41, 23, 27, 29]. Figure 2 illustrates the deep learning model used in the NLC service.
The NLC service is deployed as "training-as-a-service" in the cloud. After customers upload their in-house training data to the cloud, the NLC model training is triggered in the background. The NLC model is ready to use after the training completes. Hence, from the customer's perspective, the turn-around time is the model training time. Although deep learning brings superior classification accuracy, one known issue is the time-consuming training phase, which can severely affect the customer's experience. Minimizing the training time has become IBM Watson NLC’s top priority.
In order to improve the performance of neural network training, previous work has attempted to use scale-out frameworks to coordinate learners distributed on different computing nodes. Good speedup on image recognition tasks like ImageNet has been reported. However, this type of task does not represent the commercial TaaS workloads. By examining the real TaaS workloads, we find that the corpus size of training sets is generally less than 10k for most use cases, and the training sets usually come with a diverse set of labels. The reason is that in practice annotating training data is expensive for most customers. In addition, we have identified the following characteristics that are critical to the quality of the trained model:
(1) Large batch sizes can incur significant accuracy loss: It is well-known that using large mini-batch can improve GPU utilization and incur less demand for communication; however, this runtime performance improvement must not sacrifice model accuracy. Our field study reveals that the deep learning method used in this paper is on average 3%-6% more accurate than other much less computation-intensive methods (e.g., SVM) for NLC tasks. Therefore, an accuracy loss larger than 3%-6% will invalidate the use of deep learning for NLC tasks. To study the impact of different batch sizes on model accuracy, we use 4 representative NLC datasets (the detailed description of each task is given in Table 5) and evaluate the accuracy under different batch sizes in Figure 3. The batch size is increased from 1 to 128. Experimental results show that using large batch sizes would result in unacceptable accuracy loss for three out of four cases. When using large mini-batch size, models for challenging workloads (a large amount of labels and little training data for each label) such as Joule and Watt do not even converge. Other researchers have also observed that large batch size slows down convergence [48, 35, 28].
(2) Low communication frequency decreases model quality: Communication bandwidth is typically much lower in the cloud than in HPC systems. One way of mitigating the communication bottleneck is to allow each learner to process many mini-batches before it synchronizes with the parameter server. However, although less-frequent communication can effectively increase GPU utilization, it can severely decrease model accuracy, as shown in Table 1. Intuitively, less frequent communication causes higher discrepancy (i.e., staleness) between learners and the PS, which decreases model accuracy. Therefore, learners should communicate with the PS as frequently as possible. Ideally, each learner communicates with the PS after each mini-batch training.
(3) Conservative hyper-parameter configuration is imperative: Previous research focuses on how to speed up training for one specific dataset, and heavy hyper-parameter tuning is required to achieve the best possible accuracy. For example, Table 2 shows the typical hyper-parameter setups for training CIFAR and ImageNet with the AlexNet model. The setups vary greatly for different datasets. In contrast, TaaS users have neither the expertise nor the resources for hyper-parameter tuning. Customers just upload the training data and then expect a well-trained model to be ready in a short period of time. As a result, the hyper-parameters have to be preset to fixed values to cover diverse use cases. Table 3 describes the hyper-parameters used in NLC. For thousands of different datasets, NLC adopts a much simpler and more conservative setup. Note this is a significant difference from commonly evaluated workloads (e.g., CIFAR and ImageNet), where hyper-parameter tuning is specific to each dataset/model and usually is the result of a multi-person-year effort [9, 45]. In TaaS, conservative configurations with small batch size, high communication frequency, and small learning rate are commonly adopted to satisfy a wide range of users.
Table 2 (only one row recovered): |Number of epochs||100||40|
Table 3 (only one row recovered): |Number of epochs||200|
In summary, our study of TaaS workload characteristics suggests that scale-out solutions, such as distributed parameter server based frameworks, are unsuitable for a set of industry deep learning tasks that require a small mini-batch size and a high frequency of model exchange.
2.3 Theoretical Justification of Using Small Mini-batch
In the previous section, we have empirically demonstrated that training with a large batch size can cause unacceptable accuracy loss. Based on a recent theoretical study, we now justify why training with a small batch size can counter system staleness and is desired for distributed deep learning in general.
Theorem 1. Under certain commonly used assumptions, if the learning rate is chosen in the optimal way and the staleness $\tau$ is bounded by

$$\tau \le O\left(\sqrt{E/\mu^2}\right), \qquad (4)$$

where $E$ is the total number of epochs and $\mu$ is the mini-batch size, then the asynchronous parallel SGD algorithm converges in the rate

$$O\left(\sqrt{1/E}\right).$$
The result suggests the tolerance of the staleness $\tau$ relies on the mini-batch size $\mu$. First note that the convergence rate $O(\sqrt{1/E})$ is optimal. The prerequisite is that the staleness $\tau$ is bounded by $O(\sqrt{E/\mu^2})$. The staleness is usually proportional to the total number of learners. To satisfy the condition in (4), either $\mu$ should be small enough or the epoch number $E$ should be large enough. In other words, given the number of learners and the total epoch number (or the total computational complexity), a small mini-batch size is preferred. In addition, it also explains why a small mini-batch size is potentially preferred even for SGD (running on a single worker), since SGD with mini-batch size $\mu$ can be considered as running Async-SGD with mini-batch size $1$ with $\mu$ workers (it implies that the staleness is $O(\mu)$). Thus, SGD with a large mini-batch size is equivalent to Async-SGD with a small mini-batch size but a large staleness.
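The trade-off described above can be made explicit with a short worked step. This is a sketch under two assumptions that are not stated in this form in the text: that condition (4) reads $\tau \le C\sqrt{E}/\mu$ for some constant $C$, and that the staleness is proportional to the number of learners, $\tau \approx c\,\lambda$, with $\lambda$ the number of learners and $c$ a constant.

```latex
% Assumed forms: \tau \le C \sqrt{E} / \mu  and  \tau \approx c \lambda
\[
  c\,\lambda \;\le\; C\,\frac{\sqrt{E}}{\mu}
  \quad\Longleftrightarrow\quad
  \mu \;\le\; \frac{C}{c}\cdot\frac{\sqrt{E}}{\lambda}
  \quad\Longleftrightarrow\quad
  E \;\ge\; \left(\frac{c}{C}\,\lambda\,\mu\right)^{2}.
\]
```

Under these assumptions, doubling the number of learners at a fixed epoch budget halves the largest admissible mini-batch size, while keeping the mini-batch size fixed would require four times as many epochs.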
3 Communication Bandwidth Requirement in TaaS
In this section, we measure the computation time for different mini-batch sizes (i.e., $\mu$) over NLC workloads and public image classification workloads. We then calculate the minimum memory bandwidth requirement to achieve any speedup when the ASGD protocol is employed. Finally, we demonstrate why none of the existing scale-out or scale-up solutions can accelerate NLC workloads.
Each learner’s execution loop consists of three components: $T_{train}$ (gradient calculation), $T_{pull}$ (pull weights), and $T_{push}$ (push gradients). Each parameter server’s execution loop contains three components: $T_{receive}$ (receive gradients), $T_{apply}$ (apply weights update), and $T_{send}$ (send weights). When $\mu$ is large, $T_{train} \gg T_{pull} + T_{push}$; when $\mu$ is small, time spent on PS becomes the critical path. In ASGD, sending weights and receiving gradients operations may overlap. To achieve any speedup, we then must have $T_{train} \ge T_{receive} + T_{apply}$ ((i) apply update and receive gradients cannot overlap, since apply update can only start when gradients are fully received; (ii) we assume the learner can push gradients and receive weights instantaneously). Note that the apply update operation is a memory-bound level 1 BLAS operation. The combined memory bandwidth between GPU and CPU (gradients transfer) and the memory bandwidth used in CPU DRAM (weights update) are of the same order of magnitude. Further, gradients and weights are of the same size. We can now infer that the required overall communication bandwidth to observe any speedup is at least $2 \times \mathrm{ModelSize} / T_{train}$, where $T_{train} = \mathrm{TPE} \times \mu / N$ is the computation time of one mini-batch. For the NLC workload and the image recognition workload, Table 4 records the Training time Per Epoch (TPE), the number of training samples ($N$), and the Model Size, and calculates the minimum Required Bandwidth (RB) to observe any speedup.
$\mu$ is the mini-batch size, TPE stands for Time Per Epoch, RB stands for Required Bandwidth. *Both CIFAR and ImageNet use models that use Batch Normalization (BN), which requires.
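The required-bandwidth estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the bound takes the form RB = 2 × ModelSize / (time per mini-batch), with the time per mini-batch derived from the time per epoch, as the argument above suggests. The function name is hypothetical, and the time-per-epoch value used in the example is an illustrative placeholder, not a measurement from Table 4; only the Joule model size and sample count come from Table 5.

```python
def required_bandwidth_gbps(model_size_mb: float,
                            time_per_epoch_s: float,
                            num_samples: int,
                            batch_size: int) -> float:
    """Minimum bandwidth in GB/s, assuming RB = 2 * model size / time per mini-batch."""
    batches_per_epoch = num_samples / batch_size
    time_per_batch_s = time_per_epoch_s / batches_per_epoch
    # One model-sized gradient transfer plus one model-sized weight update.
    bytes_moved = 2 * model_size_mb * 1e6
    return bytes_moved / time_per_batch_s / 1e9


# Joule: 7.69 MB model, 2.4k samples (Table 5); 3.6 s per epoch is a
# hypothetical value chosen only to illustrate the calculation.
rb = required_bandwidth_gbps(7.69, 3.6, 2400, batch_size=1)
print(f"required bandwidth: {rb:.2f} GB/s")
```

If the time per epoch stayed the same, doubling the mini-batch size would halve the required bandwidth in this model, which is the reason small mini-batches are so demanding on communication.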
Why a scale-out solution will never work? From Table 4, it is easy to see that a 10GB/s bandwidth network is required to achieve any speedup for NLC workloads with the appropriate mini-batch size. In addition, to achieve an $X$-fold ($X > 1$) speedup, we need to multiply RB by a factor of $X$. Such a demanding bandwidth is beyond the capacity of advanced network techniques (e.g., RDMA). Note that RB is also quite close to the peak memory bandwidth (e.g., PCI-e, DRAM), which indicates that any extra memory copy may make speedup impossible. Thus, it is natural to infer that the only viable PS architecture is a tightly coupled multi-GPU system collocated on the same server that minimizes data copies and enables learners to asynchronously push gradients and pull weights.
Why existing scale-up solutions are insufficient? Among popular open-source deep learning frameworks, Caffe, Torch and TensorFlow support multi-GPU training on the same node. However, they are designed for tasks where heavy hyper-parameter tuning is allowed, so a larger mini-batch size (e.g., 256) may be appropriate. It is easy to see from Table 4 that it requires much higher communication bandwidth to support a small mini-batch than to support a large mini-batch. In addition, Caffe and TensorFlow only support SSGD on one node. We have demonstrated in Section 2.2 that some of the workloads require the mini-batch size to be as small as 2, which means Caffe and TensorFlow can make use of at most 2 GPUs (e.g., each GPU works with a mini-batch size of 1). Torch is the only open-source DL framework that supports both SSGD (via DPT) and ASGD (via mpiT) on a single node. However, as demonstrated in Section 6.3, neither DPT nor mpiT can efficiently use the memory bandwidth on the same node. Furthermore, none of the existing solutions provides a fault-tolerance mechanism in the scale-up setting.
4 Design and Implementation
4.1 Overall Design
GaDei strives to minimize memory copies and enable high concurrency to maximize communication bandwidth utilization. Figure 4 depicts its design. To minimize memory copies, PS and learners use a shared-memory region to exchange gradients and weights. Each learner has a fixed number of slots in the producer-consumer queue, thus the entire system can be viewed as $\lambda$ single-producer-single-consumer queues, where $\lambda$ is the number of learners. PS updates weights in place (i.e., HogWild! style). We use 4 OpenMP threads and unroll the weights update loop 8 times to maximize DRAM throughput. To maximize system concurrency, each learner creates two additional threads: a push thread and a pull thread. On the same GPU device, each learner maintains an on-device gradient staging buffer, so that after a learner finishes gradient calculation it can store the gradients in the buffer and continue with the next gradient calculation without waiting for the completion of the push. On-device memory bandwidth is usually several hundred GB/s, which is much faster than device-host memory bandwidth (typically 10 GB/s). By buffering gradients on the same device, the learner can train continuously while the push thread is pushing gradients to PS. Similarly, each learner also maintains a weights staging buffer on the same device. Learners do not communicate with each other; they only communicate with the parameter server.
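The view of the system as one queue per learner, drained by a PS that updates weights in place, can be sketched as follows. This is a simplified single-threaded model, not GaDei's implementation: the function name is hypothetical, the queues are ordinary `deque` objects rather than shared-memory slots, learners are visited in a fixed cyclic order, and the loop stops once every queue is empty instead of busy-waiting.

```python
from collections import deque
from typing import Deque, List


def ps_drain_queues(queues: List[Deque[List[float]]],
                    weights: List[float],
                    lr: float) -> int:
    """Drain per-learner gradient queues in cyclic order, updating weights in place."""
    processed = 0
    while any(queues):                 # an empty deque is falsy
        for q in queues:               # visit learners one after another
            if not q:
                continue               # nothing queued by this learner
            grad = q.popleft()         # take one gradient per visit
            for j, g in enumerate(grad):
                weights[j] -= lr * g   # in-place update: no copy of the weights
            processed += 1
    return processed


learner_queues = [deque([[1.0, 0.0]]), deque(), deque([[0.0, 2.0], [1.0, 1.0]])]
shared_weights = [0.0, 0.0]
count = ps_drain_queues(learner_queues, shared_weights, lr=0.5)
print(count, shared_weights)
```

Because each queue has exactly one producer (its learner) and one consumer (the PS), no learner ever contends with another learner for a slot, which is what makes the single-producer-single-consumer view useful.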
Figure 5 details the necessary logic of PS and learners. We use the following naming conventions: variables that start with 'g_' (e.g., g_env, g_param_ptr) represent variables shared between PS and learners; other variables (e.g., pullCnt, pushCnt) are variables shared between a learner’s main thread and its communication threads. PS (Figure 5(a)) iterates over the gradient queues in a round-robin fashion and busy-loops when all the queues are empty. PS does not yield the CPU via a condition variable wait, because PS demands the most CPU cycles to process gradients and it is beneficial to have PS take over gradients whenever they are ready. Figure 5(c) illustrates the logic of the learner main thread. It calculates gradients on the GPU’s default stream. The learner main thread communicates with the push thread (Figure 5(d)) via a producer-consumer queue of size 1 (Lines 105-107 in Figure 5(c) and the corresponding Lines 203-206 and 220-222 in Figure 5(d)), i.e., variable pushCnt in Figure 5(c) and Figure 5(d) alternates between 0 and 1. The push thread operates on a separate stream so that it can send gradients concurrently with the learner thread calculating the gradients. Similarly, the learner main thread communicates with the pull thread (Figure 5(b)) via a producer-consumer queue of size 1 (Lines 115-125 in Figure 5(c) and the corresponding lines in Figure 5(b)), i.e., variable pullCnt in Figure 5(c) and Figure 5(b) alternates between 0 and 1. If the weights in the learner thread are current (i.e., they have the same timestamp as the weights on PS), pullThd skips the pull request in this iteration. By default, the PS updates weights at Line 7 in Figure 5(a) in a lock-free fashion (i.e., an incarnation of the HogWild! algorithm). GaDei also supports protecting weight updates from concurrent pulling via a read-write lock.
Note that the synchronization call invoked during the execution makes sure that memory is flushed in place (either between host and device or within the same device) with respect to the same stream. As a result, the corresponding memory copies between CPU and GPU or within the GPU strictly follow program order, which is necessary for the protocol verification described in the next section.
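The size-1 producer-consumer handoff between the learner main thread and its push thread can be sketched with a condition variable. This is an illustrative simplification rather than GaDei's code: the class and function names are hypothetical, a single `threading.Condition` stands in for the separate mutexes and condition variables of Figure 5, and plain Python objects stand in for the on-device staging buffer.

```python
import threading


class SingleSlotHandoff:
    """Size-1 producer-consumer queue; cnt alternates between 0 and 1."""

    def __init__(self):
        self._cond = threading.Condition()
        self.cnt = 0        # plays the role of pushCnt
        self._slot = None   # stands in for the gradient staging buffer

    def put(self, item):
        with self._cond:
            while self.cnt == 1:     # slot still full: wait for the consumer
                self._cond.wait()
            self._slot = item
            self.cnt = 1
            self._cond.notify_all()

    def get(self):
        with self._cond:
            while self.cnt == 0:     # slot empty: wait for the producer
                self._cond.wait()
            item, self._slot = self._slot, None
            self.cnt = 0
            self._cond.notify_all()
            return item


def run_demo(num_items=5):
    handoff = SingleSlotHandoff()
    received = []

    def push_thread():
        for _ in range(num_items):
            received.append(handoff.get())   # "send" one staged gradient

    worker = threading.Thread(target=push_thread)
    worker.start()
    for i in range(num_items):               # main thread stages "gradients"
        handoff.put(i)
    worker.join()
    return received, handoff.cnt
```

The `while` loops around `wait()` mirror the condition check loops discussed in the verification section: a woken thread re-checks the counter before proceeding, so the counter never leaves the set {0, 1}.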
4.2 Verification of GaDei’s communication protocol
It is difficult to detect, avoid and fix concurrency bugs. GaDei relies on heavy communication between CPU threads, CPU-GPU interaction, and multi-stream operation within the same GPU device. It is imperative to verify the correctness of its communication protocol. In this section, we prove GaDei is deadlock-free in Theorem 2 and verify GaDei’s liveness property in Theorem 3.
Lemma 1. The value of pushCnt and pullCnt can only be 0 or 1. For the idx-th learner, 0 ≤ g_env[idx].cnt ≤ g_env[idx].sz.
Proof. We will show that pushCnt and pullCnt cannot be larger than 1 or smaller than 0. A single statement in the learner main thread (Fig. 5(c)) is the only place where pushCnt can be incremented. If pushCnt could be larger than 1, then there must be an iteration where pushCnt is 1 before executing that statement, since the increment is 1 per iteration. However, this is impossible because the statement is not reachable in that case due to the condition check loop that guards it. Similarly, a single statement in the push thread (Fig. 5(d)) is the only place where pushCnt can be decremented. It cannot be smaller than 0 due to the condition check loop that guards the decrement. Therefore, pushCnt can only be 0 or 1. The claims about pullCnt and g_env[idx].cnt can be proved in the same way. ∎
Lemma 2. Once signaled, a thread blocked by a condition wait (line 106, 117, 205, 209 or 305 in Fig. 5) will wake up and exit the corresponding condition check loop.
Proof. We will show that the loop conditions do not hold when signals are sent. For example, only the push thread (Fig. 5(d)) can wake up the condition wait on pushEmpty in the training thread. According to Lemma 1, pushCnt can only be 0 or 1. When the signal is sent, pushCnt is always 0, which invalidates the loop condition of that wait. Therefore, the training thread will exit the condition check loop. Similarly, we can prove that the claim is true for the other condition waits. ∎
Lemma 3. The wait-for graph formed by condition waits only, i.e., pthread_cond_wait, is acyclic and thus deadlock-free.
Proof. The directed edges in Fig. 6 represent the wait-for relations among wait and signal statements. The edges in blue and green form two cycles. We will show that some edges in a cycle cannot exist at the same time.
In the blue cycle, one edge represents that the push thread waits until pushCnt != 0. The other blue edge indicates that the training thread waits until pushCnt != 1. Given that pushCnt can only be 0 or 1 (Lemma 1), the two threads cannot both be blocked at the same time, because that would require pushCnt to be 0 and 1 simultaneously. Therefore, these two edges cannot exist together and the cycle in blue is infeasible. In the green cycle, one edge illustrates that the pull thread waits until pullCnt = 0. The other green edge indicates that the training thread waits until pullCnt = 1. Similarly, this cycle is infeasible too. ∎
Lemma 4. Mutex lock operations in Fig. 5 are deadlock-free.
Proof. Now consider operations based on mutex locks in Fig. 6. One necessary condition for deadlock is hold-and-wait. If threads are not holding one resource while waiting for another, there is no deadlock. However, only the push thread can be in a hold-and-wait state. Since no other thread holds one lock while waiting for another, a circular wait cannot form. Therefore, mutex lock operations cannot introduce deadlocks. ∎
Theorem 2. GaDei is deadlock-free.
Proof. Lemmas 3 and 4 show that neither condition waits nor mutex lock operations can cause deadlocks. Now we consider them together. In Fig. 6, four of the five condition waits are not in any atomic region and thus are isolated from mutex locks. Hence, they cannot introduce deadlocks. The only remaining case is the condition wait that is guarded by lock pushMtx. However, the statement that sends the signal it waits for is protected by a different lock, gradMtx. This does not satisfy the hold-and-wait condition. Therefore, it cannot lead to deadlocks either. ∎
Theorem 3. The parameter server thread processes each gradient exactly once.
Proof. The push thread of learner idx shares gradients with the parameter server using a shared array g_env[idx]. Essentially, it is an array-based FIFO queue, where g_env[idx].cnt indicates the number of gradients in the queue. It is straightforward to see that neither the push thread nor the parameter server can access the same memory location in two consecutive iterations.
Similar to the proof for Lemma 1, we can show that 0 ≤ g_env[idx].cnt ≤ g_env[idx].sz, so that g_env[idx].sz is the max size of the queue. In addition, as the parameter server only processes a learner’s gradients if its queue is not empty, the read pointer can never be ahead of the write pointer. So, the parameter server cannot read an outdated gradient and use it more than once. Similarly, when the queue reaches its max size, the push thread must wait. So, the push thread cannot overwrite unprocessed gradients in the queue. ∎
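The exactly-once argument can be illustrated with a small array-based FIFO whose counter stays between 0 and its capacity. This is a single-threaded sketch under stated assumptions: the class `GradientRing` and the index names `head` and `tail` are hypothetical, and waiting is modeled by returning `False` or `None` instead of blocking on a condition variable.

```python
class GradientRing:
    """Array-based FIFO with a fixed number of slots (single-threaded model)."""

    def __init__(self, sz):
        self.sz = sz                # capacity, like g_env[idx].sz
        self.slots = [None] * sz
        self.cnt = 0                # unprocessed gradients, 0 <= cnt <= sz
        self.head = 0               # read index (hypothetical name)
        self.tail = 0               # write index (hypothetical name)

    def try_push(self, grad):
        if self.cnt == self.sz:
            return False            # full: the push thread would wait here
        self.slots[self.tail] = grad
        self.tail = (self.tail + 1) % self.sz
        self.cnt += 1
        return True

    def try_pop(self):
        if self.cnt == 0:
            return None             # empty: the PS skips this learner
        grad = self.slots[self.head]
        self.head = (self.head + 1) % self.sz
        self.cnt -= 1
        return grad


ring = GradientRing(sz=2)
ring.try_push("g0")
ring.try_push("g1")
rejected = not ring.try_push("g2")   # queue is full, nothing is overwritten
order = [ring.try_pop(), ring.try_pop(), ring.try_pop()]
print(rejected, order)
```

A push is refused while the queue is full and a pop is refused while it is empty, so no slot is overwritten before it is read and no slot is read twice.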
In GaDei, each learner and the PS has its own address space; learners and PS communicate via mmap-ed memory. This approach is similar to Grace, which transforms a multi-threaded program into a multi-process program communicating via shared memory. When GaDei starts, learners and PS mmap the same memory file, which holds the pre-allocated gradient queues, the shared weights, and thread-related synchronization variables (e.g., mutexes and condition variables, both with the PTHREAD_PROCESS_SHARED attribute set).
PS periodically communicates with the watchdog process to log its progress and checkpoint the parameters.
When some, but not all, of the learners unexpectedly die, PS continues to process gradients collected from the learners that are still alive, so that failures of the dead learners are naturally isolated. Note that PS is stateless in that it only needs to process a fixed number of gradients without considering which learners produced the gradients.
If all learners die, PS no longer makes progress, so the watchdog process kills the PS and restarts PS and learners from the last checkpoint. If a learner dies while holding a lock, PS will hang when it tries to grab the lock. The watchdog process will then detect the failure and take action. Alternatively, one may set the robust attribute of the pthread mutex so that when PS is grabbing the lock it can notice the failed learner and skip checking that learner’s gradient queue.
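The failure-isolation behavior described above can be mimicked with a small simulation. This is a toy model under stated assumptions, not GaDei's recovery code: the function name and its parameters are hypothetical, learners produce one gradient per round, and a learner's death is modeled as a cap on how many gradients it ever produces.

```python
def simulate_failures(num_learners, total_gradients, dies_after):
    """Toy model: the PS needs total_gradients, regardless of who produces them.

    dies_after maps a learner id to the number of gradients it produces
    before dying; learners not listed never die.
    """
    produced = [0] * num_learners
    processed = 0
    while processed < total_gradients:
        alive = [l for l in range(num_learners)
                 if produced[l] < dies_after.get(l, float("inf"))]
        if not alive:
            # No progress is possible: a watchdog would restart from a checkpoint.
            return processed, produced, "restart_from_checkpoint"
        for l in alive:
            if processed == total_gradients:
                break
            produced[l] += 1      # learner l pushes one gradient
            processed += 1        # PS processes it without caring who sent it
    return processed, produced, "done"


print(simulate_failures(3, 9, {1: 1}))
print(simulate_failures(2, 10, {0: 2, 1: 3}))
```

In the first call one learner dies early and the remaining learners supply the rest of the gradient budget; in the second call every learner eventually dies, which corresponds to the case where the watchdog has to restart from the last checkpoint.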
4.4 Additional Design Decisions
Computation optimization inside GaDei. The major computation in GaDei’s PS server is element-wise vector addition for updating the global weights. We unroll the weights update loop on the parameter server side 8 times in the OpenMP parallel section, and found that it outperforms the non-unrolled version by 30%. In addition, it outperforms the version in which we directly use Streaming SIMD Extensions (SSE) instructions.
Half-precision floating point operations. Recent research work [24, 25] demonstrates that it is possible to train DNNs with lower precision and still obtain comparable model accuracy. Compared to 32-bit single precision, the 16-bit half-precision data format decreases the parameter server's processing burden. The latest GPU architectures may support 16-bit half-precision floating-point operations, while general-purpose CPUs do not. We integrated software-based 16-bit CPU floating-point operations in GaDei. However, this slowed down the computation by a factor of 5, which makes any speedup impossible for our workloads.
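The appeal and the cost of the 16-bit format can be seen with Python's `struct` module, which supports the IEEE 754 half-precision format code `e`. This is only an illustration of the data format: the helper names are hypothetical, and the snippet does not reproduce the software-based CPU arithmetic that was integrated into GaDei.

```python
import struct


def to_half_and_back(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def payload_bytes(num_params: int, fmt: str) -> int:
    """Bytes needed to ship num_params values in the given struct format code."""
    return num_params * struct.calcsize("<" + fmt)


num_params = 1_000_000
print("float32 payload:", payload_bytes(num_params, "f"), "bytes")
print("float16 payload:", payload_bytes(num_params, "e"), "bytes")
print("0.1 after a half-precision round trip:", to_half_and_back(0.1))
```

Halving the payload halves the bytes the parameter server has to move per update, but every value loses precision, and on a CPU without native half-precision arithmetic each conversion is done in software.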
Lock-free producer-consumer queue. A lock-free producer-consumer queue was considered in our design. A lock-free design can minimize the inter-process/thread latency, but it would fully occupy one CPU core for each learner. As a result, the computation threads would have to compete with the lock-free communication threads, which would impede the computation performance significantly. In GaDei, when learners have to wait for the server to process gradients, yielding CPUs via a condition variable wait saves CPU cycles for the PS computation threads to process gradients.
Table 5: Task description and training data statistics.
|Task||Data Size||Label Size||Type||Description||Model Size||Network|
|Joule||2.4k||311||NLC||Question answering task in insurance domain; label represents answers.||7.69MB||CNN|
|Watt||7k||3595||NLC||Question answering task in online service domain; label represents answers.||20.72MB||CNN|
|Yelp||500k||5||NLC||Customer review classification; label represents the star the customer assigns to the business.||98.60MB||CNN|
|Age||68k||5||NLC||Author profiling task; the label represents the age range of the author.||72.86MB||CNN|
|MR||8.6k||2||NLC||Sentiment analysis of movie reviews; label represents positive/negative attitude of the audience.||14.27MB||CNN|
|CIFAR||50k||10||IR||Classify images into predefined categories.||59.97MB||VGG|
|ImageNet||1280k||1000||IR||Classify images into predefined categories.||244.48MB||AlexNet|
We use the open-source toolkit Torch as the building block to calculate the gradients of neural nets. Torch is a scientific computing framework based on Lua/LuaJIT with both CPU and CUDA backends. Torch has been widely used both in industry (e.g., Facebook, Google DeepMind, and IBM) and in the academic community. Researchers at Google, Bosch, and Facebook have benchmarked several commonly used open-source deep learning frameworks (e.g., Caffe, Theano, Torch, and TensorFlow) and found that Torch usually outperforms other frameworks. In addition, Torch is the only framework that supports ASGD on one node. Thus, enabling further speedup on top of Torch presents a bigger challenge for the system design.
We use four representative NLC datasets in our evaluation. Two datasets, Joule and Watt, are in-house customer datasets. The other two, Age and Yelp, are publicly available. The Joule and Watt datasets represent the typical small datasets present in NLC workloads. The Age dataset represents a medium-size dataset, and the Yelp dataset represents a large-size dataset. The learning rate is fixed at 0.01 across workloads, and GaDei trains each NLC workload for 200 epochs. The neural network model used for the four datasets is presented in Figure 2. To demonstrate that GaDei can be used as a drop-in replacement for tools that solve non-TaaS tasks, we also conducted experiments on the image recognition tasks CIFAR and ImageNet. We train CIFAR using the VGG model and ImageNet using the AlexNet model. We use the widely adopted hyper-parameter setup, as described in [49, 12], to train the CIFAR and ImageNet tasks and demonstrate that no accuracy loss is incurred. Table 5 records the task descriptions and training data statistics.
The experiments were conducted on the SoftLayer cloud (http://www.softlayer.com). The server is equipped with two Intel Xeon E5-2690-V3 processors. Each processor has 12 physical cores, clocked at 2.66GHz per core. To enable the best possible PS CPU processing speed, we turn off SMT. The CPU memory capacity is 128GB, with a peak memory bandwidth of 40GB/s. Two NVIDIA Tesla K80s are installed on the server. Each K80 has two GPUs, so there are four GPUs on the server in total, with a combined 16 TFlops. The bus interface of the K80 is PCIe 3.0 x16, with 12Gbps of bi-directional bandwidth per lane.
6 Experiment Results
In this section, we evaluate GaDei's ability to achieve good model accuracy and runtime performance against the widely used, state-of-the-art open-source multi-GPU implementations DPT and mpiT, on both commercial and public workloads. Section 6.1 demonstrates how the model accuracy progresses w.r.t. training epochs. Section 6.2 illustrates the runtime performance of GaDei. Section 6.3 illustrates how GaDei achieves good speedup on challenging NLC workloads, while other state-of-the-art tools cannot achieve any speedup. GaDei can achieve speedup using a much smaller mini-batch size than any other tool. Section 6.4 discusses GaDei's ability to handle fault-tolerance and GPU over-subscription.
6.1 Convergence result
In Figure 7, we plot the model accuracy w.r.t. the training epochs when using 1, 2, 3, and 4 learners. Joule and Watt converge to 60%, Age to 80%, Yelp to 62%, CIFAR to 90%, and ImageNet to 55%. With multiple learners, the model reaches the same level of accuracy as the single-learner system within the same number of training epochs.
GaDei converges to the same level of accuracy as single-learner SGD using the same number of epochs. This demonstrates that a tightly-coupled system such as GaDei can mostly avoid the staleness issue introduced in a typical ASGD system when using a small mini-batch size.
6.2 Speedup Results and Memory Bandwidth Analysis
|Workload||Model Size||Epochs||Running Time|
Table 6 records the single-learner performance baseline for comparison. Figure 8 plots the speedup of GaDei. When running on challenging commercial IBM Watson NLC workloads, GaDei achieves on average a 1.5X - 3X speedup when using up to 4 learners. When running on public image recognition benchmark tasks, GaDei achieves near-linear speedup. Table 7 reports the memory bandwidth utilized by GaDei, computed by dividing the total amount of data transferred between the learners and the parameter server by the total runtime. When running Watson workloads on 4 GPUs, GaDei sustains 36-55GB/s of bandwidth, which is close to the hardware limit.
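The bandwidth figure is a simple ratio of data moved to elapsed time. The sketch below works through that arithmetic under stated assumptions: that each update moves the model twice (one gradient push and one weight pull), that 1 MB is 10^6 bytes and 1 GB is 10^9 bytes, and that the input numbers are hypothetical round values rather than measurements from the paper.

```python
MB = 1e6  # assumption: decimal megabytes
GB = 1e9  # assumption: decimal gigabytes

def updates_per_epoch(num_samples, batch_size):
    """Number of mini-batch updates needed to cover the data once (ceiling)."""
    return -(-num_samples // batch_size)

def bytes_transferred(model_bytes, num_updates, copies_per_update=2):
    """Total traffic, assuming each update pushes gradients and pulls weights."""
    return model_bytes * num_updates * copies_per_update

def utilized_bandwidth_gb_s(total_bytes, runtime_seconds):
    """Total data transferred divided by total runtime, in GB/s."""
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return total_bytes / runtime_seconds / GB

# Hypothetical inputs: a 10 MB model, 500 updates, finished in 2 seconds.
total = bytes_transferred(10 * MB, num_updates=500)
print(utilized_bandwidth_gb_s(total, runtime_seconds=2.0))  # 5.0
```

Because the number of updates grows as the mini-batch size shrinks, the same model and dataset demand proportionally more bandwidth under a conservative hyper-parameter setup.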
|Memory Bandwidth Utilized(GB/s)|
GaDei achieves near-linear speedup on public datasets and models, and achieves good speedup on challenging commercial workloads. GaDei comes close to saturating the hardware memory bandwidth.
6.3 Compare with DPT and mpiT
We compare GaDei's speedup with that of mpiT and DPT, two state-of-the-art scale-up deep learning training frameworks. Figure 9 shows that GaDei consistently outperforms the other tools. For NLC workloads, DPT and mpiT actually slow down the execution. DPT has inferior performance because it is an SSGD-style implementation: the PS blocks all learners when updating weights, whereas GaDei implements an ASGD-style synchronization protocol. mpiT has inferior performance because it does not minimize memory copies (i.e., there is at least one extra copy from PS/learners to the MPI runtime), and its message-passing style implementation makes HogWild! lock-free weight updates impossible.
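The difference between the two update schedules can be sketched as follows. This is a simplified illustration, not the code of DPT, mpiT, or GaDei: the function names are hypothetical, weights are plain Python lists, and the sketch ignores gradient staleness and HogWild!-style concurrent writes. It only shows that the synchronous server must collect every learner's gradient before one update, while the asynchronous server updates as each gradient arrives.

```python
def ssgd_step(weights, gradients_from_all_learners, lr):
    """Synchronous SGD: wait for all learners, average, then update once."""
    n = len(gradients_from_all_learners)
    averaged = [sum(g[i] for g in gradients_from_all_learners) / n
                for i in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, averaged)]

def asgd_step(weights, gradient, lr):
    """Asynchronous SGD: apply a single learner's gradient on arrival."""
    return [w - lr * g for w, g in zip(weights, gradient)]

# Hypothetical values for two learners.
w0 = [1.0, 2.0]
grads = [[0.5, 0.0], [1.5, 1.0]]

# SSGD: one update after both gradients are available.
w_sync = ssgd_step(w0, grads, lr=0.5)

# ASGD: two updates, one per arriving gradient; no learner waits for another.
w_async = w0
for g in grads:
    w_async = asgd_step(w_async, g, lr=0.5)

print(w_sync)   # [0.5, 1.75]
print(w_async)  # [0.0, 1.5]
```

In the synchronous schedule the slowest learner determines when the update happens, which is the blocking behavior the comparison above attributes to DPT.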
The smaller the batch size a system can support, the higher the probability that a deep learning model reaches a desirable model accuracy. GaDei supports a much smaller batch size than other tools, often the smallest possible size, constrained only by the underlying model.
GaDei significantly outperforms state-of-the-art open-source implementations. It achieves good speedup on commercial workloads, whereas existing open-source implementations slow down the execution. On public benchmarks, GaDei also outperforms existing open-source alternatives. GaDei can speed up workloads using a batch size that is an order of magnitude smaller than that of other state-of-the-art tools.
6.4 Fault tolerance and over-subscription of learners
We randomly kill learners and verify that GaDei can always finish training with the desired model accuracy. In contrast, mpiT runs on top of MPI, which traditionally does not provide a fault-tolerance mechanism. DPT orchestrates multiple learners in the same process; thus, when one learner fails, the entire process is killed. Further, nccl implements blocking collective operations, and their behavior in the presence of failure is not defined.
GaDei supports over-subscription of GPUs, i.e., running more learners than there are GPUs. Recurrent neural networks, such as long short-term memory (LSTM), are known to be difficult to parallelize efficiently due to complicated computation dependencies. When such a model is adopted, GPUs usually operate at much lower efficiency than when running CNNs. By over-subscribing GPUs, GaDei can fully utilize GPU resources. In another cognitive task, where an LSTM is utilized, GaDei achieves a 7-fold speedup when running 8 learners on 4 GPUs. We are also able to deploy 16 learners on 4 GPUs for NLC tasks to extrapolate their convergence behavior as if we had a 16-GPU server installation.
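One simple way to place more learners than GPUs is a round-robin mapping from learner rank to device index. The source does not state how GaDei assigns learners to devices, so the policy and the function name below are assumptions made purely for illustration.

```python
from collections import Counter

def assign_learners_to_gpus(num_learners, num_gpus):
    """Round-robin placement (an assumed policy): learner i runs on GPU i mod num_gpus."""
    if num_gpus <= 0:
        raise ValueError("need at least one GPU")
    if num_learners < 0:
        raise ValueError("number of learners cannot be negative")
    return {learner: learner % num_gpus for learner in range(num_learners)}

# The 8-learners-on-4-GPUs configuration mentioned in the text.
placement = assign_learners_to_gpus(num_learners=8, num_gpus=4)
load = Counter(placement.values())

print(placement)   # {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}
print(dict(load))  # {0: 2, 1: 2, 2: 2, 3: 2}
```

With this mapping, each GPU hosts two learners, so a device that would otherwise idle during one learner's dependent steps has a second learner's work available.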
GaDei supports fault-tolerance and GPU over-subscription. To the best of our knowledge, GaDei is the only scale-up deep learning system that supports fault-tolerance.
7 Related work
Deep learning has seen tremendous success in image recognition [21], speech translation, and gaming. The parameter server based approach is the de facto method to scale out such training tasks in a distributed environment. Several research works optimize PS performance and tackle fault-tolerance problems [19, 11, 34] in a CPU-only environment. As GPU-based deep learning frameworks [26, 7, 16] offer a much more cost-effective solution, GPU-based scale-out PS architectures, such as Mariana, are optimized for distributed GPU environments. GeePS overcomes the memory capacity limit of the GPU device by loading only the part of a model (instead of the entire model) that is necessary for computation onto the GPU at a given point, and treats CPU memory as a large data cache.
Different from previously published work, IBM Watson Cognitive Service receives very different types of training data from worldwide customers and returns trained models individually. Consequently, it is imperative to set a conservative hyper-parameter configuration. This requires extremely high communication bandwidth, which renders scale-out solutions infeasible. State-of-the-art scale-up solutions, such as the nccl-based SSGD implementation DPT and the MPI-based ASGD implementation mpiT, incur relatively large communication overhead due to intermediate data copies and execution stalls. Our solution solves these issues by minimizing data copies and making the whole system highly concurrent.
In this paper, we focus on the system design challenges for emerging training-as-a-service (TaaS) workloads. By analyzing the characteristics of representative industrial workloads, we identify that to satisfy diverse customer requirements, a TaaS system needs to choose conservative hyper-parameter setup (e.g., small mini-batch size). We provide both empirical evidence and theoretical justification for such a design choice. We then characterize the communication bandwidth requirement for TaaS workloads and conclude that none of the state-of-the-art solutions can satisfy this requirement.
We present GaDei, a scale-up deep learning framework that maximizes communication bandwidth utilization in a tightly-coupled system. GaDei enables efficient multi-learner training for arbitrary types of neural networks (e.g., CNN, RNN). We further verify the correctness of GaDei's communication protocol. Our evaluation results demonstrate that GaDei significantly outperforms state-of-the-art scale-up solutions on industrial workloads and public workloads, usually by an order of magnitude. In addition, GaDei provides fault-tolerance, which is missing in other scale-up solutions.
-  Author profiling. http://pan.webis.de/clef16/pan16-web/author-profiling.html.
-  Yelp challenge. https://www.yelp.com/dataset_challenge.
-  MPI 3.0 standard. www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, 2012.
-  Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (GA, 2016), USENIX Association, pp. 265–283.
-  Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Hannun, A. Y., Jun, B., Han, T., LeGresley, P., Li, X., Lin, L., Narang, S., Ng, A. Y., Ozair, S., Prenger, R., Qian, S., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, C., Wang, Y., Wang, Z., Xiao, B., Xie, Y., Yogatama, D., Zhan, J., and Zhu, Z. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (2016), pp. 173–182.
-  Bahrampour, S., Ramakrishnan, N., Schott, L., and Shah, M. Comparative study of caffe, neon, theano, and torch for deep learning. CoRR abs/1511.06435 (2015).
-  Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
-  Berger, E. D., Yang, T., Liu, T., and Novark, G. Grace: Safe multithreaded programming for c/c++. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA, 2009), OOPSLA ’09, ACM, pp. 81–96.
-  Bergstra, J., and Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (Feb. 2012), 281–305.
-  Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient primitives for deep learning. CoRR abs/1410.0759 (2014).
-  Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an efficient and scalable deep learning training system. OSDI’14, pp. 571–582.
-  Chintala, S. Multi-gpu imagenet training. https://github.com/soumith/imagenet-multiGPU.torch.
-  Chintala, S. convnet-benchmark. github.com/soumith/convnet-benchmarks, 2016.
-  Cho, K., Van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Doha, Qatar, 2014), Association for Computational Linguistics, pp. 1724–1734.
-  Coffman, E. G., Elphick, M., and Shoshani, A. System deadlocks. ACM Comput. Surv. 3, 2 (June 1971), 67–78.
-  Collobert, R., Kavukcuoglu, K., and Farabet, C. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop (2011).
-  Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Exploiting bounded staleness to speed up big data analytics. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2014), USENIX ATC’14, USENIX Association, pp. 37–48.
-  Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. Geeps: scalable deep learning on distributed gpus with a gpu-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys 2016, London, United Kingdom, April 18-21, 2016 (2016), p. 4.
-  Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In NIPS (2012).
-  Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09 (2009).
-  Feng, M., Xiang, B., Glass, M. R., Wang, L., and Zhou, B. Applying deep learning to answer selection: A study and an open task. In Proceedings of the 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015). (2015), vol. abs/1508.01585.
-  Feng, M., Xiang, B., and Zhou, B. Distributed deep learning for question answering. In The 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) (Indianapolis, IN, USA, October 2016), Association for Computing Machinery.
-  Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. Book in preparation for MIT Press, 2016.
-  Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (2015), D. Blei and F. Bach, Eds., JMLR Workshop and Conference Proceedings, pp. 1737–1746.
-  Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. 5th International Conference on Learning Representations (2016).
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (New York, NY, USA, 2014), MM ’14, ACM, pp. 675–678.
-  Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Baltimore, Maryland, June 2014), Association for Computational Linguistics, pp. 655–665.
-  Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR’17.
-  Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Doha, Qatar, October 2014), Association for Computational Linguistics, pp. 1746–1751.
-  Krizhevsky, A. Learning multiple layers of features from tiny images. Tech. rep., 2009.
-  Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. 2012, pp. 1097–1105.
-  Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (2012), pp. 1097–1105.
-  Lecun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (5 2015), 436–444.
-  Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, Oct. 2014), USENIX Association, pp. 583–598.
-  Li, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2014), KDD ’14, ACM, pp. 661–670.
-  Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization. Neural Information Processing Systems (June 2015).
-  Lu, S., Park, S., Seo, E., and Zhou, Y. Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2008), ASPLOS XIII, ACM, pp. 329–339.
-  Niu, F., Recht, B., Ré, C., and Wright, S. J. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Neural Information Processing Systems (2011).
-  Nvidia. Nccl: Optimized primitives for collective multi-gpu communication. https://github.com/NVIDIA/nccl.
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
-  Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117. Published online 2014; based on TR arXiv:1404.7828 [cs.NE].
-  Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature 529 (2016), 484–503.
-  Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2015).
-  Smola, A., and Narayanamurthy, S. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1-2 (2010), 703–710.
-  Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 2951–2959.
-  Gupta, S., Zhang, W., and Wang, F. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In IEEE International Conference on Data Mining (2016).
-  Torch. Torch data parallel table. https://github.com/torch/cunn/blob/master/DataParallelTable.lua.
-  Wilson, D. R., and Martinez, T. R. The general inefficiency of batch training for gradient descent learning. Neural Netw. 16, 10 (Dec. 2003), 1429–1451.
-  Zagoruyko, S. Vgg cifar. https://github.com/szagoruyko/cifar.torch.
-  Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 685–693.
-  Zhang, W., Gupta, S., Lian, X., and Liu, J. Staleness-aware async-sgd for distributed deep learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016 (2016), pp. 2350–2356.
-  Zou, Y., Jin, X., Li, Y., Guo, Z., Wang, E., and Xiao, B. Mariana: Tencent deep learning platform and its applications. Proc. VLDB Endow. 7, 13 (2014), 1772–1777.