1. Introduction
In the last five years, neural networks and deep architectures have proven very effective in application areas such as computer vision, speech recognition, and machine translation. The recent breakthroughs of AlphaGo further cement interest in employing deep architectures to develop intelligent machines. Although deep architectures such as convolutional neural networks (CNNs) (LeCun et al., 1998; Krizhevsky et al., 2012; Graves et al., 2013), recurrent neural networks (RNNs) (Zaremba et al., 2014; Graves and Jaitly, 2014), and restricted Boltzmann machines (RBMs) (Fischer and Igel, 2012; Krizhevsky et al., 2010) have been around since the 1980s, they have never been under the spotlight. Why are they thriving now? The convincing factor this time around is scale, in both data volume and computation resources.
When the scale of training data is small, all supervised learning algorithms (e.g., decision trees, support vector machines, and logistic regression) achieve the same level of classification accuracy. In 2012, AlexNet (Krizhevsky et al., 2012) demonstrated that with millions of training images from ImageNet (Deng et al., 2009), CNNs substantially outperform all prior works on image classification. Since then it has been shown in several vertical domains that large training datasets can improve the accuracy of classification tasks.
Since the computation complexity of a deep learning algorithm is high (e.g., the convolution stage of CNNs requires a six-level nested loop), the scale of data demands scalable computation resources. Fortunately, processor speed has increased more than a thousandfold in the last three decades. In addition, with specialized arrays of processors (e.g., GPUs) and accessibility of parallel computing infrastructures via the cloud, millions of cores can be utilized simultaneously for training. However, scaling up computation is not merely a matter of throwing in an unlimited number of cores. As Amdahl's law (Amdahl, 1967) states, the nonparallelizable portion of a computation task, such as communication, I/O, and interprocess synchronization, may cap computation speedup. For instance, if the nonparallelizable portion is 50%, reducing the parallelizable computation time to zero achieves only a speedup factor of two. All deep learning frameworks involve substantial nonparallelizable overheads, which must be carefully mitigated to speed up training.
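The cap imposed by Amdahl's law can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from the paper. It shows that when half of the work is nonparallelizable, the speedup approaches but never exceeds two, no matter how many cores are added.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Speedup predicted by Amdahl's law.

    serial_fraction: share of the task that cannot be parallelized (0..1).
    workers: number of parallel workers applied to the remaining share.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without bound."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # With a 50% serial portion the speedup saturates just below 2.
    for n in (1, 2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.5, n), 3))
    print("limit:", amdahl_limit(0.5))
```

The same function can be reused with a measured overhead fraction to judge whether adding more workers is still worthwhile.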
Several open-source projects (e.g., Caffe (Jia et al., 2014), MXNet (Chen et al., 2015), TensorFlow (Abadi et al., 2015), and Torch (Collobert et al., 2011)) have been devoted to speeding up the training of deep networks. Their efforts can be summarized into two approaches: deep-learning algorithm optimization and algorithm parallelization (details of related work are presented in Section 1.1). The former includes using better convolution algorithms, improving stochastic gradient descent (SGD) with faster methods, employing compression/quantization, and tuning the learning rate with advanced optimization techniques. Indeed, most open-source libraries have quickly adopted available state-of-the-art optimizations. However, most users in academia and industry do not know how to set parameters, both algorithmic and system, to conduct cost-effective training. Researchers and professionals face at least the following questions at three levels, namely intra-GPU, inter-GPU, and inter-machine:
What is the bottleneck of speeding up deep learning training by parallelism?

Given the amount of data, what should the size of each minibatch be, and how can GPU utilization be maximized?

How many GPUs should be employed, and how should such a system be configured?

How many parameter servers should be deployed when building a distributed system?
In this work, we aim to answer the above questions by providing system configuration guidelines given the characteristics of the training data (the number of training instances and the size of each training instance), as well as hardware parameters (such as GPU memory size, internal transmission bandwidth, e.g., bus bandwidth, and external transmission bandwidth, e.g., network bandwidth). We identify computation bottlenecks and I/O overheads of representative frameworks. From the insights we observed in benchmarking, we propose guidelines that allow users to configure a high-performance deep learning system for their target tasks.
1.1. Related Work
Since deep-learning training is time-consuming, many previous studies have been devoted to speeding up training. These prior contributions can be divided into two approaches: algorithmic and system. The algorithmic approach accelerates the training algorithm, whereas the system approach focuses on employing improved resources to achieve parallel training. To ensure scalability, the system approach may require enhancing the training algorithm to take full advantage of the increased resources.
1.1.1. Algorithmic Approach
Stochastic gradient descent (SGD) is the de facto optimization algorithm for training a deep architecture. Many SGD techniques have been developed for achieving faster convergence to the global minimum. The settings of hyperparameters such as learning rate and minibatch size are crucial to the training performance. Hinton and Bengio (Hinton, 2010; Bengio, 2012) provide recommendations on setting hyperparameters commonly used in gradient-based training. Batch renormalization can be an effective strategy to train a network with small or non-i.i.d. minibatches (Ioffe, 2017). Momentum-based acceleration schemes increase the speed of learning and damp oscillations in directions of high curvature (Polyak, 1964). Per-parameter adaptive learning rate methods help reduce large gradients and decrease the learning rate over time (Duchi et al., 2011).
More efficient algorithms can improve speed. Convolution accounts for a substantial fraction of the execution time of CNN-based training. Some FFT-based convolution schemes were proposed (Mathieu et al., 2013) to achieve speedup. Additionally, Hadjis et al. proposed three matrix layout schemes using lowering operations (Hadjis et al., 2015). Caffe con Troll implements a CPU-GPU hybrid system that contains several lowering operations and, at the same time, employs a simple automatic optimizer to select the best lowering. Some compression algorithms (Elgohary et al., 2016) are designed for both good compression ratios and fast decompression speed, so that operations such as matrix multiplication can be executed directly on the compressed representations.
1.1.2. System Approach
A deep learning training job consists of two computationally intensive arithmetic operations: matrix multiplication and convolution. A GPU is well-suited for speeding up such operations since they are easy to parallelize. To achieve further speedup, the next logical step is to employ multiple GPUs, and to configure a distributed cluster of CPUs and GPUs. The computation time can be largely reduced via data parallelism and/or model parallelism. Many projects have proven parallelism to be helpful (Chilimbi et al., 2014; Dean et al., 2012; Krizhevsky, 2014; Niu et al., 2011; Iandola et al., 2016; Zhang et al., 2015).
According to Amdahl's law, the peak performance of a parallel architecture is capped by the overhead portion of the computation task. In the context of deep learning, the training overhead includes synchronization between distributed threads, disk I/O, communication I/O, and memory access. To reduce synchronization delay, Zinkevich et al. (Zinkevich et al., 2010) proposed an asynchronous distributed SGD algorithm to guarantee parallel acceleration without tight latency constraints. Chen et al. (Chen et al., 2016) proposed adding backup workers to the synchronous SGD algorithm to mitigate the bottleneck. To reduce the impact of I/O on the overall speedup, most open-source frameworks (see Section 1.1.3) attempt to conceal I/O behind computation via the pipeline approach proposed in (Liu et al., 2011). Such an approach requires a computation unit to be sufficiently long so as to hide I/O overheads as much as possible. The pipeline approach, however, demands carefully setting the unit size of computation (or minibatch size) and the number of parameter servers. We will propose how to best estimate these configuration parameters in Section 3.
1.1.3. Computation Frameworks
There have been several deep learning open-source efforts. Here, we introduce representative frameworks (due to the limited information available, some frameworks, such as CNTK from Microsoft (Dally) and Theano (James et al.), are not covered):
Caffe: Caffe (Jia et al., 2014) is maintained and developed by the Berkeley Vision and Learning Center (BVLC) and has been open source since 2014. Caffe was first designed for vision, and has been adopted and improved by users in several domains, such as speech recognition and robotics. Caffe provides extensible toolkits for state-of-the-art deep learning algorithms. Caffe separates network representation from actual implementation, and supports seamless switching between open-source platforms.

MXNet: MXNet (Chen et al., 2015) is designed for portability (i.e., supporting multiple languages and operating systems), scalability (i.e., running on multiple machines, GPUs, and CPUs), and efficiency. Additionally, MXNet provides a superset programming interface to be compatible with other frameworks. MXNet is lightweight and supports multiple programming languages, e.g., Python, R, Julia, and Scala.

TensorFlow: TensorFlow (Abadi et al., 2015), which supports distributed computation, is an open-source framework developed by Google. TensorFlow's design philosophy is flexibility, portability, and high efficiency. TensorFlow takes computations described using a dataflow model and maps them onto a wide variety of hardware platforms. TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow graph, with many different computational devices all collaborating to update a set of shared parameters or states.

Torch: Torch (Collobert et al., 2011) is designed to make developing and extending numerical algorithms easy. Based on this philosophy, Torch leverages Lua, a fast interpreted language (which also has a fast Just-In-Time (JIT) compiler) that can be embedded in a C application and provides APIs in C, which makes it easy to wrap libraries behind a unified interface to C/C++.
Among the introduced frameworks, MXNet and TensorFlow have built-in support for distributed training. Users can easily develop algorithms that run on computing clusters with thousands of CPUs or GPUs. Several studies give users a glimpse of the factors that they must take into consideration. Bahrampour et al. (Bahrampour et al., 2015) provide a comparative study of different frameworks with respect to extensibility, hardware utilization, and performance. Shi et al. (Shi et al., 2016) provide a performance study of selected frameworks. These works offer practitioners a high-level guideline for selecting an appropriate framework. Given a selected framework, our work aims to provide further configuration guidelines to make training both fast and cost-effective.
1.2. Contribution Summary
In summary, this work makes the following contributions:

Identifying computation bottlenecks and devising their remedies. We benchmark representative networks and datasets to identify the typical bottlenecks of large-scale training. We then devise remedies to reduce or mask computation overheads (I/O and communication) to improve training speed.

Quantifying remedies into an optimization model. We formulate our remedies into an optimization model to determine the optimal minibatch size and carefully balance memory and speed tradeoffs so as to employ the fastest algorithms given the memory constraint.

Recommending distributed configuration involving multiple GPUs and parameter servers. When the workload cannot be handled by a single GPU or machine, we propose lemmas to recommend the number of GPUs and parameter servers to configure so as to achieve cost-effective speedup.
Both real-world deployment and empirical studies attest that our remedies are very effective.
2. Preliminaries
This section presents a typical deep learning training process including performance factors and their relevant parameters. We then show the setup of the evaluation environment.
2.1. Deep Learning Training Process
Figure 1 depicts a general architecture of deep-learning training and its data flow. A local architecture is basically a commodity computer equipped with GPUs. When aiming to improve parallelism via a distributed architecture, a worker and a parameter server can be replicated into multiple copies connected by a network. The minibatch processing pipeline in the training process consists of seven steps. After the model parameters and the data processing pipeline are initialized, the training process repeats until all training data is seen.

Parameter refresh. In distributed training, the latest copy of the model parameters is pulled from the parameter servers at the beginning of each minibatch processing. The parameters are then loaded onto GPU memory. A distributed environment consists of workers and parameter servers for managing shared parameters.

Data loading. A subset of the training instances, called a minibatch, is loaded from persistent storage into main memory.

Data preparation. The loaded instances are transformed into the required input format. These instances may be augmented to mitigate the overfitting problem and enrich sample diversity.

Host to GPU transfer. The minibatch is loaded onto the memory of a GPU. If multiple GPUs are employed, a different minibatch is loaded onto each GPU.

GPU processing. Required computations, including matrix multiplication and convolution, are performed on GPUs to obtain the gradients for the given minibatch.

Parameter update. The delta is derived from the gradients and applied to the previous version of the parameters in main or GPU memory.

Distributed update. The parameter updates are sent to the parameter servers when distributed machines are configured.
Among the seven steps, step 5 (GPU processing) performs computation, and the other steps that cannot be hidden behind it are considered overheads. The larger the fraction of time those overhead steps take, the less effective parallelism can be. Therefore, our tasks are minimizing overhead time and hiding overheads via pipelining as much as possible. The remainder of this paper demonstrates how the relevant parameters can be carefully tuned to achieve such goals. In Section 3.1, we provide a procedure to recommend a minibatch size that leads to maximum training performance. Section 3.2 provides an in-depth analysis of training in a multi-GPU environment. We provide a lemma to estimate the number of GPUs for a desired factor of speedup. Increasing the number of GPUs not only improves speedup, but also induces communication overheads between GPUs. We also discuss how to alleviate the impact of these overheads. In Section 3.3, we address issues involving distributed workers. When the training system scales horizontally, we need an extra cluster to manage the parameters in addition to the training hosts in the distributed environment. The communication between training hosts and parameter servers is an overhead that could seriously degrade training speedup. We propose a scheme to estimate the number of parameter servers given the network bandwidth.
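The effect of hiding overhead steps behind GPU computation can be illustrated with a simple two-stage timing model. This is only a sketch under a strong assumption that is not taken from the paper: all overhead steps of one minibatch are lumped into a single stage that can fully overlap with the computation of another minibatch. The function names are illustrative.

```python
def iteration_time(compute_ms: int, overhead_ms: int, pipelined: bool) -> int:
    """Steady-state wall time per minibatch under a two-stage model.

    Without pipelining the stages run back to back. With pipelining the
    overhead of the next minibatch overlaps with the current computation,
    so the slower of the two stages sets the pace.
    """
    if pipelined:
        return max(compute_ms, overhead_ms)
    return compute_ms + overhead_ms


def exposed_overhead(compute_ms: int, overhead_ms: int, pipelined: bool) -> int:
    """Overhead time that is not hidden behind computation."""
    return iteration_time(compute_ms, overhead_ms, pipelined) - compute_ms


if __name__ == "__main__":
    # Overhead shorter than computation: pipelining hides it completely.
    print(exposed_overhead(100, 40, pipelined=False))  # 40
    print(exposed_overhead(100, 40, pipelined=True))   # 0
    # Overhead longer than computation: part of it remains exposed.
    print(exposed_overhead(100, 150, pipelined=True))  # 50
```

In this model, enlarging the computation unit (for example, the minibatch) helps only up to the point where computation time covers the overhead time.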
2.2. Evaluation Environment
We set up our evaluation environment with the Elastic Compute Cloud (EC2) of Amazon Web Services (AWS). (GPU instances on Google Compute Engine (GCE) do not support GPU peer-to-peer access, and hence we defer our GCE experiments until such support is available.) All experiments run on EC2 P2 instances equipped with NVIDIA Tesla K80 accelerators, each of which contains a pair of NVIDIA GK210 GPUs. Each GPU provides 12 GB of memory and parallel processing cores. The CPU is a customized version of the Intel Broadwell processor. Table 1 shows the hardware configurations of P2 instance types. (p2.16xlarge is not used in our experiments because it does not support full GPU-to-GPU communication, which would introduce one more variable in our multi-GPU experiments.) To avoid unexpected GPU clock rate adjustment in our experiments, we disable the GPU autoboost function.
Table 1. Hardware configurations of P2 instances.
Instance      #GPU  GPU Mem.  Network
p2.xlarge     1     12 GB     High
p2.8xlarge    8     96 GB     10 Gbps
p2.16xlarge   16    192 GB    20 Gbps
We perform experiments and demonstrate our ideas with MXNet and TensorFlow. Virtual machines are launched from the Amazon deep learning AMI (Amazon Machine Image), preloaded with the NVIDIA CUDA toolkit and cuDNN. We conduct experiments on the ILSVRC-2012 dataset, a subset of ImageNet (Deng et al., 2009), stored on SSD. A separate set of labeled images is used as validation data.
3. Configuration of High Performance Training System
We study configuration in three incremental steps, starting from a single GPU, then expanding our benchmarking to multiple GPUs, and finally to distributed nodes where each node consists of multiple GPUs. Each of these three steps focuses on analyzing one system configuration.
In the single GPU study, we analyze how the minibatch size can be decided to achieve fast training speed. Most prior studies only consider tuning the minibatch size algorithmically, that is, selecting a size that can achieve fast convergence. However, taking the minimum number of epochs to reach convergence does not directly translate to the shortest training time. In Section 3.1 we provide a system analysis to determine the minibatch size and solve the optimized minibatch selection with integer linear programming.
As multiple GPUs are employed to conduct training, data movement is the major bottleneck, which caps the speedup according to Amdahl's law. Therefore, to be cost-effective, we should not use more GPUs when speedup improvement has saturated. Section 3.2 presents a systematic procedure to estimate an effective number of GPUs.
When training is conducted in a distributed environment, we further study communication overhead. Section 3.3 depicts the distributed training process and provides a lemma to estimate the required number of parameter servers in a cost-effective system configuration.
3.1. Training on single GPU instance
In this section, we first point out the common performance pitfalls in designing neural networks. We illustrate that the setting of minibatch size is the primary factor that determines training speed. We then formulate selecting the minibatch size as an optimization problem and provide a procedure to solve for the minibatch size that achieves the fastest training speed.
3.1.1. Identifying System Issues
Most neural networks are initially designed according to some heuristics. Researchers may not have the full picture of their model's feasibility, convergence quality, and prediction quality unless they conduct some experiments. During the experimental process, various hyperparameter values may be tested exhaustively by trial and error. In our own experience, it is typically unknown at the beginning how long one round of training will take, let alone how to configure a cost-effective system that maximizes training speed. A suboptimal system configuration can lead to excessive execution time because of the following issues:

Shortage of GPU memory space. A GPU cannot commence computation without the data, including model parameters, gradients, computation workspace, etc., being loaded into GPU memory. A neural network designed without system knowledge may require more memory than is available. This excessive memory use may cause unnecessary thrashing and prolong training time.

Ineffective tradeoff between speed and memory. Deep learning frameworks may execute the operations of a training task using different algorithms, which have different speed and memory-use tradeoffs. Which algorithm to use is a layer-dependent decision. The selection factors include input data size, layer parameters, minibatch size, and available GPU memory space. Consider the convolution operation as an example. An FFT-based algorithm runs faster than a GEMM-based one, but it requires more memory. Training speed may degrade when a large minibatch size exhausts the memory capacity needed to run a faster FFT-based algorithm. Thus, when tuning the factors mentioned above, we should consider the impact on memory consumption because the memory budget affects the selection of algorithm.
Both training convergence and training speed depend on the minibatch size. In other words, to select a good minibatch size, one must examine both the algorithmic and the system aspects. From the algorithmic aspect, it is suggested that the minibatch size be larger than the number of output classes and that a minibatch contain at least one sample from each class (Hinton, 2010). Diversified training data leads to more stable convergence. From the system aspect, a proper minibatch size helps to improve the parallelism inside the GPU and enables a faster implementation of an operator. Starting from the minibatch size suggested by the algorithmic aspect, we introduce the system aspect into deciding the minibatch size.
3.1.2. Choosing Convolution Algorithms
There are two time-consuming operations in deep learning: matrix multiplication and convolution. Parallelizing matrix multiplication is rather straightforward, whereas speeding up convolution involves a tradeoff between memory and speed.
Table 2. The five convolution layers of AlexNet, their parameters, and their FFT/GEMM memory-usage ratios.
Two representative convolution algorithms are GEMM-based (Chetlur et al., 2014) and FFT-based (Mathieu et al., 2013). GEMM-based algorithms convert convolution to a matrix multiplication, which can be slow, but the upside is that it requires less memory space. FFT-based algorithms run faster than GEMM-based ones by using efficient matrix multiplication and reducing the number of floating point operations. However, FFT-based algorithms demand substantially more memory, as the filters are padded to be the same size as the input. In addition, FFT-based algorithms require extra memory space for the feature maps produced by the domain transformation. Table 2 shows the five convolution layers of AlexNet and their memory-usage ratios of FFT over GEMM for a given minibatch size. For the first layer, FFT requires substantially more memory space than GEMM. (The layer parameters comprise the number of pixels of the inputs and outputs at each layer, the depths of the inputs and outputs, and the size of the filters.)
To further understand the impact of the minibatch size, we experimented with MXNet and TensorFlow, and plot system throughput versus minibatch size in Figure 2. Although different frameworks may yield different throughputs, the trend remains the same: the system throughput degrades once the minibatch size reaches a threshold. The throughput drops because MXNet and TensorFlow choose to run a slower convolution algorithm due to the constrained free memory caused by the increased minibatch size. How can the optimal minibatch size be determined? We next formulate this as an optimization problem.
3.1.3. Optimizing Minibatch Size
In order to formulate the problem of determining the minibatch size, we first define a memory constraint, which is built into the later optimization formulation. In our formulation, most of the symbols follow the conventions of (CS2, 2017).
Deriving the memory constraint.
We assume that a CNN such as AlexNet (Krizhevsky et al., 2012) consists of two major components: feature extraction and classification. Further, we assume that the feature extraction part comprises
layers where stacked convolution layers are optionally followed by pooling layers, and the classification part consists of fullyconnected layers. We use and where to represent the sizes of inputs and outputs of convolution layers (or pooling layers), respectively. In particular, the size represents the size of input data. If we take training AlexNet on the ImageNet (Deng et al., 2009) as the example, is equal to . For the layer of convolution and pooling layers, we denote its spatial extent (i.e. the size of filters) as, its stride as
, its amount of padding as , and its number of filters as . Please note that if the layer is a pooling layer, its is equal to zero, i.e. . Thus, the inputs and outputs in the feature extraction part have the following relations:(1)  
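For reference, the standard relation between the input and output sizes of a convolution or pooling layer can be sketched as follows. The parameter names below are placeholders chosen for this sketch rather than the paper's own symbols, and the AlexNet-style numbers in the usage lines (a 227-pixel input, 11-pixel filters, stride 4) are a commonly quoted configuration used here as an assumption.

```python
def conv_output_size(input_size: int, filter_size: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer.

    Uses the usual relation: (input - filter + 2 * padding) / stride + 1.
    Raises if the filter does not tile the padded input evenly.
    """
    span = input_size - filter_size + 2 * padding
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % stride != 0:
        raise ValueError("stride does not evenly divide the padded input span")
    return span // stride + 1


if __name__ == "__main__":
    conv1 = conv_output_size(227, filter_size=11, stride=4)       # 55
    pool1 = conv_output_size(conv1, filter_size=3, stride=2)      # 27
    conv2 = conv_output_size(pool1, filter_size=5, stride=1, padding=2)  # 27
    print(conv1, pool1, conv2)
```

Chaining the function layer by layer, with each output size feeding the next layer's input size, gives the per-layer sizes needed for the memory estimates that follow.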
The memory allocated for the feature extraction part of CNNs includes the input data, the outputs (i.e., feature maps) of all the layers, the model parameters, and the gradients. We assume that all the values are stored as single-precision floating point numbers (32 bits). Based on the aforementioned notations and Equation 1, the memory usage for the input data and outputs of all layers in the feature extraction part can be calculated as follows:
(2) 
Regarding the model parameters, there are two kinds of parameters: weights and biases. Though the biases are often omitted for simplicity in the literature, we take them into account here in order to estimate the memory usage precisely. Besides, we assume that the size of the gradients is twice the size of the model parameters. (For each training instance, we need to store the gradients of all model parameters. The aggregated gradients of all model parameters are also required for a specific batch.) Thus, we can derive the memory usage for the model parameters and their related gradients by the following equation:
(3)  
Furthermore, the memory allocated for the classification part of CNNs contains the outputs of all neurons and model parameters. We use
where to denote the number of neurons at layer. Again, we make the same assumption that the size of the gradients is twice as the size of the model parameters. Therefore, the memory usage for the classification part of CNNs is as follows:(4)  
According to Equations 2 to 4, the memory constraint can be approximately determined by the following equation:
(5) 
where is the total memory of a GPU in terms of bits.
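The memory accounting described above can be sketched in code for the feature extraction part. This is an approximation under stated assumptions, not the paper's exact Equations 2 to 5: inputs are square, every value takes 32 bits, the input data and layer outputs scale with the minibatch size, gradients take twice the space of the parameters (as assumed in the text), and the classification part and algorithm workspace are left out. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

BITS_PER_VALUE = 32  # single-precision floating point, as assumed in the text


@dataclass(frozen=True)
class ConvLayer:
    in_size: int      # input width and height in pixels (square input assumed)
    in_depth: int     # number of input channels
    filter_size: int  # spatial extent of each filter
    stride: int
    padding: int
    num_filters: int  # also the depth of the output

    @property
    def out_size(self) -> int:
        return (self.in_size - self.filter_size + 2 * self.padding) // self.stride + 1

    @property
    def output_values(self) -> int:
        return self.out_size ** 2 * self.num_filters

    @property
    def parameter_values(self) -> int:
        weights = self.filter_size ** 2 * self.in_depth * self.num_filters
        biases = self.num_filters
        return weights + biases


def feature_extraction_bits(layers: Sequence[ConvLayer], batch_size: int) -> int:
    """Approximate memory, in bits, for inputs, outputs, parameters, and gradients."""
    first = layers[0]
    input_values = first.in_size ** 2 * first.in_depth
    per_instance = input_values + sum(layer.output_values for layer in layers)
    parameters = sum(layer.parameter_values for layer in layers)
    # Parameters once, plus gradients assumed to take twice that space.
    return BITS_PER_VALUE * (batch_size * per_instance + 3 * parameters)


def max_batch_size(layers: Sequence[ConvLayer], gpu_memory_bits: int) -> int:
    """Largest minibatch size whose estimated footprint fits in the given memory."""
    fixed = feature_extraction_bits(layers, 0)
    per_instance = feature_extraction_bits(layers, 1) - fixed
    return max(0, (gpu_memory_bits - fixed) // per_instance)
```

Because the estimate is linear in the minibatch size, inverting it as in `max_batch_size` gives a quick upper bound before any profiling is done.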
Deriving the minibatch size.
Assuming that there are kinds of convolution algorithms, and layers in the CNN. (In the case that we have illustrated so far, . Other choices of convolution algorithms can be Winograd minimal convolution algorithm (Lavin and Gray, 2016), Strassen algorithm (Cong and Xiao, 2014), fbfft (Vasilache et al., 2014), etc.) The parameter represents whether the layer uses the convolution algorithm or not. When is evaluated to , it means that the layer uses the algorithm to compute convolution. The value is the time consumption at the layer for the algorithm. The value is the memory consumption at the layer for the algorithm. Thus, the problem of determining can be formulated an optimization problem as follows:
(6)  
where the memory constraint is derived from Equation 5.
Equation 6 is an integer linear programming (ILP) problem (Nemhauser and Wolsey, 1988), which is NP-hard. However, there are several off-the-shelf heuristic methods and libraries (e.g., GLPK (GLP, 2012)) for solving ILP problems. Given a range of minibatch sizes that can attain good accuracy, we can derive the estimated training time for each minibatch size by solving Equation 6. The minibatch size that leads to the minimal training time is then the suggested minibatch size.
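The selection procedure can be sketched with a small exhaustive search standing in for an ILP solver such as GLPK. The sketch makes assumptions that are not confirmed by the text: each layer picks exactly one algorithm, the memory of the chosen algorithms is summed across layers and compared with the budget, and the time of one epoch is the per-minibatch time multiplied by the number of minibatches. The function names and the cost tables are hypothetical, and exhaustive search is practical only for a handful of layers.

```python
from itertools import product
from math import ceil


def choose_algorithms(time_cost, mem_cost, mem_budget):
    """Pick one algorithm per layer to minimize total time within a memory budget.

    time_cost[i][j] and mem_cost[i][j] are the time and memory of algorithm j
    at layer i. Returns (total_time, choice) or None if nothing fits.
    """
    best = None
    options = [range(len(row)) for row in time_cost]
    for choice in product(*options):
        memory = sum(mem_cost[i][j] for i, j in enumerate(choice))
        if memory > mem_budget:
            continue
        total = sum(time_cost[i][j] for i, j in enumerate(choice))
        if best is None or total < best[0]:
            best = (total, choice)
    return best


def best_batch_size(profiles, num_samples):
    """Pick the candidate minibatch size with the smallest estimated epoch time.

    profiles maps a minibatch size to (time_cost, mem_cost, mem_budget), where
    the costs are measured for that minibatch size and the budget is the memory
    left for convolution algorithms. Returns (batch_size, epoch_time, choice).
    """
    best = None
    for batch_size, (time_cost, mem_cost, mem_budget) in sorted(profiles.items()):
        solution = choose_algorithms(time_cost, mem_cost, mem_budget)
        if solution is None:
            continue  # this minibatch size does not fit at all
        per_batch, choice = solution
        epoch_time = per_batch * ceil(num_samples / batch_size)
        if best is None or epoch_time < best[1]:
            best = (batch_size, epoch_time, choice)
    return best
```

In the toy profile used for testing, doubling the minibatch size leaves too little memory for the faster algorithm, so the smaller minibatch wins. This mirrors the throughput drop described for Figure 2.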
3.1.4. Refining Model for Speed
Thus far, we have assumed that a CNN model is given, and we determine the minibatch size and the layer-dependent convolution algorithms to maximize training speed. We can make two further adjustments:

Permit minibatch size reduction. Researchers may need to compromise on a smaller minibatch size if the target one is not feasible or does not deliver acceptable performance under the constraint of GPU memory size. Ghadimi and Lan (Ghadimi and Lan, 2013) show that the convergence rate of SGD on a nonconvex function is bounded by a function of the number of samples seen, i.e., the minibatch size. This can be interpreted to mean that a range of minibatch sizes can deliver similar convergence quality. In Figure 3, the x-axis depicts the epoch number and the y-axis depicts the top-5 validation error rate. (Our top-5 error rate is higher than the one AlexNet achieved in the ILSVRC-2012 competition because we did not perform all the tricks for data augmentation and fine-tuning. We choose a fixed error rate as the termination criterion to demonstrate convergence behavior when minibatch sizes are different.) The figure shows that indeed a range of minibatch sizes enjoy similar convergence quality. Therefore, we could reduce the minibatch size to free up memory space for a faster convolution algorithm and achieve overall speedup.

Permit model adjustment. Suppose that the constrained memory space prevents us from running a faster algorithm. We could adjust the CNN model to free up some memory. For instance, suppose one layer can be sped up ten times by a faster algorithm and another layer only twice. To accommodate running the faster algorithm for the former layer, we could adjust both layers to, e.g., use a larger stride or memory-efficient filters.
3.2. Scale with Multiple GPUs
When one GPU cannot handle the training task in a timely manner, employing multiple GPUs is the next logical step to share the workload and achieve speedup. When multiple GPUs are used and maximal efficiency is achieved, the speedup equals the number of GPUs. Let the system efficiency be a value between 0 and 1. Lemma 3.1 provides the estimated efficiency given the number of GPUs.
Lemma 3.1.
Let denote the total training time, where can be divided into computation time and overhead . Let denote the ratio of overhead or . Suppose the desired efficiency of the system is , where . The efficiency can be estimated as
Proof.
Details of the proof are documented in Appendix A.1. ∎
Lemma 3.1 can be used to estimate system efficiency given the number of GPUs and the overhead ratio, and can also be used to estimate the acceptable overhead ratio given the number of GPUs and the target efficiency. For example, given four GPUs and a target efficiency, the lemma gives the largest ratio of overhead, not hidden behind computation, that can be tolerated.
To estimate the overhead ratio, a practitioner can quickly profile the training program for a couple of epochs. Some frameworks such as MXNet and TensorFlow provide the capability to visualize the execution of a training task, which can be used to derive the ratio. If a computation framework is not equipped with a profiling tool, one can visualize program execution using nvprof. (nvprof only profiles GPU activities, so CPU activities cannot be analyzed.) Suppose a practitioner is asked to achieve a given speedup of a training task and has measured the overhead ratio. According to the lemma, she can determine the GPU system to configure to achieve the performance objective.
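The kind of estimate the lemma supports can be sketched by applying Amdahl's law directly, treating the measured overhead ratio as the nonparallelizable share of single-GPU training time. This derivation is an assumption of the sketch and may differ in form from Lemma 3.1 and its proof in Appendix A.1. Under it, the speedup with G GPUs and overhead ratio R is G / (1 + (G - 1) R), and the efficiency is 1 / (1 + (G - 1) R). The function names and the numbers in the tests are illustrative, not the paper's.

```python
import math


def efficiency(num_gpus: int, overhead_ratio: float) -> float:
    """Fraction of the ideal num_gpus-fold speedup that is actually achieved."""
    return 1.0 / (1.0 + (num_gpus - 1) * overhead_ratio)


def speedup(num_gpus: int, overhead_ratio: float) -> float:
    """Estimated speedup over a single GPU."""
    return num_gpus * efficiency(num_gpus, overhead_ratio)


def max_overhead_ratio(num_gpus: int, target_efficiency: float) -> float:
    """Largest overhead ratio that still meets the target efficiency."""
    if num_gpus < 2:
        raise ValueError("need at least two GPUs for the bound to be meaningful")
    return (1.0 / target_efficiency - 1.0) / (num_gpus - 1)


def gpus_for_speedup(target_speedup: float, overhead_ratio: float) -> int:
    """Smallest number of GPUs whose estimated speedup reaches the target."""
    if target_speedup * overhead_ratio >= 1.0:
        raise ValueError("target speedup is unreachable at this overhead ratio")
    exact = target_speedup * (1.0 - overhead_ratio) / (1.0 - target_speedup * overhead_ratio)
    return max(1, math.ceil(exact))
```

For instance, under this model a measured overhead ratio of 0.1 and a target speedup of 3 call for four GPUs, and asking for a speedup of 10 or more at that ratio is infeasible regardless of the GPU count.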
To evaluate Lemma 3.1, we conduct training on four neural networks to compare the estimated speedup with the actual speedup. Though the estimated overhead ratio is a constant and real-time overheads could be stochastic, Figure 4 shows that in all cases the estimated speedup matches the actual speedup. Therefore, the lemma can be used to estimate the performance gain of using multiple GPUs and to devise a cost-effective training plan including system configuration and parameter settings.
The overall speedup can be improved by reducing computation overheads. We conclude this subsection by providing two overhead reduction suggestions.

Data transfer pipelining. Low throughput of feeding training data is a major bottleneck that degrades the multiGPU training performance as the demand for bus bandwidth for loading data grows with the number of GPUs. Pipelining data loading (I/O) with computation is the effective way to reduce the overhead brought by data preparation. The impact of disk I/O can be further alleviated by using better disk or reducing expensive file operations like seek. Modern frameworks such as TensorFlow and MXNet provide the way to rearrange training samples so that the data can be read in sequentially. The load for decoding and augmenting training data may cause extreme high CPU usage and drags the performance of data provision. The computation intensive jobs should be avoided on CPUs.

Peer-to-peer parameter updates. Synchronizing parameter updates among GPUs, as indicated in step in Figure 1, is another common bottleneck in a multi-GPU training environment. A naive implementation keeps the latest model in main memory, transfers the latest copy to the GPUs at the beginning of batch processing, and aggregates updates from all GPUs. This leads to bus contention and heavy data traffic between main memory and the GPUs under the CUDA programming model. To alleviate this hot-spot issue, the weight updates can be completed via high-speed GPU DMA if the GPUs support peer-to-peer transfer.
If multiple GPUs with low computing overhead still cannot meet the desired performance, distributed training is an option to consider. We discuss this topic in the next subsection.
3.3. Distributed Training
Distributed training has become increasingly important because of the growth of dataset size and model complexity. To effectively orchestrate multiple machines for a training task, the system must provide a way to manage the globally shared model parameters. The parameter server architecture, i.e., a cluster of machines that manages parameters, is widely used to reduce I/O latency for handling parameter updates (Liu et al., 2011; Li et al., 2014). As shown in Figure 1, parameter servers maintain the latest parameter values and serve all workers. The workers retrieve updated parameters from the cluster, complete computation, and then push updates back to the cluster of parameter servers.
Parameter updates can be performed either synchronously or asynchronously. Synchronous updates ensure consistency but suffer from the straggler problem, in which the slowest worker drags down overall performance. Updating parameters asynchronously gains training speed and may not significantly affect training accuracy according to prior studies (Dean et al., 2012). When I/Os can be performed asynchronously, fetching and updating parameters can be hidden behind computation, and hence computation overhead can be mitigated. We assume that an asynchronous update policy is employed.
Let denote the number of parameter servers. How many parameter servers should be configured to hide the computation overhead? We select when can no longer speed up the training task. Before we prove our lemma that derives the most effective , we enumerate two desired subgoals or conditions.
The first subgoal is that the computation duration of a worker should be longer than its communication time with the parameter cluster. In other words, the I/O time between a worker thread and its designated parameter servers is shorter than the computation time of that worker. This condition allows parameters to be prefetched before a new round of computation commences, so the I/O overhead can be hidden behind computation. The second subgoal is to distribute the parameter-update workload evenly among parameter servers. We assume that a dynamic load-balancing policy (e.g., Chang et al., 1998) can be employed to distribute the parameter retrieval and update workload almost evenly among servers.
Lemma 3.2.
Given a round of GPU computation time on a worker, number of workers , network bandwidth , and parameter size , the minimum number of parameter servers required to mask communication I/Os is
Proof.
The total size of communication I/O load generated in a round of pull from and push to parameter servers is . Given that the I/O bandwidth is and the load is evenly distributed among servers, the communication time can be written as . The ideal pipeline case (Liu et al., 2011) is when the I/O time can be hidden behind computation time. Therefore, the I/O time must be smaller than or equal to the computation time . (The parameter update time on a parameter server is ignored because it is relatively small compared with the network transmission time.) We can write the constraint as
(7)
Isolating on the left-hand side of the inequality, we obtain
(8) 
∎
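The steps of this proof can be written out as a short derivation. The sketch below is a reconstruction from the prose, not the paper's original typesetting: the symbols $T_c$ (computation time per round), $N_w$ (number of workers), $B$ (bandwidth available to each parameter server), $S_p$ (parameter size), and $N_{ps}$ (number of parameter servers) are names assumed here for illustration, and the factor 2 assumes that each worker pulls one full copy of the parameters and pushes one full set of updates per round.

```latex
\begin{align*}
  \text{I/O load per round} &= 2\, S_p\, N_w
    && \text{(one pull and one push per worker)} \\
  T_{io} &= \frac{2\, S_p\, N_w}{B\, N_{ps}}
    && \text{(load spread evenly over } N_{ps} \text{ servers)} \\
  \frac{2\, S_p\, N_w}{B\, N_{ps}} &\le T_c
    && \text{(I/O hidden behind computation)} \\
  N_{ps} &\ge \frac{2\, S_p\, N_w}{B\, T_c}
    && \Longrightarrow\quad
    N_{ps}^{\min} = \left\lceil \frac{2\, S_p\, N_w}{B\, T_c} \right\rceil
\end{align*}
```

Under these assumptions the required number of servers grows linearly with the number of workers and the model size, and shrinks as either the bandwidth or the per-round computation time increases.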
Lemma 3.2 suggests a back-of-the-envelope estimate of given two ideal conditions. When the conditions do not hold, more parameter servers should be employed to mask the I/O overhead. The following measures are recommended:

Increase . When the workload cannot be evenly distributed, the computation time should be longer to mask most I/Os. Therefore, a good strategy is to maintain a large . In other words, using a larger mini-batch size when memory capacity permits is helpful. Besides, a larger mini-batch leads to fewer parameter updates and improves overall performance.

Improve . Increasing network bandwidth can reduce I/O time. Insufficient network bandwidth of the communication channel may throttle training performance. Taking AlexNet as an example, pushing parameter updates produces around network traffic, which exceeds the capacity of commonly used Ethernet. Thus, high-speed networking is highly recommended for distributed training.
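The effect of these measures can be explored with a small calculator. The sketch below assumes the bound takes the form ceil(2 · parameter size · workers / (bandwidth · computation time)), which follows from the prose of the proof if each worker pulls and pushes one full parameter copy per round; the function name, parameter names, and all numeric values are hypothetical and are not measurements from the paper.

```python
import math


def min_parameter_servers(compute_seconds, num_workers, bandwidth_bytes_per_s, param_bytes):
    """Estimate the servers needed so that parameter I/O fits within one compute round.

    Assumes one full pull and one full push per worker per round, with the
    load spread evenly across servers (both are idealizations).
    """
    total_bytes = 2 * param_bytes * num_workers
    capacity_per_server = bandwidth_bytes_per_s * compute_seconds
    return max(1, math.ceil(total_bytes / capacity_per_server))


# Illustrative scenario: 250 MB of parameters, 8 workers, 10 Gb/s links.
PARAM_BYTES = 250_000_000
TEN_GBPS = 1_250_000_000  # bytes per second

baseline = min_parameter_servers(0.5, 8, TEN_GBPS, PARAM_BYTES)
longer_compute = min_parameter_servers(1.0, 8, TEN_GBPS, PARAM_BYTES)
faster_network = min_parameter_servers(0.5, 8, 4 * TEN_GBPS, PARAM_BYTES)
```

In this hypothetical scenario the baseline needs seven servers, doubling the computation time per round (for example through a larger mini-batch) lowers the estimate to four, and quadrupling the bandwidth lowers it to two.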
4. Concluding Remarks
In this work, we investigated typical deep learning frameworks running on representative deep learning models and datasets. Through our analyses, we studied the computation bottlenecks in single-GPU, multi-GPU, and distributed configurations. Furthermore, we derived a back-of-the-envelope estimate of the number of GPUs needed to configure a training system, given a budget or deadline. Finally, for distributed training, we suggested a formula for estimating the number of parameter servers to be configured to reduce communication overhead.
AlphaGo showed that more training data can only help improve machine intelligence and competitiveness. Recently, Residual Neural Networks (He et al., 2016; Szegedy et al., 2016) have shown that, in both theory and practice, more layers correlate with higher accuracy of the trained classifier. At a 2016 machine learning workshop (Ng, 2016), Andrew Ng reported that the traditional bias-variance tradeoff has not appeared in training large-scale deep architectures. In other words, the larger the scale, the better suited the architecture is for improving the intelligence of a “machine”.
This “larger the better” conclusion certainly demands that the database and machine learning communities devise data management and data mining systems that can handle an ever-increasing workload. We foresee that not only will algorithmic research continue to flourish, but system research and development will as well. Already we have seen GPU vendors enhancing distributed GPU implementations. Advances in interconnect technology and implementation will help reduce I/O overhead in both data loading and parameter updates.
In this work, we provided practical guidelines to help practitioners configure a system that speeds up training. Our future work will focus on effectively managing such large-scale training systems to achieve both high accuracy and cost-effectiveness in three specific areas:

Flexibility. Prior work (Zheng et al., 2015) provided the flexibility to work with any compatible open-source framework. For example, we expect to work simultaneously with multiple frameworks such as MXNet and TensorFlow to complete a large-scale training task running on Azure, AWS, GCE, and other available commercial clouds.

Scalability and elasticity. In addition to the parameter estimation performed in this work, we will research dynamic schemes to adjust allocation and scheduling parameters according to the dynamic workload nature of distributed systems.

Ease of management. We plan to devise tools with a good user experience for monitoring and managing the training system.
Appendix A Appendices
A.1. Proof of Lemma 3.1
According to Amdahl’s law, given GPUs and the fraction of the execution time of the task that can be parallelized , the theoretical speedup is . The maximum speedup cannot be achieved if parts of the task cannot be parallelized. Thus:
(9) 
can be expressed as:
(10) 
Substituting into equation 9 yields:
(11) 
Then:
(12) 
By rearranging equation 12, can be expressed in terms of and as follows:
(13) 
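For readers who want the algebra spelled out, the following is a sketch of how such a derivation might proceed from Amdahl's law. It is a reconstruction under assumed notation rather than the paper's original equations: $G$ is the number of GPUs, $P$ the parallelizable fraction, $\alpha$ the desired speedup, and $R = 1 - P$ the non-parallelizable overhead fraction; all four symbol names are chosen here for illustration.

```latex
\begin{align*}
  \text{speedup}(G) &= \frac{1}{(1 - P) + P / G}
    && \text{(Amdahl's law)} \\
  \alpha &\le \frac{1}{(1 - P) + P / G}
    && \text{(the target must be attainable)} \\
  P &= 1 - R
    && \text{(overhead fraction)} \\
  \alpha &\le \frac{1}{R + (1 - R) / G} \\
  \frac{1 - R}{G} &\le \frac{1}{\alpha} - R = \frac{1 - \alpha R}{\alpha} \\
  G &\ge \frac{\alpha\,(1 - R)}{1 - \alpha R},
    && \text{valid only when } \alpha R < 1
\end{align*}
```

The condition $\alpha R < 1$ reflects the cap imposed by Amdahl's law: with an overhead fraction $R$, no number of GPUs can deliver a speedup of $1/R$ or more.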
References
 GLP (2012) 2012. GNU Linear Programming Kit. https://www.gnu.org/software/glpk/. (2012).
 CS2 (2017) 2017. CS231n Convolutional neural network for visual recognition. http://cs231n.github.io/. (2017).
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org 1 (2015).
 Amdahl (1967) Gene M Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM, 483–485.
 Bahrampour et al. (2015) Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. 2015. Comparative Study of Deep Learning Software Frameworks. arXiv.org (Nov. 2015). arXiv:1511.06435v3
 Bengio (2012) Yoshua Bengio. 2012. Practical recommendations for gradientbased training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 437–478.
 Chang et al. (1998) Edward Chang, Hector GarciaMolina, and Chen Li. 1998. 2D BubbleUp: Managing Parallel Disks for Media Servers. Technical Report. Stanford InfoLab.
 Chen et al. (2016) Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. 2016. Revisiting Distributed Synchronous SGD. arXiv preprint arXiv:1604.00981 (2016).
 Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
 Chetlur et al. (2014) Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). http://arxiv.org/abs/1410.0759
 Chilimbi et al. (2014) Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System.. In OSDI, Vol. 14. 571–582.
 Collobert et al. (2011) Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A matlablike environment for machine learning. EPFLCONF192376 (2011).
 Cong and Xiao (2014) Jason Cong and Bingjun Xiao. 2014. Minimizing computation in convolutional neural networks. In International Conference on Artificial Neural Networks. Springer, 281–290.
 Dally (Dally) W J Dally. CNTK: An Embedded Language for Circuit Description, Dept. of Computer Science, California Institute of Technology, Display File.
 Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, and others. 2012. Large scale distributed deep networks. (2012), 1223–1231.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. 2009. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
 Elgohary et al. (2016) Ahmed Elgohary, Matthias Boehm, Peter J Haas, Frederick R Reiss, and Berthold Reinwald. 2016. Compressed linear algebra for largescale machine learning. Proceedings of the VLDB Endowment 9, 12 (2016), 960–971.

 Fischer and Igel (2012) Asja Fischer and Christian Igel. 2012. An introduction to restricted Boltzmann machines. In Iberoamerican Congress on Pattern Recognition. Springer, 14–36.
 Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23, 4 (2013), 2341–2368.
 Graves and Jaitly (2014) A Graves and N Jaitly. 2014. Towards EndToEnd Speech Recognition with Recurrent Neural Networks. ICML (2014).
 Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. arXiv.org (March 2013). arXiv:1303.5778v1
 Hadjis et al. (2015) Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher Ré. 2015. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning. arXiv.org (April 2015). arXiv:1504.04343v2
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. (2016), 770–778.
 Hinton (2010) Geoffrey Hinton. 2010. A practical guide to training restricted Boltzmann machines. Momentum 9, 1 (2010), 926.
 Iandola et al. (2016) Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. 2016. FireCaffe  NearLinear Acceleration of Deep Neural Network Training on Compute Clusters. CVPR (2016), 2592–2600.
 Ioffe (2017) Sergey Ioffe. 2017. Batch Renormalization: Towards Reducing Minibatch Dependence in BatchNormalized Models. arXiv.org (Feb. 2017). arXiv:1702.03275v1
 James, Olivier, Frédéric, Pascal, and Razvan (James et al.) B James, B Olivier, B Frédéric, L Pascal, and P Razvan. Theano: a CPU and GPU math expression compiler.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22Nd ACM International Conference on Multimedia. 675–678.
 Krizhevsky (2014) Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
 Krizhevsky et al. (2010) Alex Krizhevsky, Geoffrey E Hinton, and others. 2010. Factored 3way restricted boltzmann machines for modeling natural images. (2010), 621–628.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf
 Lavin and Gray (2016) Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
 LeCun et al. (1998) Y LeCun, L Bottou, and Y Bengio. 1998. Gradientbased learning applied to document recognition. Proc. IEEE 86, 11, 2278–2324.
 Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and BorYiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. OSDI (2014).
 Liu et al. (2011) Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. 2011. PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Trans. Intell. Syst. Technol. 2, 3, Article 26 (May 2011), 18 pages. https://doi.org/10.1145/1961189.1961198
 Mathieu et al. (2013) Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast Training of Convolutional Networks through FFTs. CoRR abs/1312.5851 cs.CV (2013).

 Nemhauser and Wolsey (1988) George L Nemhauser and Laurence A Wolsey. 1988. Integer programming and combinatorial optimization. Wiley, Chichester. GL Nemhauser, MWP Savelsbergh, GS Sigismondi (1992). Constraint Classification for Mixed Integer Programming Formulations. COAL Bulletin 20 (1988), 8–12.
 Ng (2016) Andrew Y. Ng. 2016. The Nuts and Bolts of Machine Learning. (2016). https://nips.cc/Conferences/2010/Schedule?showEvent=1986 NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
 Niu et al. (2011) Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J Wright. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. arXiv preprint arXiv:1106.5730 (2011).
 Polyak (1964) Boris T Polyak. 1964. Some methods of speeding up the convergence of iteration methods. U. S. S. R. Comput. Math. and Math. Phys. 4, 5 (1964), 1–17.
 Shi et al. (2016) Shaohuai Shi, Qiang Wang, Pengfei Xu, and Xiaowen Chu. 2016. Benchmarking StateoftheArt Deep Learning Software Tools. arXiv.org (Aug. 2016). arXiv:1608.07249v5
 Szegedy et al. (2016) Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inceptionv4, InceptionResNet and the Impact of Residual Connections on Learning. CoRR abs/1602.07261 (2016). http://arxiv.org/abs/1602.07261
 Vasilache et al. (2014) Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
 Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent Neural Network Regularization. arXiv.org (Sept. 2014). arXiv:1409.2329v5
 Zhang et al. (2015) Hao Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, and Eric Xing. 2015. Poseidon: A system architecture for efficient gpubased deep learning on multiple machines. arXiv preprint arXiv:1512.06216 (2015).
 Zheng et al. (2015) Z Zheng, W Jiang, G Wu, and E Y Chang. 2015. SpeeDO: Parallelizing stochastic gradient descent for deep convolutional neural network. NIPS Workshop on Learning Systems (2015).
 Zinkevich et al. (2010) Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Parallelized stochastic gradient descent. (2010), 2595–2603.