Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

03/27/2019 ∙ by Ruben Mayer, et al. ∙ Technische Universität München

Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the growing availability of vast amounts of training data. To keep improving the performance of DL, increasing the scalability of DL systems is necessary. In this survey, we perform a broad and thorough investigation of challenges, techniques and tools for scalable DL on distributed infrastructures. This incorporates infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight future trends in DL systems that deserve further research.




1. Introduction

Deep Learning (DL) has recently gained a lot of attention due to its superior performance in tasks like speech recognition (6296526, ; Huang:2014:HPS:2541883.2500887, ), optical character recognition (Borisyuk:2018:RLS:3219819.3219861, ), and object detection (lecun2015deep, ). The application of DL holds tremendous potential in numerous areas like medical image analysis (e.g., breast cancer metastases detection) (LITJENS201760, ), machine translation (johnson2017google, ), image restoration (e.g., automatically colorizing grayscale images) (Iizuka:2016:LCJ:2897824.2925974, ), image captioning (i.e., creating a description of an image) (Hossain:2019:CSD:3303862.3295748, ), and as agents in reinforcement learning systems that map state-action pairs to expected rewards (8103164, ). In DL, a network of mathematical operators is trained with classified data sets until the weights of the model are ready to make correct predictions on new, unclassified data. Major companies and open source initiatives have developed powerful DL frameworks such as TensorFlow (199317, ) and MXNet (mxnet, ) that automatically manage the execution of large DL models developed by domain experts.

One of the driving factors of the success of DL is the scale of training in three dimensions. The first dimension of scale is the size and complexity of the models themselves. Starting from simple, shallow neural networks, new breakthroughs in model accuracy were achieved with increasing depth and more sophisticated model architectures (cirecsan2010deep, ; dean2012large, ). The second dimension of scale is the amount of training data. The model accuracy can, to a large extent, be improved by feeding more training data into the model (4804817, ; DBLP:journals/corr/abs-1712-00409, ). In practice, it is reported that tens to hundreds of terabytes (TB) of training data are used in the training of a DL model (186212, ; 8327042, ). The third dimension is the scale of the infrastructure. The availability of programmable, highly parallel hardware, especially graphics processing units (GPUs), is a key enabler for training large models with a lot of training data in a short time (cirecsan2010deep, ; Zhang:2017:PEC:3154690.3154708, ).

Our survey is focused on challenges that arise when managing a large, distributed infrastructure for DL. Hosting a large number of DL models that are trained with large amounts of training data is challenging. This includes questions of parallelization, resource scheduling and elasticity, data management and portability. This field is now in rapid development, with contributions from diverse research communities such as distributed and networked systems, data management, and machine learning. At the same time, we see a number of open source DL frameworks and orchestration systems emerging (199317, ; mxnet_learningsys2016, ; Peng:2018:OED:3190508.3190517, ; 222611, ). In this survey, we bring together, classify and compare the huge body of work on distributed infrastructures for DL from the different communities that contribute to this area. Furthermore, we provide an overview and comparison of the existing open-source DL frameworks and tools that put distributed DL into practice. Finally, we highlight and discuss open research challenges in this field.

1.1. Complementary Surveys

There are a number of surveys on DL that are complementary to ours. Deng (deng_2014, ) provides a general survey on DL architectures, algorithms and applications. LeCun et al. provide a general overview of DL (lecun2015deep, ). Schmidhuber (SCHMIDHUBER201585, ) provides a comprehensive survey on the history and technology of DL. Pouyanfar et al. (Pouyanfar:2018:SDL:3271482.3234150, ) review current applications of DL. Luo (Luo2016, ) provides a review on hyper-parameter selection strategies in ML training, including training of neural networks. Those surveys cover general techniques of DL, but are not focused on scalability and distributed systems for DL.

Ben-Nun and Hoefler (DBLP:journals/corr/abs-1802-09941, ) provide an analysis of concurrency in parallel and distributed DL training. Chen and Lin (6817512, ) provide a survey on DL challenges and perspectives with regard to Big Data (i.e., high data volumes, variety and velocity). Erickson et al. (Erickson2017, ) provide a short overview of DL frameworks. Our survey takes a much broader view on distributed DL systems. In particular, we include topics such as resource scheduling, multi-tenancy and data management. Those aspects of scalable DL systems become particularly important when dealing with large models and huge amounts of training data in a shared cluster or cloud environment. Furthermore, we analyze current open-source DL frameworks and tools in depth and relate them to the research on parallel and distributed DL training. This has not been done in the existing surveys. Pouyanfar et al. (Pouyanfar:2018:SDL:3271482.3234150, ) analyze and compare DL frameworks, but not with regard to parallelization and distribution.

1.2. Structure of the Survey

We structure our survey as follows. In Section 2, we introduce DL and provide the foundations for the further discussion of DL systems. In Section 3, we discuss the challenges and techniques of scalable DL in detail. We cover four important aspects: Distributed infrastructures, parallelization of DL training, resource scheduling and data management. In Section 4, we analyze and compare 11 open source DL frameworks and tools that put scalable DL into practice. Finally, in Section 5, we conclude this survey and provide an outlook on current trends and open problems in the field that deserve further research.

2. Foundations

In this section, we introduce fundamental concepts and notations that are relevant throughout the survey.

2.1. Context of Deep Learning

Artificial intelligence (AI) has been a long held vision of building and programming computers in such a way that they can independently (i.e., without human involvement) solve complex problems (searle1980minds, ; qai-book, ). In the most recent past, immense practical achievements of AI have been made in many different fields, such as knowledge reasoning (NIPS2013_5028, ), planning (NIPS2017_7055, ), natural language processing (8416973, ), computer vision (Szegedy_2016_CVPR, ), and robotics (doi:10.1177/0278364914549607, ). Among the methods developed in AI research are cybernetics, symbolic and sub-symbolic, and statistical machine learning (ML). Deep Learning (DL) is a specific approach of ML, which deals with the training of deep neural networks. The relationship between AI, ML and DL is visualized in Figure 1.

Figure 1. Relationship between AI, ML and DL.
Figure 2. Schematic of a multi-layer perceptron (MLP).

2.2. Deep Neural Networks

A neural network (NN) is a network of interconnected artificial neurons. An artificial neuron is a mathematical function that transforms a set of input signals to an output signal. In doing so, it builds a weighted sum of the input signals and then applies an activation function (or transfer function) on it. Many different activation functions have been proposed, such as a step function, linear combination, sigmoid, or the rectifier function (rectified linear unit (ReLU)) (Nair:2010:RLU:3104322.3104425, ). When the activation function is a step function, a threshold needs to be defined. Additionally, a bias can be added, i.e., a fixed input signal that is not the output signal of a preceding neuron. By layering the neurons and connecting them from an input layer to an output layer, the overall network represents a function $f$ that maps the input signals that go into the input layer (layer $1$) to an output signal that leaves the output layer (layer $n$). This function is a composition of the functions $f_i$ of the single layers $i$: $f(x) = f_n(\dots f_2(f_1(x)))$.
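As a rough illustration of the neuron and layer composition described above, the following sketch evaluates a tiny fully-connected network in plain Python. The network shape (2-3-1), the weights, and the helper names `neuron`, `layer` and `forward` are made up for this example and do not come from any framework.

```python
import math

def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the input signals plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer(inputs, weight_rows, biases, activation):
    """A fully-connected layer: one neuron per row of weights."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

def forward(x, layers):
    """Apply the layer functions one after another, from input to output layer."""
    signal = x
    for weight_rows, biases, activation in layers:
        signal = layer(signal, weight_rows, biases, activation)
    return signal

# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
network = [
    ([[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]], [0.0, -0.5, 0.1], relu),
    ([[1.0, -2.0, 0.5]], [0.2], sigmoid),
]
output = forward([1.0, 2.0], network)
print(output)
```

The loop in `forward` is the composition of the per-layer functions: the output of one layer is the input of the next.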

The goal of $f$ is to approximate a target function $f^*$, e.g., a classifier $y = f^*(x)$ that maps an input $x$ to a category $y$. The problem tackled by the training process is to adjust the set of parameters $\theta$, i.e., the weights, biases and thresholds, in all of the artificial neurons in such a way that the output of $f$ approximates the output of $f^*$ with the best possible accuracy. To do so, the network is fed with (possibly noisy) training samples $x$ that are attached with labels $y$, such that $y \approx f^*(x)$. In a training step, a sample $x$ is fed into the neural network's input layer, and an output signal $\hat{y} = f(x; \theta)$ is produced. The deviation of $\hat{y}$ from the given label $y$ is the error or loss of the network. To compute that deviation, given $\hat{y}$ and $y$, a loss function $L(\hat{y}, y)$ is applied (AAAI1714759, ), such as mean squared error, cross-entropy, or semantic loss (pmlr-v80-xu18h, ). The goal of training is to adjust the parameters such that the loss is minimized. To do so, the gradient of the loss function is computed w.r.t. the weights. By applying the chain rule of differentiation, the gradient for each layer can be computed from the output layer to the input layer. This procedure is known as back-propagation (rumelhart1986learning, ). There are different gradient descent algorithms applied in DL, such as stochastic gradient descent (SGD) (bottou1991stochastic, ), AdaGrad (Duchi:2011:ASM:1953048.2021068, ), Adam (DBLP:journals/corr/KingmaB14, ), Nadam (dozat2016incorporating, ) and others. A detailed review on gradient descent algorithms is provided by Ruder (DBLP:journals/corr/Ruder16, ). In the training process, instead of single training samples, mini-batches of training data are used in each iteration. This has the advantage of increased parallelism in the training process: The output of the network can be computed for a whole batch of training samples in parallel. However, choosing too large mini-batch sizes may deteriorate the model accuracy and increase the memory footprint of the training process (DBLP:journals/corr/abs-1804-07612, ). The parameters of the training process itself, i.e., the loss function, gradient descent algorithm, activation function, step size, and size of the mini-batches, are called hyper-parameters.
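To make the training loop more concrete, here is a minimal sketch of mini-batch SGD on a one-neuron linear model with mean squared error. The model, the synthetic noise-free data, and the hyper-parameter values (`step_size`, `batch_size`, `epochs`) are assumptions chosen for illustration; a real DL framework would compute the gradients by back-propagation through many layers.

```python
import random

def minibatch_sgd(samples, labels, step_size=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b by minimizing mean squared error with mini-batch SGD."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * samples[i] + b) - labels[i]  # prediction minus label
                grad_w += 2.0 * error * samples[i]        # chain rule: d(error^2)/dw
                grad_b += 2.0 * error                     # chain rule: d(error^2)/db
            # One parameter update per mini-batch, using the averaged gradient.
            w -= step_size * grad_w / len(batch)
            b -= step_size * grad_b / len(batch)
    return w, b

# Synthetic, noise-free data generated from y = 3x - 1 (illustration only).
xs = [i / 10.0 for i in range(-10, 11)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

The inner loop over a mini-batch is the part that can run in parallel, since the per-sample gradients are independent of each other.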

2.3. Neural Network Architectures

The simplest way of organizing a DNN is by using multiple fully-connected layers of neurons, i.e., each neuron in a layer is connected to each neuron in the subsequent layer. This architecture is also referred to as multi-layer perceptron (MLP). However, MLPs have limitations (726791, ; graves2014towards, ). First of all, MLPs have a large number of weights, which requires a large number of training samples and occupies a large amount of memory. Second, MLPs are not robust against geometric translations and local distortions of the inputs. For instance, in the detection of hand-written digits from images, the same digit will be written slightly differently in different images (726791, ). Third, MLPs are agnostic to the topology of the input, i.e., the order of the input signals is not taken into account. However, in many cases, there is a local structure in the input data. For instance, in images, pixels that are nearby are likely to be correlated (726791, ), and in speech recognition, previous and future context of the input data is particularly relevant to detect a spoken word (graves2014towards, ). To overcome the shortcomings of MLPs, more sophisticated neural network architectures have been proposed. Here, we briefly review the most prominent ones.

Convolutional neural networks (CNNs) (726791, ) introduce convolutional layers and sub-sampling layers. Different from fully-connected layers as in MLPs, convolutional layers are only connected to sub-areas of their respective previous layers, pursuing the concept of local receptive fields which is inspired by biology (hubel1962receptive, ). A convolutional layer is composed of multiple planes, where in each plane, all neurons share the same weights (weight sharing). Finally, convolutional layers alternate with sub-sampling layers to reduce the spatial resolution of the feature map. Besides feed-forward networks (where the output of neurons does not loop back to their own input), loop-backs are useful for many use-cases. For instance, in natural language processing, the meaning of one word in a sentence may depend on the meaning of a previously seen word in the same (or even a previous) sentence. To model such phenomena in DL networks, recurrent neural networks (RNNs) have been proposed. Long short-term memory (LSTM) units are special units of an RNN that overcome issues of exploding or vanishing gradients when training RNNs (doi:10.1162/neco.1997.9.8.1735, ). Autoencoders (Hinton504, ) are NNs which are used in order to learn efficient encodings (i.e., compressed representations) that extract significant features from the training data. Their architecture consists of an encoder, a code, and a decoder, each consisting of layers of neurons, where the output layer of the network has the same number of neurons as the input layer, but the code, which is exactly between encoding and decoding layers, has much fewer neurons. In generative adversarial networks (GANs) (goodfellow2014generative, ), two NNs are trained against each other, namely, a generative and a discriminative NN. Another recent architecture of NNs is graph neural networks (DBLP:journals/corr/abs-1901-00596, ), where graph-structured representations are learned, as opposed to representations in the Euclidean space (as in CNNs).
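The ideas of local receptive fields, weight sharing and sub-sampling in CNNs can be sketched in one dimension. The example below is a simplification under stated assumptions: it uses a 1-D signal instead of an image, a hand-picked two-element kernel, and max sub-sampling over non-overlapping windows (other sub-sampling schemes, such as averaging, are equally possible).

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide one shared kernel over the input (weight sharing, local receptive fields)."""
    width = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]

def max_subsample(feature_map, size=2):
    """Reduce resolution by taking the maximum over non-overlapping windows."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]

signal = [0, 0, 1, 1, 1, 0, 0, 0, 1]
edge_kernel = [-1, 1]                 # responds to a rising edge in the signal
feature_map = conv1d(signal, edge_kernel)
pooled = max_subsample(feature_map)
print(feature_map, pooled)
```

The convolutional layer here has only two weights regardless of the input length, whereas a fully-connected layer producing the same eight outputs from nine inputs would need 72 weights.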

3. Distributed Deep Learning

Training large DL models with vast amounts of training data is a non-trivial task. Often, it is performed in a distributed infrastructure of multiple compute nodes, each of which may be equipped with multiple GPUs. This brings a number of challenges. First of all, the processing resources must be effectively used, i.e., one must avoid stalling of costly GPU resources due to communication bottlenecks. Second, the compute, storage and network resources are typically shared among different users or training processes to reduce costs and provide elasticity (i.e., the cloud computing paradigm (Armbrust:2010:VCC:1721654.1721672, )). To tackle those challenges in DL, research at the intersection of computing systems and DL is receiving growing attention (186212, ; 199317, ; Cui:2016:GSD:2901318.2901323, ; Jeong:2018:IED:3190508.3190530, ; Peng:2018:OED:3190508.3190517, ; 222611, ). This becomes evident with new workshops and conferences arising which particularly focus on DL/ML systems research, such as the Conference on Systems and Machine Learning (SysML). However, established communities such as the data management community are also turning their attention toward DL/ML systems (Wang:2016:DMD:3003665.3003669, ; Kumar:2017:DMM:3035918.3054775, ). In this section, we discuss the main directions of DL systems research in depth. We introduce the main research challenges, discuss state-of-the-art approaches, and analyze open research problems that deserve further attention.

Figure 3. Distributed Deep Learning: Overview.

Section Overview

Figure 3 provides an overview of the topics addressed in this section. On the lowest level, we address the infrastructure used in large DL systems in Section 3.1. This regards recent trends in the hardware being used, networking architectures, as well as low-level software architecture for DL systems. On a higher level, we discuss challenges and solutions with regard to parallelization of DL training in Section 3.2. This regards general parallelization methods such as model and data parallelism, as well as issues in the synchronization of parallel training processes and the optimization of communication in parallel training. To map the components of a parallel DL system to the infrastructure, scheduling is applied. In Section 3.3, we discuss the scheduling problem in single-tenant as well as multi-tenant scenarios. One of the big challenges of large-scale DL is the size of training data and DL models that need to be maintained. In Section 3.4, we discuss challenges and approaches of data management in DL.

3.1. Distributed Infrastructures

To understand the challenges on parallelization, scheduling and data management for DL, we first take a deeper look at the distributed infrastructure on which DL training is performed. We divide the existing work into two categories: Hardware innovations and data-center scale infrastructure applied to real DL workloads.

3.1.1. Hardware Components for DL

While early DL deployments were based on clusters of multi-core CPUs, scalability limitations pushed the efforts toward exploiting highly parallel hardware, and even developing special-purpose hardware dedicated to DL training and serving. The performance benefits of GPUs compared to CPUs depend on many factors, such as whether the job is processing-bound or memory-bound, the efficiency of the implementation, as well as the hardware itself (Lee:2010:DGV:1815961.1816021, ). Both CPU and GPU hardware evolve at a fast pace, which makes comparisons difficult and short-lived. Nevertheless, state-of-the-art infrastructures for DL typically comprise GPUs to accelerate the training and inference process. Hardware vendors offer specialized servers and even workstations for DL, such as NVIDIA DGX station (dgx, ).

Besides GPU-centric DL, other forms of hardware acceleration have been proposed, such as field-programmable gate arrays (FPGAs) (accelerating-deep-convolutional-neural-networks-using-specialized-hardware, ). Tensor processing units (TPUs) are application-specific integrated circuits (ASICs) developed by Google that speed up DL training and inference significantly (8192463, ). TPUs are proprietary and not commercially available, but can be rented via the Google cloud services.

Besides such more traditional forms of computing architectures that follow the von-Neumann architecture by separating memory and processing units, there are research efforts to develop novel in-memory computing architectures (also called neuromorphic hardware (boybat2018neuromorphic, )). Those efforts are inspired by the physiology of the brain, which is very different from the way traditional von-Neumann computing architectures work. However, as of today, neuromorphic hardware architectures are still in the experimental stage and not widely available.

Some papers have highlighted the need for efficient implementations of DL kernels, e.g., by exploiting SIMD (single instruction, multiple data) instructions (37631, ; Lee:2010:DGV:1815961.1816021, ) and awareness of non-uniform memory access (NUMA) (Roy:2018:NND:3212710.3199605, ). This raises the need for re-usable, optimized kernel implementations of the most relevant operations in DNN training. One of the major GPU-specific libraries is cuDNN, a library with DL primitives for GPUs (DBLP:journals/corr/ChetlurWVCTCS14, ). The NVIDIA Collective Communications Library (NCCL) (nccl, ) provides multi-GPU and multi-node communication primitives and is optimized for PCIe and NVLink high-speed interconnects. DL frameworks often incorporate such low-level libraries to fully exploit the capabilities of the hardware infrastructure.

3.1.2. Large-scale Infrastructure for DL

A large-scale DL infrastructure is composed of many inter-connected hardware components that together build a warehouse-scale computer (Barroso:2009:DCI:1643608, ). In this subsection, we review current infrastructures as described by organizations that perform very large DL jobs, such as Facebook, Google, and Microsoft, as well as academic research.

Facebook describes its ML/DL infrastructure in a recent paper (8327042, ). They use both CPUs and GPUs for training, and rely on CPUs for inference. To do so, they build specialized CPU-based and GPU-based compute servers to serve their specific needs of training and inference. For training, GPUs are preferred, as they perform better; however, in their data centers, they have abundant capacities of readily-available CPUs, especially during off-peak hours, which they also exploit. For inference, they rely on CPUs, as GPU architectures are optimized for throughput over latency, but latency is a critical factor in inference. Interestingly, for inter-connecting training servers in distributed, data-parallel training, they rely on 50G Ethernet, and forego using specialized interconnects such as RDMA or NCCL (nccl, ).

Similarly to Facebook, Tencent employs a heterogeneous infrastructure with both CPUs and GPUs. Their deep learning system Mariana (Zou:2014:MTD:2733004.2733082, ) consists of three different frameworks that are optimized for different infrastructures and use cases.

Adam is a large-scale distributed system for DL at Microsoft (186212, ). It relies on a large number of commodity hardware CPU-servers to perform DL training. Besides many system-level optimizations, one of the hardware-centric features of Adam is that they partition DL models in such a way that the model layers fit in the L3 cache to improve training performance.

The paper on TensorFlow (199317, ), a scalable ML framework developed by Google, provides some insights into the infrastructure at Google. Overall, Google follows a different approach from Facebook and Microsoft when it comes to the DL infrastructure. First of all, they employ TPUs, which are custom ASICs, as opposed to only using commercial-off-the-shelf (COTS) hardware. Second, they exploit specialized interconnects and use multiple communication protocols, such as gRPC over TCP and RDMA over Converged Ethernet (RoCE), a network protocol that supports Ethernet as the underlying protocol for remote direct memory access (RDMA). Distributed TensorFlow supports communication via the message passing interface (MPI) (DBLP:journals/corr/VishnuSD16, ).

In academic research, exploiting high-performance computing (HPC) infrastructures for DL training is a topic of increasing importance. Coates et al. (Coates:2013:DLC:3042817.3043086, ) report using a cluster of 16 servers, each equipped with 2 quad-core CPUs and 4 GPUs, interconnected by Infiniband. Different from Ethernet, Infiniband has high throughput and, more importantly, extremely low end-to-end latency (in the order of microseconds). Ben-Nun and Hoefler (DBLP:journals/corr/abs-1802-09941, ) also observe a trend to move towards HPC infrastructures in DL research.

Summing up, large-scale infrastructures in real-world deployments are highly heterogeneous. They do not only comprise GPU servers, but commonly also CPUs. Overall, we see a certain dominance of COTS hardware, as is also the case in other Big Data analytics workloads, such as batch processing (Dean:2008:MSD:1327452.1327492, ) and graph processing (Malewicz:2010:PSL:1807167.1807184, ). However, custom hardware and HPC infrastructure are also used, especially at Google and in academic research. In HPC infrastructures, we observe that the DL systems are specialized toward the target infrastructures to increase performance, e.g., regarding the communication protocols like RDMA, NCCL, and MPI.

Figure 4. Data parallelism.
Figure 5. Model parallelism.

3.2. Parallelization

Parallelization and distribution are essential in order to train large DL models with huge amounts of training data. We organize this section as follows. In Section 3.2.1, we introduce and discuss parallelization methods in DL, which are data parallelism, model parallelism, pipeline parallelism and hybrid forms. Among these, data parallelism has obtained most attention in research and industry. We thus discuss research challenges and optimizations for data-parallel DL in more detail in Section 3.2.2.

3.2.1. Parallelization Methods

DL comes with many possibilities for parallelization. Here, we introduce the three predominant parallelization methods in DL, namely data, model and pipeline parallelism, as well as hybrid forms of parallelism.

Data Parallelism

In data parallelism, a number of workers (machines or devices, e.g., GPUs) each load an identical copy of the DL model (cf. Figure 4). The training data is split into non-overlapping chunks and fed into the model replicas of the workers for training. Each worker performs the training on its chunk of training data, which leads to updates of the model parameters. Hence, the model parameters between the workers need to be synchronized. There are many challenges in the problem of parameter synchronization. We discuss those challenges and state-of-the-art approaches to tackle them in Section 3.2.2.

The main advantage of data parallelism is that it is applicable to any DL model architecture without further domain knowledge of the model. It scales well for operations that are compute-intensive, but have only few parameters, such as CNNs. However, data parallelism is limited for operations that have many parameters, as the parameter synchronization becomes the bottleneck (DBLP:journals/corr/abs-1807-05358, ; DBLP:journals/corr/Krizhevsky14, ). This problem could be alleviated by using larger batch sizes; however, this increases data staleness on the workers and leads to poor model convergence. A further limitation of data parallelism is that it does not help when the model size is too large to fit on a single device. It is worth noting that in many data parallel training schemes, it is assumed or required that the training data is independent and identically distributed (i.i.d.), so that parameter updates computed by the parallel workers can simply be summed up in order to compute the new global model parameters (7239545, ).
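The following sketch simulates one synchronous data-parallel training step in a single process. It is only meant to show why summing the per-worker updates works: for a loss that is a sum over samples, the summed gradients of the non-overlapping chunks equal the gradient over the full data. The linear model, the data, and the function names are assumptions for illustration, and the "workers" are simulated by a plain loop.

```python
def local_gradient(w, b, chunk):
    """Sum of squared-error gradients over one worker's chunk of (x, y) pairs."""
    grad_w = grad_b = 0.0
    for x, y in chunk:
        error = (w * x + b) - y
        grad_w += 2.0 * error * x
        grad_b += 2.0 * error
    return grad_w, grad_b

def split_chunks(data, num_workers):
    """Partition the data into non-overlapping chunks, one per worker."""
    return [data[i::num_workers] for i in range(num_workers)]

def data_parallel_step(w, b, data, num_workers, step_size):
    """One synchronous step: identical replicas, disjoint chunks, summed updates."""
    chunks = split_chunks(data, num_workers)
    # Every worker evaluates the same replica (w, b) on its own chunk only.
    grads = [local_gradient(w, b, chunk) for chunk in chunks]
    # Synchronization: sum the per-worker gradients, then average over all samples.
    total_w = sum(g[0] for g in grads)
    total_b = sum(g[1] for g in grads)
    n = len(data)
    return w - step_size * total_w / n, b - step_size * total_b / n

# Synthetic data from y = 2x + 0.5 (illustration only).
data = [(x / 4.0, 2.0 * (x / 4.0) + 0.5) for x in range(-8, 9)]
parallel = data_parallel_step(0.0, 0.0, data, num_workers=4, step_size=0.1)
serial = data_parallel_step(0.0, 0.0, data, num_workers=1, step_size=0.1)
print(parallel, serial)
```

Up to floating-point rounding, the four-worker step and the single-worker step produce the same parameters. In a real system, the summation is where the communication cost arises, since every worker must exchange a gradient of the size of the model.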

Model Parallelism

In model parallelism, the DL model is split, and each worker loads a different part of the DL model for training (cf. Figure 5). The worker(s) that hold the input layer of the DL model are fed with the training data. In the forward pass, they compute their output signal which is propagated to the workers that hold the next layer of the DL model. In the backpropagation pass, gradients are computed starting at the workers that hold the output layer of the DL model, propagating to the workers that hold the input layers of the DL model.

A major challenge of model parallelism is how to split the model into partitions that are assigned to the parallel workers (Mayer:2017:TPS:3154842.3154843, ). A common approach to find a good model splitting is to use reinforcement learning (pmlr-v70-mirhoseini17a, ; mirhoseini2018a, ): Starting from some initial partitioning, permutations on that partitioning are performed, and performance is measured (e.g., for one training iteration). In case of an improvement, the permutation is maintained, and further permutations are performed, until the measured performance converges. Streaming rollout (NIPS2018_7659, ) is a specialized solution that only works for RNNs.
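The measure-and-keep-if-better loop described above can be sketched as a simple local search. Note that this is plain hill climbing, not reinforcement learning, and that the cost model `step_time` is a hypothetical stand-in for actually measuring a training iteration: it charges the load of the busiest worker plus a fixed penalty for every pair of consecutive layers placed on different workers.

```python
import random

def step_time(placement, compute_cost, comm_cost):
    """Hypothetical cost model: busiest worker's compute plus cross-worker transfers."""
    load = {}
    for layer, worker in enumerate(placement):
        load[worker] = load.get(worker, 0.0) + compute_cost[layer]
    crossings = sum(1 for a, b in zip(placement, placement[1:]) if a != b)
    return max(load.values()) + comm_cost * crossings

def search_placement(compute_cost, num_workers, comm_cost=1.0, trials=2000, seed=0):
    """Start from an initial partitioning and keep random changes that improve it."""
    rng = random.Random(seed)
    placement = [0] * len(compute_cost)        # initial partitioning: all on worker 0
    best = step_time(placement, compute_cost, comm_cost)
    for _ in range(trials):
        candidate = list(placement)
        candidate[rng.randrange(len(candidate))] = rng.randrange(num_workers)
        cost = step_time(candidate, compute_cost, comm_cost)
        if cost < best:                        # keep the change only if it helps
            placement, best = candidate, cost
    return placement, best

layer_costs = [4.0] * 6                        # six equally expensive layers (assumed)
placement, cost = search_placement(layer_costs, num_workers=2)
print(placement, cost)
```

Such a greedy search can get stuck in a local optimum (for example, an alternating placement that no single move improves), which is one reason why more elaborate search strategies are used in practice.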

The main advantage of model parallelism is the reduced memory footprint. As the model is split, less memory is needed for each worker. This is useful when the complete model is too large to fit on a single device. This can be the case when the device consists of specialized hardware such as GPUs or TPUs. The main disadvantage of model parallelism is the heavy communication that is needed between workers. As DL models are hard to split effectively, workers may stall due to communication overhead and synchronization delays. Hence, increasing the degree of model parallelism does not necessarily lead to training speedup (pmlr-v70-mirhoseini17a, ).

Pipeline Parallelism
Figure 6. Pipeline parallelism. “B” - Backpropagation. Figure adapted and extended from Ref. (DBLP:journals/corr/abs-1811-06965, ).

Pipeline parallelism combines model parallelism with data parallelism. In pipeline parallelism, the model is split and each worker loads a different part of the DL model for training (cf. Figure 6). Further, the training data is split into micro-batches. Now, every worker computes output signals for a set of micro-batches, immediately propagating them to the subsequent workers. In the same way, in the backpropagation pass, the workers compute gradients for their model partition for multiple micro-batches, immediately propagating them to preceding workers. By streaming multiple micro-batches through the forward and backpropagation pass in parallel, the utilization of workers can be significantly increased compared to pure model parallelism, where only one batch is processed at a time. At the same time, the advantages of model parallelism are maintained, as a single worker does not need to hold the complete model. Current approaches that support pipeline parallelism are GPipe (DBLP:journals/corr/abs-1811-06965, ) and PipeDream (harlappipedream, ; DBLP:journals/corr/abs-1806-03377, ).
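The utilization gain from micro-batching can be illustrated with a small schedule simulation. The sketch below is an idealized model, not the scheduling algorithm of GPipe or PipeDream: it covers only the forward pass and assumes that every stage needs exactly one time step per micro-batch and that communication is free.

```python
def forward_schedule(num_stages, num_microbatches):
    """schedule[t][s] is the micro-batch that stage s processes at time step t, or None."""
    steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(steps):
        row = []
        for s in range(num_stages):
            m = t - s  # micro-batch m reaches stage s at time step m + s
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule

def utilization(schedule):
    """Fraction of (time step, stage) slots in which a stage is busy."""
    busy = sum(1 for row in schedule for slot in row if slot is not None)
    return busy / (len(schedule) * len(schedule[0]))

single = utilization(forward_schedule(num_stages=4, num_microbatches=1))
pipelined = utilization(forward_schedule(num_stages=4, num_microbatches=8))
print(single, pipelined)
```

With four stages, a single batch keeps each worker busy a quarter of the time, whereas eight micro-batches raise the utilization to 8/11 under these assumptions. The remaining idle slots are the fill and drain phases of the pipeline.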

Hybrid Parallelism

Often, DL models are complex and composed of many different layers that follow a completely different architecture which, in turn, requires different parallelization methods. Hence, hybrid approaches that mix data, model and pipeline parallelism are common.

Mesh-TensorFlow (shazeer2018mesh, ) is a language extension of TensorFlow that allows for combining data parallelism and model parallelism. In Mesh-TensorFlow, tensors can be split across a “mesh” of processors (such as CPUs, GPUs or TPUs) along any dimensions specified by the user. To achieve data parallelism, the user specifies to split a tensor across the data dimension; to achieve model parallelism, a tensor can be split across any of its dimensions / attributes.
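The idea of splitting tensors along different dimensions can be shown with a single matrix multiplication in plain Python. This sketch does not use the Mesh-TensorFlow API; the helpers `split_rows` and `split_cols` are made up for illustration and assume that the split dimension is divisible by the number of workers.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def split_rows(m, parts):
    """Split a matrix into equally sized blocks of rows."""
    size = len(m) // parts
    return [m[p * size:(p + 1) * size] for p in range(parts)]

def split_cols(m, parts):
    """Split a matrix into equally sized blocks of columns."""
    size = len(m[0]) // parts
    return [[row[p * size:(p + 1) * size] for row in m] for p in range(parts)]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # batch of 4 samples with 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]       # layer weights: 2 inputs, 4 outputs

full = matmul(x, w)

# Data parallelism: split x along the batch dimension, replicate w on each worker.
data_parallel = [row for part in split_rows(x, 2) for row in matmul(part, w)]

# Model parallelism: replicate x, split w along its output dimension.
col_outputs = [matmul(x, part) for part in split_cols(w, 2)]
model_parallel = [sum((out[i] for out in col_outputs), []) for i in range(len(x))]

print(full == data_parallel == model_parallel)
```

Both splits reproduce the unsplit result. They differ in what each worker must hold (all weights versus all inputs) and in how the partial outputs are reassembled (stacking rows versus concatenating columns).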

There are a couple of papers that propose optimizations of parallelization that are manually designed by domain experts. Krizhevsky (DBLP:journals/corr/Krizhevsky14, ) proposed to apply data parallelism for convolutional and pooling layers, as those layers are compute-heavy and only have few parameters, and model parallelism for fully-connected layers, as they are light in computation, but have many parameters. In Google’s Neural Machine Translation System (GNMT) (DBLP:journals/corr/WuSCLNMKCGMKSJL16, ) that powers Google Translate, they apply data parallelism, but combine it with hand-crafted model parallelism for each model replica.

Beyond manually designed hybrid models, recently, automated optimization approaches have been developed. Jia et al. (pmlr-v80-jia18a, ) propose “layer-wise” parallelization. For each layer of a DNN, an optimal parallelization method is chosen along the tensors’ dimensions at the layer. To do so, they employ a cost model and a graph search algorithm on a reduced graph that models the solution space. FlexFlow by Jia et al. (DBLP:journals/corr/abs-1807-05358, ) is an automatic parallelization optimizer that employs an execution simulator. It optimizes parallelism across four dimensions, referred to as the SOAP space: the sample, operation, attribute and parameter dimension. The sample dimension refers to batches of training data and corresponds to data parallelism. The operation dimension refers to artificial neurons, the attribute dimension refers to the attributes of the tensors, and the parameter dimension refers to the weights and other model parameters. Together, the operation, attribute and parameter dimensions correspond to model parallelism (pmlr-v80-jia18a, ).

3.2.2. Optimizations for Data Parallelism

Parameter synchronization in data-parallel DL systems poses three major challenges. The first challenge is how to synchronize the parameters. Should the workers synchronize via a centralized architecture or in a decentralized manner? The second challenge is when to synchronize the parameters. Should the workers be forced to synchronize after each batch, or do we allow them more freedom to work with potentially stale parameters? The third challenge is how to minimize communication overhead for synchronization.

System Architecture
Figure 7. Parameter server architecture.
Figure 8. All-reduce architecture.

The system architecture describes how the parameters of the different replicas are synchronized. There are two main architectures: The centralized architecture with parameter server and the decentralized architecture without parameter server. We also discuss federated architectures.

(1) Centralized. In the (logically) centralized architecture, workers periodically report their computed parameters or parameter updates to a (set of) parameter server(s) (PSs) (cf. Figure 7). Roots of the PS architecture go back to the blackboard architecture (Smola:2010:APT:1920841.1920931, ) and MapReduce (Dean:2008:MSD:1327452.1327492, ), as Alex Smola reports (smola_on_ps, ). The PS architecture is the most prominent architecture of data parallel DL systems. A common approach is to use sharding of the model parameters and distribute the shards on multiple PSs which then can be updated in parallel (dean2012large, ). Among the systems that use a parameter server architecture are GeePS (Cui:2016:GSD:2901318.2901323, ), DistBelief (dean2012large, ), TensorFlow (199317, ), Project Adam (186212, ), Poseidon (Zhang:2017:PEC:3154690.3154708, ), SINGA (Ooi:2015:SDD:2733373.2807410, ), SparkNet (moritz2015sparknet, ) and the system by Yan et al. (Yan:2015:PMS:2783258.2783270, ).

(2) Decentralized. The decentralized architecture works without a PS. Instead, the workers exchange parameter updates directly via an allreduce operation. In doing so, the topology of the workers plays an important role. A fully connected network, where each worker communicates with each other worker, has a communication cost that is in $O(n^2)$ with $n$ workers, so that communication becomes a bottleneck. A common alternative is to employ a ring topology (referred to as ring-allreduce). Horovod (DBLP:journals/corr/abs-1802-05799, ) from Uber uses NCCL to implement ring-allreduce. Baidu had one of the first proposals of using ring-allreduce for data parallel DL training (baidu-allreduce, ). The multi-GPU framework in Tencent’s Mariana DL system (Zou:2014:MTD:2733004.2733082, ) employs a similar linear topology for parameter exchange across workers. Other topologies that have been proposed are “Butterfly” (doi:10.1137/1.9781611972832.87, ), a tree (JMLR:v15:agarwal14a, ), and a graph that is built based on a Halton sequence (Li:2015:MDD:2741948.2741965, ). Wang et al. (Wang:2014:STC:2637166.2637231, ) propose a parameter sharing protocol that allows for arbitrary loop-free worker topologies that can also be dynamically changed at system run-time. The main drawback of topologies other than the fully connected one is that the propagation of parameter updates to all workers needs more time, as there may be multiple hops between a pair of workers.
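To make the ring topology more concrete, the following is a minimal single-process simulation of a textbook ring-allreduce (sum). It is a sketch under stated assumptions, not the implementation used by Horovod, NCCL or Baidu: the function name `ring_allreduce` is hypothetical, workers are modelled as list indices, and the vector length is assumed to be divisible by the number of workers. Each worker only ever sends to its ring successor.

```python
def ring_allreduce(vectors):
    """Simulate ring-allreduce (sum) over equally long per-worker vectors.

    Returns (result_per_worker, values_sent_per_worker).
    """
    n = len(vectors)
    length = len(vectors[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the number of workers")
    chunk = length // n
    data = [list(v) for v in vectors]           # each worker's local buffer
    sent_per_worker = 0

    def exchange(step, first_chunk_offset, combine):
        """Every worker i sends one chunk to its ring successor (i + 1) % n."""
        messages = []
        for i in range(n):                       # collect first: sends are simultaneous
            c = (i + first_chunk_offset - step) % n
            messages.append((c, data[i][c * chunk:(c + 1) * chunk]))
        for i, (c, payload) in enumerate(messages):
            dst = (i + 1) % n
            for k, value in enumerate(payload, start=c * chunk):
                data[dst][k] = combine(data[dst][k], value)

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(step, 0, lambda local, received: local + received)
        sent_per_worker += chunk
    # Phase 2 (allgather): the reduced chunks travel once around the ring.
    for step in range(n - 1):
        exchange(step, 1, lambda local, received: received)
        sent_per_worker += chunk
    return data, sent_per_worker


vectors = [[1, 2, 3, 4, 5, 6],
           [10, 20, 30, 40, 50, 60],
           [100, 200, 300, 400, 500, 600]]
result, sent = ring_allreduce(vectors)
assert all(r == [111, 222, 333, 444, 555, 666] for r in result)
```

In this scheme each worker sends 2(n - 1) chunks of size length / n, so the volume sent per worker stays below twice the vector length regardless of the number of workers, at the price of 2(n - 1) sequential communication steps.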

The topology of the workers is not the only knob to reduce network load. Ako by Watcharapichat et al. (Watcharapichat:2016:ADD:2987550.2987586, ) employs a fully connected network of workers, but partitions the gradients that are exchanged between workers (partial gradient exchange). In each round of synchronization, each worker only sends a single partition of the gradients to every other worker; in particular, it may send different partitions to different workers. Clearly, the communication overhead depends both on the size of a partition (which itself depends on the number of partitions) as well as on the number of workers. The number of partitions is adapted automatically in such a way that the network bandwidth remains constant independently of the number of workers.

Comparison to centralized architecture.  The advantages of the decentralized architecture compared to the centralized one are as follows (Li:2015:MDD:2741948.2741965, ). By using the decentralized architecture, one avoids the need to deal with the inconveniences of implementing and tuning a parameter server. This is not only a matter of the complexity of the system code, but also eases the deployment. One does not need to plan which resources to allocate for the parameter servers and for the workers. A further advantage is that fault tolerance can be achieved more easily, because there is no single point of failure such as the parameter server. When a node in the decentralized architecture fails, other nodes can easily take over its workload and the training proceeds without interruptions. Heavy-weight checkpointing of the parameter server state is not necessary.

The decentralized architecture also has disadvantages. First and foremost, communication in the decentralized architecture increases quadratically with the number of workers, if no counter-measures are taken. As discussed above, those counter-measures, such as changing the topology or partitioning the gradients, induce new complexities and trade-offs. Overall, there is no silver bullet for the problem of synchronizing parallel parameter updates.

A case study by Lian et al. (lian2017can, ) indicates that the decentralized architecture can, under certain conditions, perform better than the centralized architecture if the communication network is slow. However, their study is limited to synchronous parameter updates and the centralized architecture they compare to employs only a single parameter server. In such a setting, the network connecting the single central parameter server quickly becomes the bottleneck. Similar results have been reported by Iandola et al. (Iandola_2016_CVPR, ) who also prefer a tree-structured allreduce architecture to a single parameter server.

(3) Federated. Both the centralized and the decentralized architecture assume a controlled environment (such as a data center), a balanced and i.i.d. distribution of the training data to the workers, and a network with homogeneous and high bandwidth. In contrast to this, federated learning (45648, ) revolves around a scenario where the training data is kept locally on users’ mobile devices, and a global model is trained based on updates that the users compute on their local devices. That way, training data, which may contain privacy-sensitive information, can be completely kept locally, which can also decrease the bandwidth requirements between the mobile devices and the central data center.

The low and asymmetric bandwidth (i.e., the uplink is usually much slower than the downlink) of a mobile device’s Internet connection makes it impossible to repeatedly upload the updated parameters of a large model to a centralized parameter server or to decentralized peer nodes. Konečný et al. (45648, ) study different forms of parameter sampling and compression to mitigate this problem. McMahan et al. (44822, ) propose the federated averaging algorithm for reducing the parameter updates. Their algorithm is round-based: In each round, a fraction of the clients is selected. Each selected client computes the gradient of the loss function over all the training data that it holds. To reach convergence, it is important that the model instances on the clients start from the same random initialization. Finally, a central server aggregates the gradients from the selected clients. In a comparative performance study by Nilsson et al. (Nilsson:2018:PEF:3286490.3286559, ), the authors show that federated averaging is the best algorithm for federated learning, and is practically equivalent to the centralized architecture when i.i.d. training data is used. However, in the non-i.i.d. case, the centralized approach performs better than federated averaging.
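The round structure described above (select a fraction of clients, compute local gradients, aggregate on a central server) can be sketched as follows. This is an illustrative toy under several assumptions: a one-parameter linear model, synthetic client data, gradients weighted by each client's share of the selected samples, and hypothetical function names (`local_gradient`, `federated_round`, `make_client`). It is not the reference implementation of federated averaging.

```python
import random

def local_gradient(w, data):
    """Mean gradient of 0.5 * (w * x - y)^2 over one client's local data."""
    return sum((w * x - y) * x for x, y in data) / len(data)

def federated_round(w, clients, fraction, lr, rng):
    """One round: sample clients, collect their gradients, aggregate centrally."""
    k = max(1, int(fraction * len(clients)))
    selected = rng.sample(clients, k)
    total = sum(len(data) for data in selected)
    # Assumption: weight each client's gradient by its share of the selected samples.
    grad = sum(len(data) / total * local_gradient(w, data) for data in selected)
    return w - lr * grad

def make_client(rng, n, true_w):
    """Synthetic local data set of n noise-free samples (x, true_w * x)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x) for x in xs]

rng = random.Random(0)
clients = [make_client(rng, rng.randint(5, 20), true_w=3.0) for _ in range(10)]

w = 0.0   # every client starts each round from the same global parameter
for _ in range(200):
    w = federated_round(w, clients, fraction=0.3, lr=0.5, rng=rng)
```

Because the raw training samples never leave `local_gradient`, only one number per selected client is communicated per round in this toy setting; in a real model it would be one value per parameter, which is what the sampling and compression techniques above aim to reduce.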

Synchronization Model | System Architecture
Ref. Name Sync. Bound. Async. Async. Centralized Decentralized Federated Year Main Concepts
(recht2011hogwild, ) Hogwild x x 2011 Lock-free updates
(dean2012large, ) Downpour SGD x x 2012 Parameter sharding, asynchronous SGD
(181983, ) Cipar et al. x x 2013 Introduces Stale Synchronous Parallel (SSP)
(noel2014dogwild, ) Dogwild x x 2014 Distributed Hogwild (recht2011hogwild, )
(Cui:2014:EBS:2643634.2643639, ) Cui et al. x x 2014 Applies SSP (181983, )
(186214, ) Li et al. x x x x 2014 Flexible consistency
(Dai:2015:HDM:2887007.2887019, ) Dai et al. x x 2015 Introduces Eager SSP
(Li:2015:MDD:2741948.2741965, ) MALT x x 2015 Shared memory abstraction
(7837887, ) Hogwild++ x x 2016 NUMA-aware Hogwild (recht2011hogwild, )
(Cui:2016:GSD:2901318.2901323, ) GeePS x x x x 2016 GPU-specialized PS
(Jiang:2017:HDP:3035918.3035933, ) Jiang et al. x x 2017 Dynamic learning rates on SSP (181983, )
(Wang:2018:ASP:3274808.3274828, ) A-BSP x x x 2018 Aggressive synchronization
(DBLP:journals/corr/abs-1901-02244, ) CROSSBOW x x 2019 Synchronous model averaging
Table 1. Categorization of approaches on parameter synchronization in data-parallel training.

The question of when to synchronize the parameters between the parallel workers has received a lot of attention. Overall, there are three main approaches: synchronous, bounded asynchronous, and asynchronous training. We discuss those approaches and related literature in the following. Table 1 provides an overview and categorization of the most relevant publications.

(1) Synchronous. In synchronous training, after each iteration (processing of a batch), the workers synchronize their parameter updates. Such a strict model can be implemented by well-known abstractions such as the Bulk Synchronous Parallel (BSP) model (Valiant:1990:BMP:79173.79181, ), which are in many cases already available in data analytics platforms such as Hadoop / MapReduce (Dean:2008:MSD:1327452.1327492, ), Spark (180560, ; JMLR:v17:15-237, ) or Pregel (Malewicz:2010:PSL:1807167.1807184, ). The advantage of strict synchronization is that reasoning about the model convergence is easier. However, strict synchronization makes the training process prone to the straggler problem, where the slowest worker slows down all others (181983, ).

GeePS (Cui:2016:GSD:2901318.2901323, ) by Cui et al. is a parameter server implementation that is tailored to GPUs. This includes a couple of optimizations such as pre-built indexes, caching, data staging and memory management. While GeePS supports synchronous, bounded asynchronous and asynchronous parameter synchronization, it is designed to minimize the straggler problem on GPUs, and hence, achieves best convergence speed when using the synchronous approach. Wang et al. (Wang:2018:ASP:3274808.3274828, ) propose an aggressive synchronization scheme that is based on BSP, named A-BSP. Different from BSP, A-BSP allows the fastest task to fetch current updates generated by the other (straggler) tasks that have only partially processed their input data. The authors have implemented A-BSP both on Spark (180560, ; JMLR:v17:15-237, ) as well as on the Petuum system (7239545, ). CROSSBOW (DBLP:journals/corr/abs-1901-02244, ) by Koliousis et al. introduces synchronous model averaging (SMA). In SMA, data-parallel workers access a global average model in order to coordinate with each other. In particular, the workers independently learn their model replica on their respective shard of the training data, but correct their model parameters according to the difference of their local models to the global average model.

(2) Bounded asynchronous. Asynchronous training makes use of the approximate nature of ML/DL training. Recall that DL models are mathematical functions that approximate the target function as well as possible (cf. Section 2.2). Hence, small deviations and non-determinism in the training process do not necessarily harm the model accuracy. This is different from “strict” problems in data analytics, such as database queries, which are required to return a deterministic result. In bounded asynchronous training, workers may train on stale parameters, but the staleness is bounded (181983, ). Bounded staleness allows for a mathematical analysis and proof of the model convergence properties. The bound gives the workers more freedom to make training progress independently of each other, which mitigates the straggler problem to some extent and increases throughput.

Cipar et al. introduced the Stale Synchronous Parallel (SSP) model (181983, ). Different from the BSP model, SSP allows for bounded staleness of the workers, i.e., there may be a delay between a worker updating the parameters and the effects of that update being visible to other workers. This delay is given in terms of a number of iterations. A follow-up paper by Cui et al. (Cui:2014:EBS:2643634.2643639, ) proposes an implementation of SSP for ML jobs. Dai et al. (Dai:2015:HDM:2887007.2887019, ) perform a theoretical analysis of SSP, comparing it against a theoretically optimal (but practically not implementable) approach. In the course of their analysis, they propose Eager SSP (ESSP), which is a novel implementation of the SSP model. In ESSP, workers eagerly pull updates from the parameter servers, as opposed to SSP where updates are only pulled when the worker state becomes too stale. ESSP is implemented in the Petuum system (7239545, ). The parameter server by Li et al. (186214, ) has a flexible consistency model that also supports bounded delays. Jiang et al. (Jiang:2017:HDP:3035918.3035933, ) propose to use dynamic learning rates on top of SSP to account for heterogeneous workers. Depending on a worker’s speed, its learning rate is adapted such that stale updates have a less significant effect on the global parameters than fresh updates.
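The staleness bound can be expressed as a simple admission check on per-worker iteration counters. The sketch below is an illustration of the idea only, not the SSP implementation from the cited papers; the class `SspClock` and the helper `iterations_before_blocking` are hypothetical names. A staleness of 0 degenerates to BSP-style lockstep execution.

```python
class SspClock:
    """Tracks per-worker iteration counters under a staleness bound."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def may_start_next(self, worker):
        # Allowed only while the worker is at most `staleness` completed
        # iterations ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        self.clocks[worker] += 1


def iterations_before_blocking(staleness, attempts=10):
    """Worker 0 runs as fast as it can while worker 1 makes no progress."""
    clock = SspClock(num_workers=2, staleness=staleness)
    done = 0
    for _ in range(attempts):
        if not clock.may_start_next(0):
            break                      # worker 0 has to wait for the straggler
        clock.finish_iteration(0)
        done += 1
    return done


# With staleness 0 the fast worker blocks after a single iteration (lockstep);
# a larger bound lets it run further ahead of the straggler.
assert iterations_before_blocking(0) < iterations_before_blocking(2)
```

The eager variant (ESSP) described above would not change this check; it changes when fresh parameters are pulled, which this sketch does not model.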

(3) Asynchronous. In asynchronous training, workers update their model completely independently from each other. There are no guarantees on a staleness bound, i.e., a worker may train on an arbitrarily stale model. This makes it hard to reason mathematically about the model convergence. On the other hand, it provides the workers with the greatest possible flexibility in their training process and avoids the straggler problem altogether.

Hogwild (recht2011hogwild, ) by Recht et al. is an asynchronous implementation of parallel SGD. The parameter update scheme of Hogwild grants the workers access to shared memory without any locks, i.e., workers can overwrite each other’s updates of the model parameters. This seems dangerous due to the lost update problem: New model parameters written by one worker could directly be overwritten by another worker and, hence, would not have any effect. However, the authors show that as long as the updates of the single workers only modify small parts of the model, Hogwild achieves nearly optimal convergence. By foregoing locks, Hogwild is an order of magnitude faster than update schemes that lock the model parameters before each update. The Hogwild scheme has been successfully applied to the training of neural networks (ParallelizationofNeuralNetworkTrainingforNLPwithHogwild, ). Dogwild (noel2014dogwild, ) by Noel and Osindero is a distributed implementation of Hogwild. The authors report that using UDP congested the network stack, while using TCP did not fully utilize the communication bandwidth and also caused latency spikes, so that they use raw sockets instead. Hogwild++ (7837887, ) by Zhang et al. is an adaptation of Hogwild to NUMA-based memory architectures. Downpour SGD (dean2012large, ) by Dean et al. is an asynchronous SGD procedure tailored to large clusters of commodity machines. Among the main concepts of Downpour SGD are the sharded parameter server and the application of adaptive learning rates (Duchi:2011:ASM:1953048.2021068, ). Different from Hogwild, which is lock-free, Downpour SGD uses lock-guarded parameter increments. MALT (Li:2015:MDD:2741948.2741965, ) by Li et al. is an asynchronous ML framework that follows the decentralized architecture. It provides a shared memory abstraction for the workers that provides a scatter/gather interface as well as a higher-level vector object library.
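The lock-free update scheme of Hogwild can be sketched with threads that modify a shared parameter list without any synchronization. This is a toy illustration under stated assumptions: the objective is a separable quadratic in which every sample touches exactly one coordinate (the sparse regime in which Hogwild is analyzed), and `hogwild_worker` is a hypothetical name. Under CPython's global interpreter lock the threads interleave rather than run truly in parallel, so the sketch shows the update scheme, not the speed-up.

```python
import random
import threading

def hogwild_worker(weights, targets, steps, lr, seed):
    """Run SGD steps on shared `weights` without taking any lock."""
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(weights))      # sparse sample: touches one coordinate
        grad = weights[i] - targets[i]       # gradient of 0.5 * (w_i - t_i)^2
        weights[i] -= lr * grad              # unsynchronized read-modify-write

targets = [float(i) for i in range(8)]
weights = [0.0] * len(targets)               # shared model state, no lock around it

threads = [
    threading.Thread(target=hogwild_worker, args=(weights, targets, 2000, 0.1, seed))
    for seed in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Largest remaining deviation from the optimum after all workers have finished.
max_error = max(abs(w - t) for w, t in zip(weights, targets))
```

A lock-guarded variant in the spirit of Downpour SGD's parameter increments would wrap the read-modify-write in a `threading.Lock`; the trade-off is the per-update locking cost that Hogwild avoids.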

Communication Optimization | Synchronization Model | System Architecture
Ref. Name Precision Compress. Comm. Sched. Sync. Bound. Async. Async. Centralized Decentralized Federated Year
(seide-compress-gradients, ) Seide et al. x x x 2014
(186214, ) Li et al. x x x x x 2014
(Gupta:2015:DLL:3045118.3045303, ) Gupta et al. x x x x x x x 2015
(Wei:2015:MCC:2806777.2806778, ) Bösen x x x 2015
(190634, ) MLNet x x x x 2015
(DBLP:journals/corr/ZhouNZWWZ16, ) DoReFa-Net x x x x x x x 2016
(qsgd-alistarh, ) QSGD x x x x x 2017
(wen2017terngrad, ) TernGrad x x x 2017
(lin2018deep, ) Lin et al. x x x x 2018
(216799, ) eSGD x x x 2018
(DBLP:journals/corr/abs-1803-03383, ) HALP x x x x x x x 2018
(TicTac, ) TicTac x x x 2019
Table 2. Categorization of approaches on efficient communication in data-parallel training.

Synchronizing the model replicas in data-parallel training requires communication between workers and between workers and parameter servers (in the centralized architecture). Such communication can easily become the bottleneck of the overall training process. Research on efficient communication methods tries to mitigate this problem. We identified three main approaches for communication efficiency: (1) Reducing the model precision, (2) compressing the model updates, and (3) improving the communication scheduling. The current landscape of communication approaches is categorized in Table 2. In the following, we provide a detailed description of the approaches.

(1) Reducing the model precision. Reducing the precision of the parameters of the model saves communication bandwidth when parameter updates need to be transferred over the network. Additionally, it reduces the model size, which can be useful when the model is deployed on resource-constrained hardware such as GPUs. Precision reduction can be achieved by reducing the precision of the parameters’ data types, e.g., from double-precision to single-precision floating point or even less.

Gupta et al. (Gupta:2015:DLL:3045118.3045303, ) limited the numerical precision of DL models to 16-bit fixed-point arithmetic. They found that when applying stochastic rounding as opposed to the common round-to-nearest method, the scheme with limited precision achieves nearly the same model accuracy as the traditional 32-bit floating point arithmetic that is typically used in DL. This allows for reducing the model size by half. When applied to a data-parallel DL system, this will also reduce the network bandwidth needed for communicating parameter updates between workers and parameter servers; the approach itself does not depend on a specific synchronization method or parallel architecture. DoReFa-Net (DBLP:journals/corr/ZhouNZWWZ16, ) by Zhou et al. focuses on CNNs. Their main idea is to reduce the numerical precision of weights, activations and gradients to different bit-widths. They report using 1-bit weights, 2-bit activations and 6-bit gradients on the AlexNet CNN (Krizhevsky:2012:ICD:2999134.2999257, ) while still reaching an accuracy that is competitive with a 32-bit representation. High-accuracy low-precision (HALP) by De Sa et al. (DBLP:journals/corr/abs-1803-03383, ) is an algorithm that combines two optimization techniques in order to reach high model accuracy despite limited parameter precision. First, they use stochastic variance-reduced gradient (SVRG) (Johnson:2013:ASG:2999611.2999647, ) to reduce noise from gradient variance. Second, to reduce noise from parameter quantization, they introduce a new technique called bit centering, i.e., re-centering and re-scaling of the fixed-point representation of the parameters as the model converges. Like Gupta et al. (Gupta:2015:DLL:3045118.3045303, ), they rely on stochastic rounding for quantization.
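The difference between round-to-nearest and stochastic rounding can be sketched in a few lines. The example below is an illustration, not code from the cited papers: it uses a toy fixed-point grid with 2 fractional bits (step 0.25) instead of a 16-bit format, ignores saturation, and the function names are hypothetical. Stochastic rounding rounds up with a probability equal to the fractional remainder, so the rounded value is correct in expectation.

```python
import math
import random

def round_nearest(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return math.floor(x * scale + 0.5) / scale

def round_stochastic(x, frac_bits, rng):
    """Round x up or down to a multiple of 2**-frac_bits, unbiased in expectation."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # Round up with probability equal to the fractional remainder.
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale

rng = random.Random(0)
x = 0.3                                   # not representable on a grid with step 0.25
samples = [round_stochastic(x, 2, rng) for _ in range(20000)]
stochastic_mean = sum(samples) / len(samples)

# Round-to-nearest always yields the same grid point and thus a systematic error,
# whereas the average of the stochastic roundings stays close to x.
nearest = round_nearest(x, 2)
```

The systematic error of round-to-nearest is what accumulates over many small parameter updates; stochastic rounding trades it for zero-mean noise.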

(2) Compressing the model updates. The model updates communicated between workers and between workers and parameter servers can be compressed. Lossless compression is limited in the achievable compression rate, as redundancy in the parameter updates is typically limited. Instead, lossy compression is applied. The main methods in the literature are gradient quantization (reducing the number of bits per gradient) and gradient sparsification (communicating only important gradients that have a significant value).

Seide et al. (seide-compress-gradients, ) report on quantizing the gradients in a speech DNN to one single bit. To still achieve high accuracy, they propose a technique called error-feedback. In error-feedback, when quantizing gradients, they save the induced quantization error and add it into the respective next batch gradient before its quantization. Hence, the gradients’ information is not lost by quantization, but all gradients are eventually added into the model. TernGrad (wen2017terngrad, ) by Wen et al. introduces ternary gradients, i.e., the gradient can have the value -1, 0, or 1. To improve on the model accuracy, they propose layer-wise ternarizing (i.e., using a different quantization for each layer) and gradient clipping (i.e., limiting the magnitude of each gradient before quantizing it). QSGD (qsgd-alistarh, ) by Alistarh et al. follows a similar approach. They apply stochastic rounding (cf. Gupta et al. (Gupta:2015:DLL:3045118.3045303, ) and De Sa et al. (DBLP:journals/corr/abs-1803-03383, )) and statistical encoding; the key idea of the latter is that not all values are equally likely, which is exploited in the encoding scheme.
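The error-feedback idea can be sketched as follows. This is a simplified illustration rather than the scheme of Seide et al.: it assumes a single reconstruction magnitude per gradient vector (the mean absolute value), and `one_bit_quantize` is a hypothetical name. The property worth observing is that the transmitted values plus the final residual add up to the sum of the true gradients.

```python
def one_bit_quantize(grad, residual):
    """Quantize grad + residual to one sign bit per entry.

    Returns (quantized_values, new_residual). Assumption: one shared magnitude
    per vector, chosen as the mean absolute value of the corrected gradient.
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    scale = sum(abs(c) for c in corrected) / len(corrected)
    quantized = [scale if c >= 0 else -scale for c in corrected]
    new_residual = [c - q for c, q in zip(corrected, quantized)]
    return quantized, new_residual


gradients = [[0.5, -0.1, 0.2], [0.4, -0.3, -0.2], [0.1, 0.2, -0.6]]
residual = [0.0, 0.0, 0.0]
transmitted_sum = [0.0, 0.0, 0.0]
for grad in gradients:
    quantized, residual = one_bit_quantize(grad, residual)
    transmitted_sum = [t + q for t, q in zip(transmitted_sum, quantized)]

# Nothing is lost: what was sent plus what is still held back equals the
# sum of the original gradients (up to floating point error).
true_sum = [sum(column) for column in zip(*gradients)]
drift = max(abs(t + r - s) for t, r, s in zip(transmitted_sum, residual, true_sum))
```

On the wire, each entry needs only its sign bit plus the shared magnitude, instead of a full 32-bit value.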

Besides quantization, another common technique is gradient sparsification. It is based on the observation that in the training process, many gradients are very small (i.e., have a value close to 0) and do not contribute much to the training. By leaving out gradients with insignificant values, the communication volume can be reduced. The parameter server by Li et al. (186214, ) allows for gradient sparsification via user-defined filters. eSGD (216799, ) is a gradient sparsification approach for federated architectures. Lin et al. (lin2018deep, ) propose a gradient sparsification approach that is based on a threshold. Only gradients larger than the threshold are transmitted. The rest of the gradients are accumulated until the threshold is reached. This is similar to the error-feedback that Seide et al. (seide-compress-gradients, ) proposed for quantization. Lin et al. combine their sparsification approach with momentum correction to mitigate issues introduced by the transmission of accumulated small gradients. Further, they apply gradient clipping.
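Threshold-based sparsification with local accumulation can be sketched in the same spirit. The example assumes a fixed absolute threshold and omits momentum correction and gradient clipping; `sparsify` is a hypothetical helper, not the interface of any of the cited systems. Small gradients are held back and summed locally until their accumulated value crosses the threshold.

```python
def sparsify(grad, accumulated, threshold):
    """Return ({index: value} to transmit, updated local accumulator)."""
    message = {}
    for k, g in enumerate(grad):
        accumulated[k] += g
        if abs(accumulated[k]) >= threshold:
            message[k] = accumulated[k]      # send the accumulated value
            accumulated[k] = 0.0             # and reset the local buffer
    return message, accumulated


accumulated = [0.0, 0.0, 0.0]
steps = [[0.9, 0.05, -0.02], [0.2, 0.04, -0.5], [0.0, 0.03, 0.01]]
messages = []
for grad in steps:
    message, accumulated = sparsify(grad, accumulated, threshold=0.1)
    messages.append(message)

# Coordinate 1 is transmitted only in the third step, once its small
# gradients have accumulated beyond the threshold.
sent_indices = [sorted(m) for m in messages]
```

Only index/value pairs for the selected coordinates are communicated, so the message size scales with the number of significant gradients rather than with the model size.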

(3) Communication scheduling. Communication patterns in data-parallel DL are typically bursty, especially in strictly synchronous systems: All workers may share their updated parameters at the same time with their peer workers or parameter servers. To avoid exceeding the network bandwidth and delaying communication, the communication of the different workers can be scheduled such that it does not overlap. Furthermore, when bandwidth is constrained, but too many parameter updates are to be sent, communication scheduling can prioritize specific messages over others, e.g., depending on freshness or on significance for the model convergence.

Bösen (Wei:2015:MCC:2806777.2806778, ) by Wei et al. maximizes network communication efficiency by prioritizing updates that are most significant to the model convergence. TicTac (TicTac, ) by Hashemi et al. is a system for communication scheduling in synchronous centralized architectures. They observe that in many ML/DL systems such as TensorFlow and PyTorch, parameters are transmitted randomly in the training and inference process. This results in high variance in iteration time, which slows down the process. To overcome that problem, TicTac enforces a schedule of network transfers that optimizes the iteration time. MLNet (190634, ) by Mai et al. is a communication layer for centralized data-parallel ML. They combine a tree-shaped communication overlay with traffic control and prioritization to mitigate network bottlenecks.
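Prioritizing updates under a bandwidth budget can be sketched with a priority queue. The example is a generic illustration and does not reproduce the policies of Bösen, TicTac or MLNet; the function `schedule_updates`, the per-update "significance" score and the byte budget are all assumptions made for the sketch.

```python
import heapq

def schedule_updates(updates, budget_bytes):
    """Pick updates to send in this round, most significant first.

    updates: list of (name, size_bytes, significance) tuples.
    Returns (sent_names, deferred_names).
    """
    # heapq is a min-heap, so significance is negated to pop the largest first.
    heap = [(-significance, name, size) for name, size, significance in updates]
    heapq.heapify(heap)
    sent, deferred = [], []
    remaining = budget_bytes
    while heap:
        _, name, size = heapq.heappop(heap)
        if size <= remaining:
            sent.append(name)
            remaining -= size
        else:
            deferred.append(name)            # retried in a later round
    return sent, deferred


updates = [("layer1", 400, 0.9), ("layer2", 300, 0.1),
           ("layer3", 500, 0.5), ("layer4", 200, 0.7)]
sent, deferred = schedule_updates(updates, budget_bytes=1000)
```

In this variant, an update that does not fit into the remaining budget is deferred while smaller, less significant updates may still be sent; stopping at the first update that does not fit would be an equally plausible design choice.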

3.3. Scheduling and Elasticity

In this section, we analyze the scheduling problem in DL, i.e., how to map the (parallel) DL training processes (cf. Section 3.2) to the processing nodes in the distributed infrastructure (cf. Section 3.1). In particular, we regard three different aspects of scheduling in DL. First, we regard the single-tenant case (Section 3.3.1): How to map the processes of a single training job (e.g., workers and parameter servers) to the available infrastructure? In case that mapping is dynamic, and we can change the number of training processes (e.g., number of workers and number of parameter servers) as well as the infrastructure (e.g., number of compute nodes), we also talk about elasticity in the scheduling problem. Second, we regard the multi-tenant case (Section 3.3.2): Given multiple competing training jobs (each having a number of processes), how to map them to the available infrastructure? The multi-tenant case introduces additional challenges such as a larger complexity and additional requirements or constraints such as fairness among the tenants. Third, we regard a specific problem that concerns the creation of training jobs in DL, namely, the model architecture and hyper-parameter search (Section 3.3.3). This problem is tightly coupled to single-tenant and multi-tenant scheduling.

3.3.1. Single-tenant

In single-tenant scheduling, we assume a dedicated, but possibly dynamic, set of resources (compute nodes, CPUs, GPUs) that is available to host a set of processes that originate from a single DL training job. With training job, we refer to all processes involved in performing the training of a single DL model. Depending on the parallelization method, this may comprise workers that train complete (data parallelism) or partial (model parallelism) model replicas as well as parameter servers. Now, scheduling needs to answer the following questions: (1) Which process is placed on which resource (such as compute node, CPU, or GPU)? (2) When or in what order are the processes that are placed on the same resource executed? (3) When and how are the number of processes and/or resources adapted?

In model parallelism, one of the major problems to be solved is to partition the model into multiple parts. We have discussed this issue and state-of-the-art approaches for addressing it in Section 3.2.1. Once the model is partitioned, the next important questions are where to place the model parts and when to train which partition of the model. As a training iteration of a model partition can only be executed when all input data of that partition is available, there are dependencies in scheduling the different model partitions. Mayer et al. (Mayer:2017:TPS:3154842.3154843, ) have formalized the scheduling problem in model-parallel DL. While they propose a couple of heuristic algorithms, none of them has been implemented in the context of DL systems. In particular, there are interdependencies between the model partition and the scheduling problem, which are yet to be fully explored. Additional challenges arise with the advent of dynamic control flow (Yu:2018:DCF:3190508.3190551, ; Jeong:2018:IED:3190508.3190530, ) that renders static scheduling infeasible. Park et al. (park2019accelerated, ) propose layer placement, which is however limited to CNNs. STRADS (Kim:2016:SDF:2901318.2901331, ) by Kim et al. is a model-parallel ML framework with an advanced scheduler. In particular, STRADS can take into account dependency structures in model partitions and is capable of prioritizing computations. To do so, users have to implement their training task via three functions: schedule, update and aggregate. While the paper contains example implementations of classical ML algorithms such as LASSO and topic modeling, it is not straightforward to implement a model-parallel DL training job via the STRADS interface.

Litz (216041, ) by Qiao et al. is an elastic ML framework that exposes an event-driven programming model. In Litz, computations are decomposed into micro-tasks that are dynamically scheduled on a cluster. The scheduler takes into account dependencies and consistency requirements of the ML model. To enable interruption-free elasticity, the input data is “over-partitioned” across logical executors which are dynamically mapped to physical resources. This allows even for transparent scaling of stateful workers, i.e., workers that keep local state that is not shared via the parameter servers or directly with peer workers. This property is useful when different model state is affected by the training of different ranges of input data, such that for faster access that portion of the model state is directly kept at the worker.

Proteus (Harlap:2017:PAM:3064176.3064182, ) by Harlap et al. exploits transient resources such as Amazon EC2 spot instances and Google Compute Engine preemptible instances. Its main concepts are a parameter server framework that is optimized for bulk addition and revocation of transient resources, and a resource allocation component that dynamically allocates transient resources to minimize the overall monetary cost per work based on highly dynamic spot markets.

CROSSBOW (DBLP:journals/corr/abs-1901-02244, ) by Koliousis et al. is a decentralized data-parallel DL system that can automatically tune the number of workers at run-time. To do so, the number of workers is increased during the training until no more increase in training throughput can be observed. This way, the available infrastructure can be utilized in an optimal way. Further, CROSSBOW comes with a dynamic task scheduler to execute workers on GPUs based on resource availability. FlexPS (Huang:2018:FFP:3187009.3177734, ) by Huang et al. takes on the problem of varying workloads during the execution of ML/DL training. As sources of varying workloads, Huang et al. mention adaptive hyper-parameters (specifically, the batch size), and advanced SGD methods such as SVRG (Johnson:2013:ASG:2999611.2999647, ). As a result of this problem, the parallelism degree, i.e., the number of workers, needs to be adapted to re-balance the trade-off between communication and computation in data-parallel training.
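The tuning loop described for CROSSBOW (add workers until the training throughput stops improving) can be sketched as a simple hill-climbing procedure. This is an assumption-laden illustration and not CROSSBOW's actual tuner: `tune_num_workers`, the `min_gain` threshold and the synthetic `toy_throughput` model are all hypothetical.

```python
def tune_num_workers(measure_throughput, max_workers, min_gain=0.05):
    """Increase the number of workers while throughput improves by at least min_gain."""
    workers = 1
    best = measure_throughput(workers)
    while workers < max_workers:
        candidate = measure_throughput(workers + 1)
        if candidate < best * (1.0 + min_gain):
            break                              # no significant improvement any more
        workers, best = workers + 1, candidate
    return workers


def toy_throughput(n):
    """Hypothetical model: compute scales linearly, coordination cost quadratically."""
    return 100.0 * n / (1.0 + 0.05 * n * n)


chosen = tune_num_workers(toy_throughput, max_workers=16)
```

In a real system, `measure_throughput` would be an online measurement over a few training iterations rather than a closed-form function, and the loop would have to tolerate measurement noise.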

3.3.2. Multi-tenant

In a multi-tenant environment, multiple training jobs (tenants) share a common set of resources. Hence, a resource scheduler is responsible to schedule the processes of the different tenants on the resources. There is a large variety of general purpose resource schedulers such as Mesos (Hindman:2011:MPF:1972457.1972488, ), YARN (Vavilapalli:2013:AHY:2523616.2523633, ), and Borg (Verma:2015:LCM:2741948.2741964, ). However, these are not tailored to the specific properties of DL training tasks. For instance, in DL, the convergence rate of a training task varies over time. Typically, in the beginning of training, progress is made very quickly; however, as training evolves over many epochs, the improvements on model accuracy decrease. Further, different DL training jobs may have very different training curves (Zhang:2017:SQS:3127479.3127490, ). Taking into account these DL-specific properties allows for formulating new, DL-specific optimization goals, e.g., maximizing the overall training progress over all scheduled training jobs. Hence, new DL resource schedulers are being proposed.

Dolphin (lee2016dolphin, ) by Lee et al. is an elastic centralized data-parallel ML/DL framework. In Dolphin, the configuration of the parameter servers and workers is adapted dynamically according to a cost model and continuous monitoring. Here, the configuration refers to the number of servers and workers, the distribution of training data across workers and the distribution of model parameters across parameter servers. The system is implemented on top of Apache REEF (Weimer:2015:RRE:2723372.2742793, ), a framework for distributed applications. Optimus (Peng:2018:OED:3190508.3190517, ) by Peng et al. is a system that dynamically adjusts the number and placement of workers and parameter servers of a training job at run-time to achieve the best resource efficiency and training speed. To do so, it builds performance models based on sampling that estimate the number of training epochs needed until convergence and the impact of different configurations (number of workers and parameter servers) on the training speed. Then, a greedy algorithm computes the best allocation of resources to workers and parameter servers. Considering multiple concurrent training jobs to be scheduled, Optimus aims to minimize the average job completion time. An additional challenge tackled by Optimus is to divide the model parameters among the parameter servers such that the load is balanced. Compared to the general-purpose scheduling policies Dominant Resource Fairness (Ghodsi:2011:DRF:1972457.1972490, ) and Tetris (Grandl:2014:MPC:2619239.2626334, ), Optimus shows significant improvements in average job completion time and makespan, i.e., the total time elapsed from the arrival of the first job to the completion of all jobs. Jeon et al. (jeon2019analysis, ) analyze log traces from a large-scale DL cluster system. In particular, they analyze the trade-off between locality constraints and queuing delays for large training jobs that occupy a lot of (GPU) resources. Further, they observe that co-locating different jobs on the same server may significantly impact their performance. Finally, they also analyze failures in DL training and their root causes. They differentiate between failures caused by the infrastructure, by the DL framework, and by the user. Based on their analysis, they propose a couple of best practices for multi-tenant DL scheduling. First, they emphasize that locality is a major design goal of schedulers that should definitely be taken into account. Second, they highlight that isolation of jobs is important in order to avoid performance interference. Third, they propose that new jobs should first be tested on a small dedicated set of servers before being admitted to the cluster. The system of Li et al. (Li:2018:ETM:3187009.3177737, ) is an ML service platform that employs a multi-tenant resource scheduler. Users define their training jobs in a declarative language and submit them via a web interface. Then, the platform not only schedules that job on the available resources, but also automates model architecture and hyper-parameter search. The overall goal of the platform is to maximize the average model accuracy achieved among all tenants, i.e., users of the system. SLAQ (Zhang:2017:SQS:3127479.3127490, ) by Zhang et al. has a similar goal, but supports a broader set of optimization goals. It does not only maximize average accuracy, but also solves a min-max problem to provide fairness among the tenants.
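
To make the idea of a greedy resource allocation across concurrent training jobs more concrete, the following sketch hands out workers one at a time to the job whose estimated completion time shrinks the most. It is only an illustration in the spirit of the approach described for Optimus above, not its actual algorithm: the power-law speed-up model in `estimated_time`, the exponent `alpha` and all job names are hypothetical assumptions, whereas Optimus fits its performance models from sampled measurements.

```python
from typing import Dict


def estimated_time(work: float, workers: int, alpha: float = 0.8) -> float:
    # Hypothetical sub-linear scaling model: speed grows as workers ** alpha.
    return work / (workers ** alpha)


def greedy_allocate(job_work: Dict[str, float],
                    total_workers: int,
                    alpha: float = 0.8) -> Dict[str, int]:
    """Assign workers one by one to the job with the largest marginal gain."""
    if total_workers < len(job_work):
        raise ValueError("need at least one worker per job")
    alloc = {job: 1 for job in job_work}

    def marginal_gain(job: str) -> float:
        w = alloc[job]
        return (estimated_time(job_work[job], w, alpha)
                - estimated_time(job_work[job], w + 1, alpha))

    for _ in range(total_workers - len(job_work)):
        best_job = max(alloc, key=marginal_gain)
        alloc[best_job] += 1
    return alloc


# Remaining work per job in arbitrary units (made-up values).
allocation = greedy_allocate({"job-a": 100.0, "job-b": 10.0}, total_workers=4)
```

With these toy numbers, the larger job receives all spare workers because each additional worker saves more time there than on the small job.
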
Ray (Nishihara:2017:RML:3102980.3102998, ; 222605, ) from UC Berkeley is a distributed system that is specialized to support the requirements of reinforcement learning. The design of Ray makes it necessary to dynamically schedule millions of tasks per second, where each task represents a remote function invocation that may only take as little as a few milliseconds to complete. The scheduler in Ray is hierarchical with two levels: one single global scheduler and a local scheduler per node. As long as a node is not overloaded, the local scheduler schedules its tasks autonomously. However, if a local scheduler detects overload, it forwards tasks to the global scheduler, which assigns them to other nodes.
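
The two-level scheduling idea described for Ray can be sketched as follows. This is a simplified illustration and not Ray's API or implementation: the classes `LocalScheduler` and `GlobalScheduler`, the queue-length overload threshold `capacity` and the "shortest queue" placement policy are assumptions chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocalScheduler:
    node_id: str
    capacity: int  # hypothetical overload threshold (queued tasks)
    queue: List[str] = field(default_factory=list)

    def overloaded(self) -> bool:
        return len(self.queue) >= self.capacity


class GlobalScheduler:
    def __init__(self, nodes: List[LocalScheduler]) -> None:
        self.nodes: Dict[str, LocalScheduler] = {n.node_id: n for n in nodes}

    def place(self, task: str, origin: str) -> str:
        # Forwarded tasks go to the least-loaded node other than the origin.
        candidates = [n for n in self.nodes.values() if n.node_id != origin]
        if not candidates:
            raise RuntimeError("no other node available")
        target = min(candidates, key=lambda n: len(n.queue))
        target.queue.append(task)
        return target.node_id


def submit(task: str, local: LocalScheduler, global_scheduler: GlobalScheduler) -> str:
    """Schedule locally if possible; otherwise forward to the global scheduler."""
    if not local.overloaded():
        local.queue.append(task)
        return local.node_id
    return global_scheduler.place(task, origin=local.node_id)


node_a = LocalScheduler("A", capacity=2)
node_b = LocalScheduler("B", capacity=2)
global_scheduler = GlobalScheduler([node_a, node_b])
placements = [submit(f"task-{i}", node_a, global_scheduler) for i in range(3)]
```

The first two tasks stay on node A; the third one finds A overloaded and is forwarded to node B via the global scheduler.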

Besides publications that describe concrete multi-tenant schedulers, there are publications that describe DL services. IBM Fabric for Deep Learning (ffdl:17, ) (FfDL) is a cloud-based deep learning stack used at IBM by AI researchers. Based on FfDL, IBM offers DL as a Service (DLaaS) (DLaaS, ), a fully automated cloud solution for DL. Hauswald et al. (7284053, ) describe Djinn, an open infrastructure for DL as a service in large-scale distributed infrastructures, as well as Tonic, a suite of DL applications for image, speech and language processing. They analyze the workloads of their system and propose a design for large-scale infrastructures that is suitable to DL workloads. One of their findings is that employing GPUs for DL training and inference can reduce total cost of ownership tremendously compared to applying only CPUs. In their analysis, they take into account upfront capital expenditures, operating costs and financing costs. While GPUs have a higher purchase price, such investment pays off due to lower operating costs when processing DL workloads.

3.3.3. Model Architecture and Hyper-Parameter Search

Model architecture and hyper-parameter search is a crucial problem in DL training. Given a specific task (e.g., image classification), what is the best model architecture (e.g., CNN with how many layers and what layer dimensions) that can reach the best accuracy? And what are the best hyper-parameter settings to reach model convergence quickly? Finding the answer to those questions is difficult. The typical approach is to repeatedly try out different architectures and hyper-parameter settings in order to find the best one, i.e., a search based on experimental evaluations (Sparks:2015:AMS:2806777.2806945, ). The search can be random (Bergstra:2012:RSH:2503308.2188395, ) or guided by more sophisticated models, such as random forests and Bayesian optimization (Hutter:2014:EAA:3044805.3044891, ) or even reinforcement learning (baker2017designing, ; zoph2017neural, ). What all of those methods have in common is that they repeatedly spawn new training jobs with new configurations (architectures and hyper-parameter settings) that need to be scheduled on a shared set of distributed resources. Here, we discuss scheduling approaches that explicitly take into account workloads that are generated by such search strategies.

TuPAQ (Sparks:2015:AMS:2806777.2806945, ) by Sparks et al. is a system for automatically generating and executing model search configurations. Based on performance profiles provided by a domain expert, TuPAQ automatically optimizes the amount of resources for data-parallel training. Batching together training jobs that access the same training data reduces network load and allows for further optimizations in the execution. HyperDrive (Rasley:2017:HEH:3135974.3135994, ) by Rasley et al. is a scheduler that optimizes the hyper-parameter search more aggressively than TuPAQ does. In particular, HyperDrive supports early stopping of the training of poorly configured jobs. Further, by incorporating the trajectory of learning curves of the trained models, HyperDrive predicts the expected accuracy improvement. Based on that, more resources are assigned to training jobs that have a high expected accuracy improvement compared to other configurations. HiveMind (accelerating-deep-learning-workloads-through-efficient-multi-model-execution, ) by Narayanan et al. is a system designed to optimize the execution of multiple DL training jobs on a single GPU. The system executes a batch of models jointly and performs cross-model optimizations such as operator fusion (e.g., shared layers on different model architectures) and shared I/O (e.g., using the same training data for different configurations). Gandiva (222611, ) by Xiao et al. is a system that schedules sets of jobs for hyper-parameter search simultaneously on a cluster of GPU-powered compute nodes. By exploiting early feedback, subsets of the jobs can be killed and resources can be freed. Based on profiling of job execution times, Gandiva employs a fine-grained application-aware time-slicing of the GPU resources to exploit them optimally. To place the jobs on GPUs, Gandiva also takes into account their memory footprint as well as communication intensity to minimize interference between job executions.
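
Early stopping of poorly configured jobs, as used by several of the systems above, can be illustrated with a deliberately simple rule. The sketch below applies a median stopping rule, which is a stand-in chosen for brevity and not the learning-curve prediction that HyperDrive performs; the function name `jobs_to_stop` and the accuracy values are made up for illustration.

```python
from statistics import median
from typing import Dict, List


def jobs_to_stop(curves: Dict[str, List[float]], epoch: int) -> List[str]:
    """Return jobs whose accuracy at `epoch` lies below the median of all jobs."""
    scores = {job: acc[epoch] for job, acc in curves.items() if len(acc) > epoch}
    if len(scores) < 2:
        return []  # nothing to compare against
    cutoff = median(scores.values())
    return sorted(job for job, score in scores.items() if score < cutoff)


# Validation accuracy per epoch for three hypothetical configurations.
curves = {
    "lr=0.1": [0.30, 0.55, 0.70],
    "lr=0.01": [0.20, 0.40, 0.52],
    "lr=1.0": [0.10, 0.12, 0.11],
}
stopped = jobs_to_stop(curves, epoch=2)
```

The resources of the stopped jobs could then be reassigned to the remaining configurations or to newly spawned ones.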

3.4. Data Management

One of the great challenges of large-scale DL is handling the data that is involved. On the one hand, this refers to the management of training data, whose volume easily exceeds the capabilities of a single disk or multiple disks on a single server. On the other hand, it refers to the management of the DL models, both fully trained as well as snapshots of models currently in the training phase. The training and model data need to be handled in a suitable manner, while taking into account the available distributed infrastructure, the running training processes and the resource scheduling in the data center.

3.4.1. Training Data

Obtaining large labeled training data sets is a hard problem. One approach to achieve this is to resort to manual labeling. For instance, to build the ImageNet data set, the authors relied on crowd sourcing via Amazon Mechanical Turk, which led to high accuracy of the labels (5206848, ). However, manual labeling is expensive and time-consuming. Hence, there are several approaches to allow for training with highly noisy training data that can be easily obtained, e.g., from web image search. Xiao et al. (Xiao_2015_CVPR, ) embed a label noise model into a DL framework. They train two CNNs: one of the CNNs predicts the label while the other CNN predicts the noise type of the training data set. For training, they first pre-train both CNNs with clean training data. Then, they train the models with the noisy data, but mix in data with clean labels to prevent model drift. Overall, learning from noisy data is a vast research area (cf., e.g., (mnih2012learning, ; DBLP:journals/corr/SukhbaatarF14, )) which we will not cover in its entirety in this survey.

Besides obtaining data (noisy or clean), preprocessing of the training data is an important step in data management. This includes normalization such as cropping, resizing and other adjustments on image data (cirecsan2012multi, ), or data augmentation such as creating spectrograms from speech data (graves2014towards, ). Beyond normalization and augmentation, training a DL model with distorted training data can increase the model’s robustness to noisy input data (Zheng_2016_CVPR, ). Hence, preprocessing of training data takes an important role in the overall DL architecture. For instance, Project Adam and Facebook both describe that preprocessing is performed on distinct data servers (186212, ; 8327042, ).

Once the training data is obtained and preprocessed, it has to be provided to the training servers for feeding it into the DL models in the training iterations. Ozeri et al. (Ozeri:2018:OSD:3286490.3286562, ) use simple and cheap object storage to store and provide the training data. The shortcoming of object storage is that the bandwidth of data provisioning is limited to about 35 MB per second for a single request, while the throughput of training data on a machine with 4 GPUs can reach up to 570 MB per second according to the authors’ own measurements. They add a FUSE-based file system to the DL stack which translates POSIX API requests into REST API requests. To overcome the read throughput limitation, their storage layer converts a single read request into multiple concurrent requests to the object storage to yield higher aggregate bandwidth. Kubernetes Volume Controller (kvc, ) (KVC) is an advanced interface for training data management on Kubernetes clusters. It provides an abstraction on training data that can be used by the training processes, and internally manages data placement and replication transparently to the user. Hoard (DBLP:journals/corr/abs-1812-00669, ) by Pinto et al. is a distributed caching system that stripes the training data across local disks of the worker machines for fast access. Training data is loaded from the backend only once and can then be provisioned from the cache for subsequent epochs and across training tasks that use the same training data (e.g., at exploratory architecture and hyper-parameter search).
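
The idea of converting one large read into several concurrent requests, as described for the storage layer of Ozeri et al., can be sketched as follows. This is not their implementation: `parallel_read`, `split_ranges`, the chunk size and the in-memory `fake_object_store_read` (which stands in for a ranged request against an object store) are hypothetical names and values used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_ranges(offset: int, length: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split the byte range [offset, offset + length) into (start, size) chunks."""
    end = offset + length
    return [(start, min(chunk_size, end - start))
            for start in range(offset, end, chunk_size)]


def parallel_read(read_range: Callable[[int, int], bytes],
                  offset: int, length: int,
                  chunk_size: int = 4, max_concurrency: int = 8) -> bytes:
    """Serve one logical read through several concurrent ranged requests."""
    ranges = split_ranges(offset, length, chunk_size)
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() yields results in input order, so the chunks line up correctly.
        parts = pool.map(lambda r: read_range(*r), ranges)
        return b"".join(parts)


BLOB = bytes(range(32))  # stand-in for an object in the object store


def fake_object_store_read(start: int, size: int) -> bytes:
    return BLOB[start:start + size]


data = parallel_read(fake_object_store_read, offset=3, length=21)
```

In practice, the chunk size and the degree of concurrency would need to be tuned against the per-request bandwidth limit of the object store.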

3.4.2. Model Data

Managing the trained models is as important as the training process itself. According to Vartak et al. (Vartak:2016:MOD:2939502.2939516, ), model management involves tracking, storing and indexing of trained models. The goal of model management is to facilitate the sharing, querying and analyzing of the DL models. To make that possible, there are a number of current initiatives and approaches.

To facilitate interoperability between different DL frameworks, the Open Neural Network Exchange Format (ONNX) (onnx, ) is being developed. ONNX is the de-facto standard for exchange of model data between DL frameworks. DL frameworks that natively support ONNX are Caffe2, Chainer (chainer_learningsys2015, ; chainermn_mlsys2017, ), CNTK (Seide:2016:CMO:2939672.2945397, ), MXNet (mxnet_learningsys2016, ), PyTorch (paszke2017automatic, ), PaddlePaddle (paddlepaddle, ), Matlab and SAS (sas, ). Moreover, model converters are available for TensorFlow (199317, ), Keras, Apple CoreML (coreML, ), SciKit-learn (pedregosa2011scikit, ), XGBoost (xgboost, ), LIBSVM (CC01a, ), and Tencent ncnn (ncnn, ). ModelDB (Vartak:2016:MOD:2939502.2939516, ) by Vartak et al. is a system for model management that provides automatic tracking of ML models, indexing, and querying via SQL or via a visual interface. Beyond the models themselves, ModelDB also manages meta data (e.g., hyper-parameters of the training process), quality metrics and training and test data sets for each model. ModelHub (7930008, ) by Miao et al. is a system that serves a similar purpose as ModelDB. Beyond providing a versioned model storage and query engine and a domain specific language for model architecture and hyper-parameter search, ModelHub also provides a repository-based model sharing system for easy exchange of DL models between different organizations.
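
Tracking models together with their meta data and querying them via SQL can be illustrated with a small, self-contained sketch based on Python's built-in `sqlite3` module. The table schema, the function `track_model` and all example values are assumptions for illustration only and do not reflect the actual schema or API of ModelDB or ModelHub.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE models (
           id INTEGER PRIMARY KEY,
           name TEXT,
           hyperparams TEXT,   -- JSON-encoded hyper-parameters
           accuracy REAL,      -- quality metric
           dataset TEXT        -- identifier of the training data set
       )"""
)


def track_model(name: str, hyperparams: dict, accuracy: float, dataset: str) -> int:
    """Store one trained model's meta data and return its row id."""
    cur = conn.execute(
        "INSERT INTO models (name, hyperparams, accuracy, dataset) VALUES (?, ?, ?, ?)",
        (name, json.dumps(hyperparams), accuracy, dataset),
    )
    conn.commit()
    return cur.lastrowid


track_model("cnn-small", {"lr": 0.1, "layers": 4}, 0.91, "train-v1")
track_model("cnn-large", {"lr": 0.01, "layers": 12}, 0.95, "train-v1")
track_model("mlp", {"lr": 0.1, "layers": 2}, 0.82, "train-v2")

# Query: which model trained on "train-v1" reached the highest accuracy?
best = conn.execute(
    "SELECT name, accuracy FROM models WHERE dataset = ? "
    "ORDER BY accuracy DESC LIMIT 1",
    ("train-v1",),
).fetchone()
```

A production system would additionally version the model artifacts themselves; the sketch only covers the meta data index.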

4. Comparison of Deep Learning Frameworks

Since the rise of DL, a large number of different DL frameworks and tools have been developed, and many of them are open source. They implement different concepts of parallelization and distribution that we have discussed in Section 3. Having a large choice of open-source DL frameworks is one of the drivers of innovative DL research. In this section, we review and compare current open-source DL frameworks and tools.

4.1. Evaluation Criteria

We discuss and compare the frameworks according to the following criteria.

(1) Ease of use. DL frameworks should support a large range of programming languages, so that experts from different domains have easy access to them. Moreover, they should provide high-level abstractions so that a running DL use case can be created quickly without many obstacles.

(2) Distribution and parallelization. In a cloud environment, resources are available abundantly and on demand. DL frameworks should allow for easy and intuitive support for distribution and parallelization without need for custom code. We specifically examine this point with regard to the parallelization methods and optimizations we have discussed in Section 3.2.

(3) Customization. Advanced users of a DL system should have the opportunity to fine-tune their deployment according to their needs. This point relates to the DL frameworks’ support for custom definitions of the DL model and loss functions and developing custom code for parameter servers or custom topologies in decentralized systems (cf. Section 3.2.2).

(4) Community. An important aspect of open source DL frameworks is how active the community is. We measure community activity by the number of commits on the official Github repositories in the past six months as well as the total number of topics with the respective tags on StackOverflow. (Due to limitations of the StackOverflow search, we did not confine the search to recent topics, but report the overall numbers without time constraint.)

Name | Project Website | Papers | API / Integration | Ratings (Ease of use, D&P, Custom.) | Comm.: Github | Comm.: StOv
--- | --- | --- | --- | --- | --- | ---
Caffe | | (DBLP:journals/corr/JiaSDKLGGD14, ) | API: CLI, Python, Matlab | + o | 2 | 2,750
Caffe2 | | n/a | API: C++, Python | + + ++ | n/a | 116
Chainer | | (chainer_learningsys2015, ; chainermn_mlsys2017, ) | API: Python | o + ++ | 3,939 | 132
CNTK | | (Seide:2016:CMO:2939672.2945397, ) | API: C++, C#, Python, BrainScript | + + ++ | 138 | 488
Deeplearning4j | | n/a | API: Java | o o o | 390 | 243
Keras | | n/a | Integration: CNTK, DL4j, TensorFlow, Theano | + + ++ | 310 | 14,630
MXNet | | (mxnet_learningsys2016, ) | API: C++, Go, JavaScript, Julia, Matlab, Perl, Python, R, Scala, Wolfram | ++ ++ ++ | 837 | 455
PyTorch | | (paszke2017automatic, ) | API: C++, Python | + + ++ | 3,484 | 2,413
SINGA | https://singa.incubator. | (Ooi:2015:SDD:2733373.2807410, ) | API: C++, Python | + + | 44 | 0
TensorFlow | | (199317, ) | API: C++, Go, Java, JavaScript, Python, Swift | ++ ++ ++ | 10,930 | 39,334
Theano | software/theano/ | (bergstra2011theano, ) | API: Python | o o + | 55 | 2,389
Table 3. Comparison of open source DL frameworks and libraries. Qualitative rating scale: (++) very good, (+) good, (o) average, (–) poor. D&P: Distribution and Parallelization. Custom.: Customization. Comm.: Community. StOv: StackOverflow (

4.2. Detailed Analysis

In the following, we discuss the frameworks in more detail. Table 3 provides an overview including quantitative and qualitative ratings of the frameworks with regard to our evaluation criteria.

Caffe is a DL framework developed by Berkeley AI Research (BAIR) and community contributors. It comes with command line, Python and Matlab APIs. A specialty of Caffe is the model zoo, a collection of pre-trained models for an easy start. It runs on CUDA platforms (using the cuDNN library) for easy parallelization on GPUs. Caffe does not support distributed training out-of-the-box. However, there are forks and extensions of Caffe such as Intel Caffe and CaffeOnSpark that support distributed training. There is only little information available in the Caffe documentation on how to customize the framework, e.g., to develop new loss functions. As Caffe does not support multi-node deployment, custom parallelization techniques cannot be implemented either. Commit activity on Github has almost completely ceased. On StackOverflow, there are 2,750 questions tagged with “Caffe”, a high value compared to other frameworks.

Caffe2 is a successor of the Caffe framework developed by Facebook and community contributors. The API is available in C++ and Python. The models from Caffe can be easily converted to work with Caffe2. Beyond that, Caffe2 provides its own model zoo as well. Caffe2 extends Caffe in the following ways. First of all, Caffe2 naturally supports distributed training. There is native support for decentralized data-parallel training using the synchronous model; there is no support for (bounded) asynchronous training and no parameter server architecture. There is also native support for quantized models, i.e., models with reduced data type precision. Recently, the code of Caffe2 has been merged into PyTorch (paszke2017automatic, ). This makes it hard to assess the update frequency of the Caffe2 code. On StackOverflow, there are 116 questions tagged with “Caffe2”, a rather low value compared to other frameworks.

Chainer is a DL framework developed by the Japanese company Preferred Networks with several industrial partners and community contributors. It is written in Python and only has a Python interface. There is good documentation on how to write custom functions, optimizers, and trainers. ChainerMN is an extension package that enables distributed and parallel DL on multiple nodes. It supports data parallelism via a decentralized all-reduce architecture using the synchronous training method (no parameter server or asynchronous training are supported). There were 3,939 commits to the official Github repository in the past six months, which is a comparably high value. On StackOverflow, there are 132 questions tagged with “Chainer”, a rather low value compared to other frameworks.

CNTK (Microsoft Cognitive Toolkit) is a DL framework developed by Microsoft and community contributors. The API is available in C++, C# and Python. Additionally, CNTK provides a custom model description language called BrainScript. The model evaluation function can also be used from Java programs. Data-parallel and distributed training is supported out-of-the-box. Optimizations such as gradient quantization are available and easily configurable. CNTK supports the centralized architecture with parameter servers, using asynchronous training or blockwise model update and filtering (BMUF) (7472805, ), a variant of bounded asynchronous training. As of now, model parallelism is not supported by CNTK. Extending CNTK is easy. New operators, loss functions, etc. can be implemented with an API. There were 138 commits to the official Github repository in the past six months, which is a comparably low value. On StackOverflow, there are 488 questions tagged with “CNTK”, a moderate value compared to other frameworks.

Deeplearning4j is a DL framework developed by the company Skymind and community contributors organized in the Eclipse foundation. The framework is written in Java and C++ (for core components), and the API is available in Java which makes it accessible for Java, Scala and Clojure projects (but not from Python). It supports distributed and parallel training by using Spark. There are two variants of data-parallel training implemented. First, a decentralized asynchronous approach proposed by Strom (strom2015scalable, ) that also incorporates compression of gradients. Second, centralized synchronous training with a single parameter server. There is no support for model parallelism. It is easily possible to create custom layer implementations, but more sophisticated customization (loss functions, parallelization configurations, etc.) is not supported. There were 390 commits to the official Github repository in the past six months, which is a moderate value. On StackOverflow, there are 243 questions tagged with “Deeplearning4j”, a rather low value compared to other frameworks.

Keras is not a DL framework, but a DL library that can be integrated into many other DL frameworks, such as CNTK, Deeplearning4j, TensorFlow and Theano. It is developed as a community project, initiated by F. Chollet. Keras is written in Python which allows for its easy integration into other Python-based frameworks. Parallel training on GPUs is naturally supported; higher-level parallelization concepts are up to the DL framework that uses Keras. The library is easily extensible with new modules. There were 310 commits to the official Github repository in the past six months, which is a moderate value. On StackOverflow, there are 14,630 questions tagged with “Keras”, a very high value compared to other frameworks.

MXNet is a DL framework and an Apache project (incubating). Its API is available for C++, Python, Julia, Matlab, JavaScript, Go, R, Scala, Perl, and Wolfram Language. MXNet supports a wide range of parallelization approaches. Model parallelism is supported for multiple GPUs on a single node; there is no support for multi-node model parallelism though. Data parallelism is realized via the centralized architecture with support for using multiple parameter servers via a sharded key-value store. Both synchronous and asynchronous training are supported out-of-the-box. MXNet also supports gradient quantization. It is easy to implement custom operators or layers as well as loss functions. There were 837 commits to the official Github repository in the past six months, which is a moderate value. On StackOverflow, there are 455 questions tagged with “MXNet”, a moderate value compared to other frameworks.
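
The concept of a sharded key-value store acting as a set of parameter servers can be sketched in a few lines. The class below is a conceptual illustration only and is not MXNet's KVStore API: the class name `ShardedKVStore`, its methods, the CRC32-based key placement and the use of scalar parameters are simplifying assumptions.

```python
import zlib
from typing import Dict, List


class ShardedKVStore:
    """Toy parameter store whose keys are spread over several servers."""

    def __init__(self, num_servers: int, learning_rate: float = 0.1) -> None:
        self.shards: List[Dict[str, float]] = [{} for _ in range(num_servers)]
        self.lr = learning_rate

    def server_for(self, key: str) -> int:
        # Deterministic hash so that all workers map a key to the same server.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def init(self, key: str, value: float) -> None:
        self.shards[self.server_for(key)][key] = value

    def push(self, key: str, worker_grads: List[float]) -> None:
        # Synchronous training: average the gradients of all workers,
        # then apply a single SGD step on the responsible server.
        shard = self.shards[self.server_for(key)]
        avg_grad = sum(worker_grads) / len(worker_grads)
        shard[key] -= self.lr * avg_grad

    def pull(self, key: str) -> float:
        return self.shards[self.server_for(key)][key]


store = ShardedKVStore(num_servers=3)
store.init("layer1.weight", 1.0)
store.push("layer1.weight", [0.2, 0.4])  # gradients from two workers
updated = store.pull("layer1.weight")
```

An asynchronous variant would apply each worker's gradient as soon as it arrives instead of waiting for all workers before updating.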

PyTorch is a DL framework developed by Facebook and community contributors. Its API is available for C++ and Python. PyTorch has native support for distributed, data-parallel training, as well as model-parallel training. For data-parallel training, PyTorch implements the decentralized architecture and supports synchronous as well as asynchronous training. Gradient quantization is not supported out-of-the-box. Writing new operators or layers is easily done via extending an interface; it is also possible to write custom loss functions. There were 3,484 commits to the official Github repository in the past six months, which is a comparably high value. On StackOverflow, there are 2,413 questions tagged with “PyTorch”, a rather high value compared to other frameworks.

SINGA is a DL framework and Apache project (incubating) which is developed by community contributors. The initiators of the project are from the National University of Singapore. It has APIs in C++ and Python. SINGA has native support for distributed, data-parallel and model-parallel training, as well as hybrid parallelism (combining data and model parallelism). Data parallelism is implemented via the centralized approach with support for multiple parameter servers. However, the decentralized architecture can be emulated by equipping each worker with a local parameter server. Both synchronous and asynchronous training are supported. There is no support for gradient quantization or compression. Customization is more difficult than in the other frameworks: The documentation does not contain any hints on how to implement custom layers or loss functions. There were 44 commits to the official Github repository in the past six months, which is a comparably low value. On StackOverflow, there are no questions tagged with “Singa” or “Apache Singa”, and only one single question is returned when searching for the keyword “Singa”.

TensorFlow is an ML/DL framework developed by Google and community contributors. The API is available for C++, Go, Java, JavaScript, Python and Swift. Additionally, the community offers bindings for C#, Haskell, Ruby, Rust and Scala. TensorFlow natively supports distributed and parallel training. In particular, it supports both model parallelism and data parallelism. In data parallelism, the centralized approach via parameter servers is supported, using either asynchronous or synchronous training. Gradient quantization is natively supported. Customization of layers and loss functions is straightforward via implementing the available interfaces. There were 10,930 commits to the official Github repository in the past six months, which is an extremely high value. On StackOverflow, there are 39,334 questions tagged with “TensorFlow”, which is the highest number among all analyzed DL frameworks.

Theano is a DL framework developed by Montreal Institute for Learning Algorithms at the Université de Montréal. The API is available only for Python. There is no support for distributed training on multiple nodes. However, using multiple GPUs on a single node is supported. Theano supports model parallelism, but no data parallelism. New layers can be implemented via an interface. It is also possible to define custom loss functions. At the time of writing this survey, commits to the official Github repository have a low frequency. According to a posting on the Theano mailing list (topic theano-users/7Poq8BZutbY), major development of Theano ceased with the release of version 1.0; however, new maintenance releases have been issued since then. There were still 55 commits to the official Github repository in the past six months. On StackOverflow, there are 2,389 questions tagged with “Theano”, a rather high value compared to other frameworks.

Others. There are a couple of other frameworks that we do not cover in detail in our comparison for various reasons. Minerva (wang2014minerva, ) is an open sourced DL system, but has not been maintained for the past 4 years. SparkNet (moritz2015sparknet, ) allows for distributed DL on Spark, but has not been maintained for the past 3 years. Neon (neon, ) is another DL framework that has ceased development for more than 1 year. Scikit-learn (pedregosa2011scikit, ) is an ML framework and it is not specific to DL. While neural network training is implemented, there is no support for using GPUs or distributed training. The Weka workbench (Frank2010, ) is a collection of ML and data mining algorithms. WekaDeeplearning4j (wekadl4j, ) is a DL package for the Weka workbench. As backend, it uses Deeplearning4j, which we have discussed above.

5. Conclusions and Outlook

DL is becoming increasingly important in industry and academia and is without doubt one of the most impactful revolutions in computer science in the past years. However, the rapid pace at which the field is developing makes it difficult to keep an overview. In particular, DL is currently investigated from many different perspectives and in different communities. In this survey, we took a deeper look into DL from the perspective of scalable distributed systems. We investigated the main challenges to make DL systems scale, and have reviewed the common techniques that have been proposed by researchers to tackle those challenges. This included an analysis of the distributed infrastructures used in DL training as well as techniques for parallelization, scheduling and data management. Finally, we provided an overview and comparison of the current open-source DL systems and tools, and analyzed which of the techniques developed in research have actually been implemented. We saw that a wide range of techniques for scalable DL is implemented in open-source DL frameworks. This shows that there is a fruitful interaction between research and practical applications, which is one of the reasons why DL has gained such a large momentum.

Looking into the future, we see a couple of trends and open research problems that will be important in the next years. While research on scalable DL was mostly focused on the parallelization and distribution aspects of DL training, there is a need to investigate other parts of the DL environment, such as data management and multi-tenant scheduling. This is a large field for research in the distributed systems and database community. In the following, we highlight a couple of research gaps. While resource elasticity is a well-established technique in other data analytics domains such as stream processing (DBLP:journals/corr/abs-1901-09716, ; DIASDEASSUNCAO20181, ) and graph processing (7484159, ), applying it to DL training is still a largely unexplored problem. Besides training, DL serving, i.e., providing and using trained DL models for inference, receives growing attention (201468, ; Gujarati:2017:SDA:3135974.3135993, ; 8360337, ). Although DL serving is closely related to DL training, the requirements and, hence, the solutions are totally different. Another important aspect of DL is privacy (Shokri:2015:PDL:2810103.2813687, ; Abadi:2016:DLD:2976749.2978318, ; LI201776, ), which receives growing attention due to an increasing awareness in the society for privacy issues in the era of Big Data, fueled by legislative reforms such as the General Data Protection Regulation (GDPR) in the European Union. There is an interesting trade-off between the ever-increasing demand for more training data to improve DL models and the principle of data avoidance and data economy to protect privacy.


  • (1) NVIDIA Collective Communications Library (NCCL). Last Accessed 03/2019.
  • (2) NVIDIA DGX Station. Last Accessed 02/2019.
  • (3) ONNX. Last Accessed 02/2019.
  • (4) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (Savannah, GA, 2016), USENIX Association, pp. 265–283.
  • (5) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2016), CCS ’16, ACM, pp. 308–318.
  • (6) Agarwal, A., Chapelle, O., Dudík, M., and Langford, J. A reliable effective terascale linear learning system. Journal of Machine Learning Research 15 (2014), 1111–1133.
  • (7) Akiba, T., Fukuda, K., and Suzuki, S. ChainerMN: Scalable Distributed Deep Learning Framework. In Proceedings of Workshop on ML Systems in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017).
  • (8) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 1709–1720.
  • (9) Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. A view of cloud computing. Commun. ACM 53, 4 (Apr. 2010), 50–58.
  • (10) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34, 6 (Nov 2017), 26–38.
  • (11) Baker, B., Gupta, O., Naik, N., and Raskar, R. Designing neural network architectures using reinforcement learning. International Conference on Learning Representations (2017).
  • (12) Barroso, L. A., and Hoelzle, U. The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines, 1st ed. Morgan and Claypool Publishers, 2009.
  • (13) Ben-Nun, T., and Hoefler, T. Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. CoRR abs/1802.09941 (2018).
  • (14) Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I., Bergeron, A., et al. Theano: Deep learning on gpus with python. In NIPS 2011, BigLearning Workshop, Granada, Spain (2011), vol. 3, Citeseer, pp. 1–48.
  • (15) Bergstra, J., and Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 1 (Feb. 2012), 281–305.
  • (16) Bhattacharjee, B., Boag, S., Doshi, C., Dube, P., Herta, B., Ishakian, V., Jayaram, K. R., Khalaf, R., Krishna, A., Li, Y. B., Muthusamy, V., Puri, R., Ren, Y., Rosenberg, F., Seelam, S. R., Wang, Y., Zhang, J. M., and Zhang, L. Ibm deep learning service. IBM Journal of Research and Development 61, 4/5 (July 2017), 10:1–10:11.
  • (17) Boag, S., Dube, P., Herta, B., Hummer, W., Ishakian, V., Jayaram, K. R., Kalantar, M., Muthusamy, V., Nagpurkar, P., and Rosenberg, F. Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs. In Workshop on ML Systems at NIPS’17 (2017).
  • (18) Borisyuk, F., Gordo, A., and Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (New York, NY, USA, 2018), KDD ’18, ACM, pp. 71–79.
  • (19) Bottou, L. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes. EC2 (1991), Citeseer.
  • (20) Boybat, I., Le Gallo, M., Nandakumar, S., Moraitis, T., Parnell, T., Tuma, T., Rajendran, B., Leblebici, Y., Sebastian, A., and Eleftheriou, E. Neuromorphic computing with multi-memristive synapses. Nature Communications 9, 1 (2018), 2514.
  • (21) Chang, C.-C., and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27. Software available at
  • (22) Chen, K., and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (March 2016), pp. 5880–5884.
  • (23) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015).
  • (24) Chen, X., and Lin, X. Big data deep learning: Challenges and perspectives. IEEE Access 2 (2014), 514–525.
  • (25) Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient primitives for deep learning. CoRR abs/1410.0759 (2014).
  • (26) Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, 2014), USENIX Association, pp. 571–582.
  • (27) Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. Solving the straggler problem with bounded staleness. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems (Santa Ana Pueblo, NM, 2013), HotOS 2013, USENIX.
  • (28) Cireşan, D., Meier, U., Masci, J., and Schmidhuber, J. Multi-column deep neural network for traffic sign classification. Neural Networks 32 (2012), 333–338.
  • (29) Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep, big, simple neural nets for handwritten digit recognition. Neural computation 22, 12 (2010), 3207–3220.
  • (30) Coates, A., Huval, B., Wang, T., Wu, D. J., Ng, A. Y., and Catanzaro, B. Deep learning with cots hpc systems. In Proceedings of the 30th International Conference on Machine Learning - Volume 28 (2013), ICML’13, pp. III–1337–III–1345.
  • (31) Apple CoreML. Last Accessed 03/2019.
  • (32) Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, 2017), USENIX Association, pp. 613–627.
  • (33) Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Exploiting bounded staleness to speed up big data analytics. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2014), USENIX ATC’14, USENIX Association, pp. 37–48.
  • (34) Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems (New York, NY, USA, 2016), EuroSys ’16, ACM, pp. 4:1–4:16.
  • (35) Dai, W., Kumar, A., Wei, J., Ho, Q., Gibson, G., and Xing, E. P. High-performance distributed ml at scale through parameter server consistency models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015), AAAI’15, AAAI Press, pp. 79–87.
  • (36) de Assunção, M. D., da Silva Veith, A., and Buyya, R. Distributed data stream processing and edge computing: A survey on resource elasticity and future directions. Journal of Network and Computer Applications 103 (2018), 1–17.
  • (37) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in neural information processing systems (2012), pp. 1223–1231.
  • (38) Dean, J., and Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107–113.
  • (39) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (June 2009), pp. 248–255.
  • (40) Deng, L. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing 3 (2014), e2.
  • (41) Deyringer, V., Fraser, A., Schmid, H., and Okita, T. Parallelization of neural network training for nlp with hogwild! The Prague Bulletin of Mathematical Linguistics 109, 1 (2017), 29–38.
  • (42) Dozat, T. Incorporating nesterov momentum into adam. In International Conference on Learning Representations 2016 Workshop (2016).
  • (43) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12 (July 2011), 2121–2159.
  • (44) Erickson, B. J., Korfiatis, P., Akkus, Z., Kline, T., and Philbrick, K. Toolkits and libraries for deep learning. Journal of Digital Imaging 30, 4 (Aug 2017), 400–405.
  • (45) Fischer, V., Koehler, J., and Pfeil, T. The streaming rollout of deep networks - towards fully model-parallel execution. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 4043–4054.
  • (46) Frank, E., Hall, M., Holmes, G., Kirkby, R., Pfahringer, B., Witten, I. H., and Trigg, L. Weka-A Machine Learning Workbench for Data Mining. Springer US, Boston, MA, 2010, pp. 1269–1277.
  • (47) Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., and Stoica, I. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI’11, USENIX Association, pp. 323–336.
  • (48) Ghosh, A., Kumar, H., and Sastry, P. S. Robust loss functions under label noise for deep neural networks, 2017.
  • (49) Gibiansky, A. Bringing HPC Techniques to Deep Learning. Last Accessed 11/2018.
  • (50) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems (2014), pp. 2672–2680.
  • (51) Grandl, R., Ananthanarayanan, G., Kandula, S., Rao, S., and Akella, A. Multi-resource packing for cluster schedulers. In Proceedings of the 2014 ACM Conference on SIGCOMM (New York, NY, USA, 2014), SIGCOMM ’14, ACM, pp. 455–466.
  • (52) Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (2014), pp. 1764–1772.
  • (53) Gujarati, A., Elnikety, S., He, Y., McKinley, K. S., and Brandenburg, B. B. Swayam: Distributed autoscaling to meet slas of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (New York, NY, USA, 2017), Middleware ’17, ACM, pp. 109–120.
  • (54) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (2015), ICML’15, pp. 1737–1746.
  • (55) Halevy, A., Norvig, P., and Pereira, F. The unreasonable effectiveness of data. IEEE Intelligent Systems 24, 2 (March 2009), 8–12.
  • (56) Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., and Gibbons, P. B. Pipedream: Fast and efficient pipeline parallel DNN training. CoRR abs/1806.03377 (2018).
  • (57) Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Ganger, G. R., and Gibbons, P. B. Pipedream: Pipeline parallelism for dnn training. In Conference on Systems and Machine Learning (2018), SysML ’18.
  • (58) Harlap, A., Tumanov, A., Chung, A., Ganger, G. R., and Gibbons, P. B. Proteus: Agile ml elasticity through tiered reliability in dynamic resource markets. In Proceedings of the Twelfth European Conference on Computer Systems (New York, NY, USA, 2017), EuroSys ’17, ACM, pp. 589–604.
  • (59) Hashemi, S. H., Jyothi, S. A., and Campbell, R. H. Tictac: Accelerating distributed deep learning with communication scheduling. In Conference on Systems and Machine Learning (2019), SysML ’19.
  • (60) Hauswald, J., Kang, Y., Laurenzano, M. A., Chen, Q., Li, C., Mudge, T., Dreslinski, R. G., Mars, J., and Tang, L. Djinn and tonic: Dnn as a service and its implications for future warehouse scale computers. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA) (June 2015), pp. 27–40.
  • (61) Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., and Wang, X. Applied machine learning at facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (Feb 2018), pp. 620–629.
  • (62) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR abs/1712.00409 (2017).
  • (63) Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI’11, USENIX Association, pp. 295–308.
  • (64) Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (Nov 2012), 82–97.
  • (65) Hinton, G. E., and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
  • (66) Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  • (67) Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. 51, 6 (Feb. 2019), 118:1–118:36.
  • (68) Huang, X., Baker, J., and Reddy, R. A historical perspective of speech recognition. Commun. ACM 57, 1 (Jan. 2014), 94–103.
  • (69) Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR abs/1811.06965 (2018).
  • (70) Huang, Y., Jin, T., Wu, Y., Cai, Z., Yan, X., Yang, F., Li, J., Guo, Y., and Cheng, J. Flexps: Flexible parallelism control in parameter server architecture. Proc. VLDB Endow. 11, 5 (Jan. 2018), 566–579.
  • (71) Hubel, D. H., and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160, 1 (1962), 106–154.
  • (72) Hutter, F., Hoos, H., and Leyton-Brown, K. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning - Volume 32 (2014), ICML’14, pp. I–754–I–762.
  • (73) Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. Firecaffe: Near-linear acceleration of deep neural network training on compute clusters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
  • (74) Iizuka, S., Simo-Serra, E., and Ishikawa, H. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 35, 4 (July 2016), 110:1–110:11.
  • (75) Ishakian, V., Muthusamy, V., and Slominski, A. Serving deep learning models in a serverless platform. In 2018 IEEE International Conference on Cloud Engineering (IC2E) (April 2018), pp. 257–262.
  • (76) Jeon, M., Venkataraman, S., Phanishayee, A., Qian, J., Xiao, W., and Yang, F. Analysis of large-scale multi-tenant gpu clusters for dnn training workloads. arXiv preprint arXiv:1901.05758 (2019).
  • (77) Jeong, E., Jeong, J. S., Kim, S., Yu, G.-I., and Chun, B.-G. Improving the expressiveness of deep learning frameworks with recursion. In Proceedings of the Thirteenth EuroSys Conference (New York, NY, USA, 2018), EuroSys ’18, ACM, pp. 19:1–19:13.
  • (78) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. CoRR abs/1408.5093 (2014).
  • (79) Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in accelerating convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018), J. Dy and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 2274–2283.
  • (80) Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. CoRR abs/1807.05358 (2018).
  • (81) Jiang, J., Cui, B., Zhang, C., and Yu, L. Heterogeneity-aware distributed parameter servers. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD ’17, ACM, pp. 463–478.
  • (82) Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association of Computational Linguistics 5, 1 (2017), 339–351.
  • (83) Johnson, R., and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (USA, 2013), NIPS’13, Curran Associates Inc., pp. 315–323.
  • (84) Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., Boyle, R., Cantin, P., Chao, C., Clark, C., Coriell, J., Daley, M., Dau, M., Dean, J., Gelb, B., Ghaemmaghami, T. V., Gottipati, R., Gulland, W., Hagmann, R., Ho, C. R., Hogberg, D., Hu, J., Hundt, R., Hurt, D., Ibarz, J., Jaffey, A., Jaworski, A., Kaplan, A., Khaitan, H., Killebrew, D., Koch, A., Kumar, N., Lacy, S., Laudon, J., Law, J., Le, D., Leary, C., Liu, Z., Lucke, K., Lundin, A., MacKean, G., Maggiore, A., Mahony, M., Miller, K., Nagarajan, R., Narayanaswami, R., Ni, R., Nix, K., Norrie, T., Omernick, M., Penukonda, N., Phelps, A., Ross, J., Ross, M., Salek, A., Samadiani, E., Severn, C., Sizikov, G., Snelham, M., Souter, J., Steinberg, D., Swing, A., Tan, M., Thorson, G., Tian, B., Toma, H., Tuttle, E., Vasudevan, V., Walter, R., Wang, W., Wilcox, E., and Yoon, D. H. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (June 2017), pp. 1–12.
  • (85) Karkus, P., Hsu, D., and Lee, W. S. Qmdp-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4694–4704.
  • (86) Kim, J. K., Ho, Q., Lee, S., Zheng, X., Dai, W., Gibson, G. A., and Xing, E. P. Strads: A distributed framework for scheduled model parallel machine learning. In Proceedings of the Eleventh European Conference on Computer Systems (New York, NY, USA, 2016), EuroSys ’16, ACM, pp. 5:1–5:16.
  • (87) Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (2015), ICLR ’15.
  • (88) Koliousis, A., Watcharapichat, P., Weidlich, M., Mai, L., Costa, P., and Pietzuch, P. R. CROSSBOW: scaling deep learning with small batch sizes on multi-gpu servers. CoRR abs/1901.02244 (2019).
  • (89) Konečný, J., McMahan, H. B., Yu, F. X., Richtarik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning (2016).
  • (90) Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014).
  • (91) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (USA, 2012), NIPS’12, Curran Associates Inc., pp. 1097–1105.
  • (92) Kumar, A., Boehm, M., and Yang, J. Data management in machine learning: Challenges, techniques, and systems. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD ’17, ACM, pp. 1717–1722.
  • (93) Data Management Tailored for Machine Learning Workloads in Kubernetes. Last Accessed 03/2019.
  • (94) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436.
  • (95) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (Nov 1998), 2278–2324.
  • (96) Lee, V. W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A. D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., and Dubey, P. Debunking the 100x gpu vs. cpu myth: An evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th Annual International Symposium on Computer Architecture (New York, NY, USA, 2010), ISCA ’10, ACM, pp. 451–460.
  • (97) Lee, Y. S. L., Weimer, M., Yang, Y., and Yu, G.-I. Dolphin: Runtime optimization for distributed machine learning. In Proceedings of ICML ML Systems Workshop (2016).
  • (98) Lenz, I., Lee, H., and Saxena, A. Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34, 4-5 (2015), 705–724.
  • (99) Li, H., Kadav, A., Kruus, E., and Ungureanu, C. Malt: Distributed data-parallelism for existing ml applications. In Proceedings of the Tenth European Conference on Computer Systems (New York, NY, USA, 2015), EuroSys ’15, ACM, pp. 3:1–3:16.
  • (100) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (Broomfield, CO, 2014), USENIX Association, pp. 583–598.
  • (101) Li, P., Li, J., Huang, Z., Li, T., Gao, C.-Z., Yiu, S.-M., and Chen, K. Multi-key privacy-preserving deep learning in cloud computing. Future Generation Computer Systems 74 (2017), 76–85.
  • (102) Li, T., Zhong, J., Liu, J., Wu, W., and Zhang, C. Towards multi-tenant resource sharing for machine learning workloads. Proc. VLDB Endow. 11, 5 (Jan. 2018), 607–620.
  • (103) Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems (2017), pp. 5330–5340.
  • (104) Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (2018).
  • (105) Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A., van Ginneken, B., and Sánchez, C. I. A survey on deep learning in medical image analysis. Medical Image Analysis 42 (2017), 60–88.
  • (106) Luo, G. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Network Modeling Analysis in Health Informatics and Bioinformatics 5, 1 (May 2016), 18.
  • (107) Mai, L., Hong, C., and Costa, P. Optimizing network performance in distributed machine learning. In 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15) (Santa Clara, CA, 2015), USENIX Association.
  • (108) Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2010), SIGMOD ’10, ACM, pp. 135–146.
  • (109) Masters, D., and Luschi, C. Revisiting small batch training for deep neural networks. CoRR abs/1804.07612 (2018).
  • (110) Mayer, R., Mayer, C., and Laich, L. The tensorflow partitioning and scheduling problem: It’s the critical path! In Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning (New York, NY, USA, 2017), DIDL ’17, ACM, pp. 1–6.
  • (111) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) (2017).
  • (112) Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M. J., Zadeh, R., Zaharia, M., and Talwalkar, A. Mllib: Machine learning in apache spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
  • (113) Miao, H., Li, A., Davis, L. S., and Deshpande, A. Towards unified data and lifecycle management for deep learning. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (April 2017), pp. 571–582.
  • (114) Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V., and Dean, J. A hierarchical model for device placement. In International Conference on Learning Representations (2018).
  • (115) Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (International Convention Centre, Sydney, Australia, 06–11 Aug 2017), D. Precup and Y. W. Teh, Eds., vol. 70 of Proceedings of Machine Learning Research, PMLR, pp. 2430–2439.
  • (116) Mnih, V., and Hinton, G. E. Learning to label aerial images from noisy data. In Proceedings of the 29th International conference on machine learning (ICML-12) (2012), pp. 567–574.
  • (117) Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. Sparknet: Training deep networks in spark. In Proceedings of International Conference on Learning Representations (2016).
  • (118) Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., and Stoica, I. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, 2018), USENIX Association, pp. 561–577.
  • (119) MXNet. Last Accessed 03/2019.
  • (120) Nair, V., and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (USA, 2010), ICML’10, Omnipress, pp. 807–814.
  • (121) Narayanan, D., Santhanam, K., Phanishayee, A., and Zaharia, M. Accelerating deep learning workloads through efficient multi-model execution. In NIPS Workshop on Systems for Machine Learning (December 2018).
  • (122) Tencent ncnn. Last Accessed 03/2019.
  • (123) Neon. Last Accessed 10/2018.
  • (124) Nilsson, A., Smith, S., Ulm, G., Gustavsson, E., and Jirstrand, M. A performance evaluation of federated learning algorithms. In Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning (New York, NY, USA, 2018), DIDL ’18, ACM, pp. 1–8.
  • (125) Nilsson, N. The quest for artificial intelligence: A history of ideas and achievements. Cambridge University Press, 01 2010.
  • (126) Nishihara, R., Moritz, P., Wang, S., Tumanov, A., Paul, W., Schleier-Smith, J., Liaw, R., Niknami, M., Jordan, M. I., and Stoica, I. Real-time machine learning: The missing pieces. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (New York, NY, USA, 2017), HotOS ’17, ACM, pp. 106–110.
  • (127) Noel, C., and Osindero, S. Dogwild!-distributed hogwild for cpu & gpu. In NIPS Workshop on Distributed Machine Learning and Matrix Computations (2014).
  • (128) Ooi, B. C., Tan, K.-L., Wang, S., Wang, W., Cai, Q., Chen, G., Gao, J., Luo, Z., Tung, A. K., Wang, Y., Xie, Z., Zhang, M., and Zheng, K. Singa: A distributed deep learning platform. In Proceedings of the 23rd ACM International Conference on Multimedia (New York, NY, USA, 2015), MM ’15, ACM, pp. 685–688.
  • (129) Ovtcharov, K., Ruwase, O., Kim, J.-Y., Fowers, J., Strauss, K., and Chung, E. Accelerating deep convolutional neural networks using specialized hardware, February 2015.
  • (130) Ozeri, O., Ofer, E., and Kat, R. Object storage for deep learning frameworks. In Proceedings of the Second Workshop on Distributed Infrastructures for Deep Learning (New York, NY, USA, 2018), DIDL ’18, ACM, pp. 21–24.
  • (131) PaddlePaddle. Last Accessed 03/2019.
  • (132) Park, J. H., Kim, S., Lee, J., Jeon, M., and Noh, S. H. Accelerated training for cnn distributed deep learning through automatic resource-aware layer placement. arXiv preprint arXiv:1901.05803 (2019).
  • (133) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W (2017).
  • (134) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. Journal of machine learning research 12, Oct (2011), 2825–2830.
  • (135) Peng, Y., Bao, Y., Chen, Y., Wu, C., and Guo, C. Optimus: An efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (New York, NY, USA, 2018), EuroSys ’18, ACM, pp. 3:1–3:14.
  • (136) Pinto, C., Gkoufas, Y., Reale, A., Seelam, S., and Eliuk, S. Hoard: A distributed data caching system to accelerate deep learning training on the cloud. CoRR abs/1812.00669 (2018).
  • (137) Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M. P., Shyu, M.-L., Chen, S.-C., and Iyengar, S. S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 51, 5 (Sept. 2018), 92:1–92:36.
  • (138) Pundir, M., Kumar, M., Leslie, L. M., Gupta, I., and Campbell, R. H. Supporting on-demand elasticity in distributed graph processing. In 2016 IEEE International Conference on Cloud Engineering (IC2E) (April 2016), pp. 12–21.
  • (139) Qiao, A., Aghayev, A., Yu, W., Chen, H., Ho, Q., Gibson, G. A., and Xing, E. P. Litz: Elastic framework for high-performance distributed machine learning. In 2018 USENIX Annual Technical Conference (USENIX ATC 18) (Boston, MA, 2018), USENIX Association, pp. 631–644.
  • (140) Rasley, J., He, Y., Yan, F., Ruwase, O., and Fonseca, R. Hyperdrive: Exploring hyperparameters with pop scheduling. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (New York, NY, USA, 2017), Middleware ’17, ACM, pp. 1–13.
  • (141) Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems (2011), pp. 693–701.
  • (142) Röger, H., and Mayer, R. A comprehensive survey on parallelization and elasticity in stream processing. CoRR abs/1901.09716 (2019).
  • (143) Roy, P., Song, S. L., Krishnamoorthy, S., Vishnu, A., Sengupta, D., and Liu, X. Numa-caffe: Numa-aware deep learning neural networks. ACM Trans. Archit. Code Optim. 15, 2 (June 2018), 24:1–24:26.
  • (144) Ruder, S. An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016).
  • (145) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533.
  • (146) Sa, C. D., Leszczynski, M., Zhang, J., Marzoev, A., Aberger, C. R., Olukotun, K., and Ré, C. High-accuracy low-precision training. CoRR abs/1803.03383 (2018).
  • (147) SAS. Last Accessed 03/2019.
  • (148) Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117.
  • (149) Searle, J. R. Minds, brains, and programs. Behavioral and brain sciences 3, 3 (1980), 417–424.
  • (150) Seide, F., and Agarwal, A. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2016), KDD ’16, ACM, pp. 2135–2135.
  • (151) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech dnns. In Interspeech 2014 (September 2014).
  • (152) Sergeev, A., and Balso, M. D. Horovod: fast and easy distributed deep learning in tensorflow. CoRR abs/1802.05799 (2018).
  • (153) Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems (2018), pp. 10435–10444.
  • (154) Shokri, R., and Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2015), CCS ’15, ACM, pp. 1310–1321.
  • (155) Smola, A. What is the Parameter Server? Last Accessed 02/2019.
  • (156) Smola, A., and Narayanamurthy, S. An architecture for parallel topic models. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 703–710.
  • (157) Socher, R., Chen, D., Manning, C. D., and Ng, A. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 926–934.
  • (158) Sparks, E. R., Talwalkar, A., Haas, D., Franklin, M. J., Jordan, M. I., and Kraska, T. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing (New York, NY, USA, 2015), SoCC ’15, ACM, pp. 368–380.
  • (159) Strom, N. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association (2015).
  • (160) Sukhbaatar, S., and Fergus, R. Learning from noisy labels with deep neural networks. CoRR abs/1406.2080 (2014).
  • (161) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
  • (162) Tao, Z., and Li, Q. esgd: Communication efficient distributed deep learning on the edge. In USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18) (Boston, MA, 2018), USENIX Association.
  • (163) Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015).
  • (164) Valiant, L. G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
  • (165) Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011 (2011).
  • (166) Vartak, M., Subramanyam, H., Lee, W.-E., Viswanathan, S., Husnoo, S., Madden, S., and Zaharia, M. Modeldb: A system for machine learning model management. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (New York, NY, USA, 2016), HILDA ’16, ACM, pp. 14:1–14:3.
  • (167) Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., and Baldeschwieler, E. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (New York, NY, USA, 2013), SOCC ’13, ACM, pp. 5:1–5:16.
  • (168) Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., and Wilkes, J. Large-scale cluster management at google with borg. In Proceedings of the Tenth European Conference on Computer Systems (New York, NY, USA, 2015), EuroSys ’15, ACM, pp. 18:1–18:17.
  • (169) Vishnu, A., Siegel, C., and Daily, J. Distributed tensorflow with MPI. CoRR abs/1603.02339 (2016).
  • (170) Wang, M., Xiao, T., Li, J., Zhang, J., Hong, C., and Zhang, Z. Minerva: A scalable and highly efficient training platform for deep learning. In NIPS Workshop on Distributed Machine Learning and Matrix Computations (2014).
  • (171) Wang, M., Zhou, H., Guo, M., and Zhang, Z. A scalable and topology configurable protocol for distributed parameter synchronization. In Proceedings of 5th Asia-Pacific Workshop on Systems (New York, NY, USA, 2014), APSys ’14, ACM, pp. 13:1–13:7.
  • (172) Wang, S., Chen, W., Pi, A., and Zhou, X. Aggressive synchronization with partial processing for iterative ml jobs on clusters. In Proceedings of the 19th International Middleware Conference (New York, NY, USA, 2018), Middleware ’18, ACM, pp. 253–265.
  • (173) Wang, W., Zhang, M., Chen, G., Jagadish, H. V., Ooi, B. C., and Tan, K.-L. Database meets deep learning: Challenges and opportunities. SIGMOD Rec. 45, 2 (Sept. 2016), 17–22.
  • (174) Watcharapichat, P., Morales, V. L., Fernandez, R. C., and Pietzuch, P. Ako: Decentralised deep learning with partial gradient exchange. In Proceedings of the Seventh ACM Symposium on Cloud Computing (New York, NY, USA, 2016), SoCC ’16, ACM, pp. 84–97.
  • (175) Wei, J., Dai, W., Qiao, A., Ho, Q., Cui, H., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Managed communication and consistency for fast data-parallel iterative analytics. In Proceedings of the Sixth ACM Symposium on Cloud Computing (New York, NY, USA, 2015), SoCC ’15, ACM, pp. 381–394.
  • (176) Weimer, M., Chen, Y., Chun, B.-G., Condie, T., Curino, C., Douglas, C., Lee, Y., Majestro, T., Malkhi, D., Matusevych, S., Myers, B., Narayanamurthy, S., Ramakrishnan, R., Rao, S., Sears, R., Sezgin, B., and Wang, J. Reef: Retainable evaluator execution framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2015), SIGMOD ’15, ACM, pp. 1343–1355.
  • (177) wekaDeeplearning4j. Last Accessed 03/2019.
  • (178) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems (2017), pp. 1509–1519.
  • (179) Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016).
  • (180) Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A comprehensive survey on graph neural networks. CoRR abs/1901.00596 (2019).
  • (181) XGBoost. Last Accessed 03/2019.
  • (182) Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015).
  • (183) Xiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., Yang, F., and Zhou, L. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, 2018), USENIX Association, pp. 595–610.
  • (184) Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1, 2 (June 2015), 49–67.
  • (185) Xu, J., Zhang, Z., Friedman, T., Liang, Y., and Van den Broeck, G. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018), J. Dy and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 5502–5511.
  • (186) Yan, F., Ruwase, O., He, Y., and Chilimbi, T. Performance modeling and scalability optimization of distributed deep learning systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2015), KDD ’15, ACM, pp. 1355–1364.
  • (187) Young, T., Hazarika, D., Poria, S., and Cambria, E. Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine 13, 3 (Aug 2018), 55–75.
  • (188) Yu, Y., Abadi, M., Barham, P., Brevdo, E., Burrows, M., Davis, A., Dean, J., Ghemawat, S., Harley, T., Hawkins, P., Isard, M., Kudlur, M., Monga, R., Murray, D., and Zheng, X. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference (New York, NY, USA, 2018), EuroSys ’18, ACM, pp. 18:1–18:15.
  • (189) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12) (San Jose, CA, 2012), USENIX, pp. 15–28.
  • (190) Zhang, H., Hsieh, C., and Akella, V. Hogwild++: A new mechanism for decentralized asynchronous stochastic gradient descent. In 2016 IEEE 16th International Conference on Data Mining (ICDM) (Dec 2016), pp. 629–638.
  • (191) Zhang, H., Stafman, L., Or, A., and Freedman, M. J. Slaq: Quality-driven scheduling for distributed machine learning. In Proceedings of the 2017 Symposium on Cloud Computing (New York, NY, USA, 2017), SoCC ’17, ACM, pp. 390–404.
  • (192) Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E. P. Poseidon: An efficient communication architecture for distributed deep learning on gpu clusters. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (Berkeley, CA, USA, 2017), USENIX ATC ’17, USENIX Association, pp. 181–193.
  • (193) Zhao, H., and Canny, J. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In Proceedings of the 2013 SIAM International Conference on Data Mining (2013), pp. 785–793.
  • (194) Zheng, S., Song, Y., Leung, T., and Goodfellow, I. Improving the robustness of deep neural networks via stability training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016).
  • (195) Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., and Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016).
  • (196) Zoph, B., and Le, Q. V. Neural architecture search with reinforcement learning. International Conference on Learning Representations (2017).
  • (197) Zou, Y., Jin, X., Li, Y., Guo, Z., Wang, E., and Xiao, B. Mariana: Tencent deep learning platform and its applications. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1772–1777.