Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

10/06/2020 ∙ by Bita Hasheminezhad, et al.

Although recent scaling-up approaches to training deep neural networks have proven effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning frameworks to utilize scaling-out techniques. Parallelization approaches and distribution requirements are not considered in the primary designs of most available distributed deep learning frameworks, and most of them are still unable to perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx presents a productivity-oriented frontend where user Python code is translated to a futurized execution tree that can be executed efficiently on multiple nodes using the C++ standard library for parallelism and concurrency (HPX), leveraging fine-grained threading and an active messaging task-based runtime system.


1 Introduction

The recent availability of large-scale datasets and ample computational power has triggered advances in a wide range of Deep Learning (DL) applications. Training a Deep Neural Network (DNN) is an iterative, time-consuming process, and the pace of advances slows unless a reasonable training time is maintained. As such, a DL framework must be capable of training a DNN on multiple nodes (i.e., be distributed) and of efficiently utilizing the resources on each node.

Most existing DL frameworks were not primarily designed to support the major parallelization approaches or to operate in distributed environments. In many cases, training a DNN is not treated as a High Performance Computing (HPC) problem; small models that fit on a single node are developed and then scaled as needed. Thus, considerable effort is required to make those DL frameworks compatible with the requirements of efficient scaling out, most notably a distributed address space.

Using HPC capabilities to train DNNs is extremely advantageous. To efficiently utilize resources, the DL framework must support overlapping communication and computation. There are HPC frameworks that are designed to overlap communication with useful computation by adopting fine-grained parallelism. Message-driven communication, adaptive grain-size optimization, and moving work to data are other HPC techniques that can be employed to improve performance in the DL arena [5, 65, 68].

In this paper, we make the following contributions: 1) we revisit the requirements for scaling the training of DNNs (Section 2), 2) we evaluate the existing distributed DL frameworks based on these requirements (Section 3), and 3) we introduce Phylanx as a distributed framework for DL applications, discuss some of its design aspects, and show preliminary scaling results (Section 4).

2 Requirements for Scalable Training of DNNs

There is a consensus that DL frameworks must strive to excel at two fundamental classes of characteristics: scalability and ease of use. Many frameworks attempt to put performance and usability at the center of their design [64, 70]. Some suggest that fine-grained parallelism is the key to achieving high performance [19, 44]. In this section, we elaborate on the requirements for such a DL framework.

2.1 Scale up and Scale out

To acquire a reasonable execution time for training modern models on large-scale datasets, it is necessary to exploit every chance of scaling up (scaling on a single node) and scaling out (scaling on multiple nodes) the training process. A single node, even the most powerful one, can never satisfy the memory requirements of contemporary DNNs. In the following, we go through common parallelization techniques usually applied to divide work across workers (compute resources). After that, we lay out the principles that enable as much resource utilization as possible to improve the parallel efficiency of the training process, and thus decrease the overall time required to train a DNN.

Deep Learning Parallelization Approaches

There are three major approaches to DNN parallelization: Data Parallelism, Model Parallelism, and Pipelining. Hybrid Parallelism is any combination of these approaches.

Data Parallelism maintains an identical copy of the entire DNN model in addition to a partition of samples within a minibatch of data on each worker. DNN training is comprised of 3 stages: the forward pass, to produce activations and loss using current parameters and labels for the given samples, the backward pass, to compute gradients using the loss and parameters generated by the forward pass, and the solver, to determine the update rules to update the parameters. In the data parallelism approach, every computational resource independently computes the first two stages. The solver uses a collective communication operation for updating the parameters of the network at the end of each minibatch of data to keep model replicas consistent. Data parallelism is the most prevalent approach among DL frameworks to scale out the DNN training.
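The three stages and the all_reduce-style update can be sketched with NumPy on a toy linear model; the worker shards and the averaging step below are a single-process simulation, not the API of any framework discussed here.

```python
import numpy as np

# Toy linear model y = X @ w trained with synchronous data parallelism.
# Each "worker" holds a replica of w and a shard of the minibatch; after
# the backward pass, gradients are averaged (the all_reduce step) so all
# replicas stay consistent.

def local_gradient(w, X, y):
    # MSE loss gradient for this worker's shard (forward + backward pass)
    pred = X @ w
    return 2.0 * X.T @ (pred - y) / len(y)

def data_parallel_step(w, X, y, n_workers, lr=0.1):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]  # independent workers
    g = np.mean(grads, axis=0)   # all_reduce: average gradients across replicas
    return w - lr * g            # solver: identical update on every replica

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w = np.zeros(3)
for _ in range(200):
    w = data_parallel_step(w, X, y, n_workers=4)
print(np.allclose(w, w_true, atol=1e-3))  # replicas converge to the true weights
```

Because the shards have equal sizes, the averaged shard gradients equal the full-batch gradient, which is exactly why the replicas stay consistent.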

Model Parallelism splits the DNN within an iteration into disjoint or overlapping subsets and trains each subset on one worker. Model parallelism does not have a unanimous definition in the DL community: some define it as splitting the DNN along the model parameters such that each computational resource holds only a portion of them [44], while others specify it as splitting channels or spatial dimensions [9]. In this paper, we define model parallelism as intra-iteration parallelism on any of the tensor dimensions except for the sample dimension. With this broad definition, spatial parallelism is a subset of model parallelism even though it requires all parameters on every computational resource. Model parallelism might incur extra communication; for example, disjoint spatial parallelism requires halo exchanges in the forward and backward passes [18]. Model parallelism is crucial for training novel, wide and deep models. As effectively splitting a DNN is non-trivial, model parallelism can impose considerable communication overhead if not designed properly.

Pipelining is a cross-iteration parallelism approach that assigns each consecutive set of layers in the model to a worker. In very deep neural networks, pipelining the work between workers can be beneficial. The data transferred between workers consists of activations and gradients. Pipelining is susceptible to under-utilization of resources: in a naïve implementation, only one worker is active at a time due to the sequential dependencies of DNNs. The currently active worker performs one of the computationally intensive stages of training on a minibatch, i.e. the forward or backward pass, while the next worker waits for the calculated activations or gradients. Interleaving microbatch computations and memorizing the results of calculations have been proposed to parallelize this process [38, 26].
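The under-utilization argument can be made quantitative with a toy schedule model; the unit-cost stages and the resulting bubble formula below are illustrative assumptions, not measurements of any framework.

```python
# Compares idle time ("bubble") in a naive pipeline vs. microbatching.
# With K stages and M microbatches of unit cost, the forward sweep takes
# K + M - 1 steps instead of K * M, so each stage is idle for a fraction
# (K - 1) / (K + M - 1) of the schedule.

def bubble_fraction(n_stages, n_microbatches):
    total_steps = n_stages + n_microbatches - 1   # pipelined schedule length
    busy_steps = n_microbatches                   # steps a stage does useful work
    return (total_steps - busy_steps) / total_steps

# Naive pipelining is the M = 1 case: only one worker active at a time.
print(bubble_fraction(4, 1))   # 0.75 -> 3 of 4 workers idle on average
print(bubble_fraction(4, 8))   # ~0.27 -> microbatches fill most of the bubble
```

Increasing the number of microbatches drives the bubble fraction toward zero, which is the essence of the interleaving approach cited above.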

A pure data parallelization approach is not sufficient for DNN scaling. It is limited by the minibatch size and by the model's memory consumption. Although a larger minibatch size means a larger ratio of computation to communication, the minibatch size cannot grow arbitrarily: memory constraints and accuracy degradation, which slows convergence and diminishes generalization [23, 53], impede this growth. Besides, the entire model must fit in memory even for training on a single sample, which is impossible for large models. Large-scale DNN training therefore requires hybrid parallelism to scale. The hybrid approach can combine different strategies for different layers of the network, e.g. data parallelism for layers with few parameters (mainly convolutional layers) and model parallelism for layers that are dense in parameters (fully-connected layers) [54]. A combination of different approaches within each layer of the network may be desired to achieve the necessary granularity [18, 6]. Finally, a hybrid approach can use distributed storage for model parameters, activations, and gradients at the expense of additional communication [67].
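A back-of-the-envelope model shows why the minibatch size governs the computation-to-communication ratio in data parallelism; the constants below are purely illustrative.

```python
# Per step, computation grows with the local batch size while the gradient
# exchange stays the size of the model, so the computation-to-communication
# ratio scales linearly with the batch size.

def compute_to_comm_ratio(local_batch, n_params, flops_per_sample):
    compute = local_batch * flops_per_sample   # forward + backward work
    comm = n_params                            # gradient volume per all_reduce
    return compute / comm

for b in (32, 256):
    print(b, compute_to_comm_ratio(b, n_params=10**6, flops_per_sample=10**5))
```

The ratio improves with batch size, but as the text notes, memory limits and accuracy degradation cap how far this lever can be pushed.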

Optimal Resource Utilization

The scalability of a system can be viewed as its ability to efficiently utilize an increasing amount of resources while still satisfying its functionality requirements. In order to scale, a distributed implementation of DNN training must support overlapping communication with computation while maintaining the desired training accuracy. Although training a DNN on large datasets is a computation- and communication-intensive process similar to HPC applications [9], many implementations of parallel paradigms, especially data parallelism, in current DL frameworks have adopted inefficient global barriers that impose unnecessary overheads. Additionally, most existing frameworks are not optimized for CPU clusters, even though communication optimization and system resiliency are well established on such clusters. In this subsection, we present three specifications for a system with optimal resource utilization: the desired framework must use asynchronous collectives, run on a fine-grained execution platform, and integrate its communication model into the internal implementation of its DL components.

The desired scalable DNN framework must utilize asynchronous collectives instead of an asynchronous solver. Updates with Asynchronous Stochastic Gradient Descent (ASGD) were popularized by DistBelief [17] and its successor, TensorFlow [1], to mitigate the straggler effect, i.e. computational resources having to wait for the slowest machine to complete a phase of computation. The ASGD solver has low statistical efficiency because the parameters on each parameter server are updated by every worker of that server as soon as a gradient is ready. As such, one worker might still be completing an iteration while other workers are already updating for the next one. To avoid staleness of gradient updates, ASGD imposes a small learning rate, which causes slow convergence and scaling issues [95, 14]. That could be the reason that TensorFlow only supports synchronous updates in its distributed data parallelism strategy, MultiWorkerMirroredStrategy (https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy), and in its distributed pipelining strategy, GPipe [38]. Conversely, synchronous updates with asynchronous collectives allow the workers to fill their idle time with useful work.

To hide latencies using asynchronous collectives, a few conditions must be met: fine-grained but not too small units of work, short context-switching times, and greatly reduced synchronization overheads. It is common knowledge in the DL community that a fine-grained execution flow has a higher chance of better scalability. Most DL frameworks support partitioning tensors in the batch dimension; some also support partitioning in other dimensions or a combination of operation dimensions. The desired scalable framework must support partitioning beyond the operation dimensions to be able to efficiently utilize the resources [45]. In practice, tensor partitioning has non-zero overhead and cannot be overused. Also, there is a high variation in gradient sizes, and some of them are too small, far from the optimal size for communication [75]. Thus, the desired framework must, when possible, combine small tensors before performing the collective operation [7]. Adopting Asynchronous Many-Task (AMT) runtime systems [55] is a plausible way to provide fine-grained parallelism beyond hybrid parallelization. Potential AMTs providing distributed computing capabilities are Uintah [22], Chapel [13], Charm++ [52], Kokkos [21], Legion [8], PaRSEC [10], and HPX [48], which are compared in [81].
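The idea of combining small tensors before a collective can be sketched as a simple greedy bucketing pass; the byte threshold and the gradient sizes below are hypothetical.

```python
# Sketch of tensor fusion: gradients vary widely in size, and launching a
# collective per tiny tensor wastes latency, so small tensors are packed
# into fixed-size buckets and each bucket is reduced with one operation.

def fuse_into_buckets(sizes, bucket_bytes):
    buckets, current, used = [], [], 0
    for i, s in enumerate(sizes):
        if current and used + s > bucket_bytes:
            buckets.append(current)       # bucket is full: flush it
            current, used = [], 0
        current.append(i)                 # tensor i joins the open bucket
        used += s
    if current:
        buckets.append(current)
    return buckets

grad_sizes = [40, 8, 8, 120, 16, 16, 16, 200]    # bytes per gradient tensor
buckets = fuse_into_buckets(grad_sizes, bucket_bytes=128)
print(buckets)       # tensors grouped so each bucket fits one collective
print(len(buckets))  # 8 separate all_reduce calls collapse to 4
```

Real implementations add details (ordering by expected readiness, a timeout to flush partially filled buckets), but the latency saving comes from exactly this grouping.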

To exploit every opportunity for optimization, the desired scalable DNN framework must be a unified system with communication integrated into the internal implementations of its DL components. The desired distributed DL framework must contain the functionality to efficiently read the data, preprocess it, and create the DNN model to train its parameters while employing communication between the nodes. It is highly inefficient to run the parts of this workload separately. For instance, ByteScheduler [65] is a generic communication scheduler that relies on a dependency-proxy mechanism with non-zero overhead to interact with the DL frameworks' execution engines. ByteScheduler improves performance by partitioning and rearranging tensor transmissions, but since it cannot modify the source code of the DL framework it works on, it must use the DL framework's Directed Acyclic Graph (DAG) to create a more refined DAG using different kinds of proxies. HyPar-Flow [6], discussed in Section 3.1.3, is another example of a non-unified system.

2.2 Easy to Use API

A desired scalable DL platform should have a simple interface that highly abstracts the DL components while being easy to debug. Python has become the de-facto language of the ML/DL community as open-source libraries like NumPy [90], SciPy [86], Pandas [62], and Matplotlib [40] keep up with the demand for high-performance numerical analysis and visualization. Most DL frameworks provide a Python API as it is coherent, simple, and readable. Not all of these DL frameworks succeed in presenting a highly abstract or easy-to-debug API; e.g., newer versions of TensorFlow have adopted Keras [15] as their default interface. Since debuggability is essential for developing novel models, the DL community is more receptive to frameworks with imperative paradigms than to declarative ones. As such, TensorFlow switched to eager execution as of v2.0.

Besides being intuitive and debuggable, a desired scalable DL framework should offer a non-intrusive transition to its parallel and distributed features, without requiring the user to add architecture- or platform-specific code or configuration options. A user should not need to modify the DNN model to run it with or without accelerators, and should not have to add extra code for setting up parameter servers. The user must still decide how many nodes and/or cores to use for data and/or model parallelization and/or pipelining, but they need not be aware of the cluster topology to train their DNNs. Therefore, being architecture- and platform-agnostic is an important characteristic of an easy to use API. Runtime-adaptive capabilities that enable automatic optimizations, such as finding optimal grain sizes during parallelization or the best possible data distributions for the evaluated expressions, are an equally important feature.

3 Current Distributed Deep Learning Frameworks

In this section, we scrutinize existing distributed DL frameworks, focusing on those that are recent and popular among users. We did not include pure communication schedulers [65, 7, 28] or frameworks that rely on new hardware [37, 36]. We evaluate these DL frameworks based on the requirements described in Section 2 and summarize our observations in Table 1, where the columns represent: Data Parallelism, Model Parallelism/Pipelining, Communication overlaps Computation, Sufficient Granularity, Unified, Easy to Use, Architecture Agnostic, License, and Reference. Note that Sufficient Granularity represents whether the DL framework can utilize sufficiently fine-grained parallelism. To adequately control the amount of work per computation (the grain size), a DL framework may utilize a combination of parallelization approaches or novel solutions. We believe parallelism can and should be applied at the granularity of an individual operation, not just of whole layers [44].

3.1 TensorFlow

Although TensorFlow [2] has always natively supported distributed training, data parallelism, and model partitioning (as of v0.8), older versions of TensorFlow require the user to determine the placement of each operation on devices in order to run on multiple nodes. Communication scheduling is not supported in TensorFlow out-of-the-box. TensorFlow has a centralized architecture in which the number of workers cannot grow arbitrarily [25].

TensorFlow also has extensions that support different parallelization approaches. As of v2.2, MultiWorkerMirroredStrategy is integrated into TensorFlow for data parallelism; its update rule is synchronous, and it overlaps communication with computation. Google, the developer of TensorFlow, has built Mesh TensorFlow [72] and GPipe [38] on top of TensorFlow to support model parallelism and pipelining. Many other frameworks are also built on top of TensorFlow; here, we additionally introduce HyPar-Flow, an implementation of hybrid parallelism for DNN training with a Keras-like interface.

3.1.1 Mesh TensorFlow

Mesh TensorFlow (MTF) is a language for laying down a general class of tensor computations; it requires the cluster of processors to be identical and reliable to perform optimally [72]. By manually specifying a layout, the user can parallelize over any of the tensor dimensions. MTF treats data parallelism as splitting over the batch dimension and enables the user to experiment with any kind of intra-iteration parallelism. An MTF graph compiles into an SPMD program that depends on MPI communication techniques such as all_reduce. These all_reduce-like operations introduce high communication overheads [38].

Utilizing MTF is not straightforward. The user must create the layout and ensure a reasonable chunk size for each tensor based on the cluster topology (the mesh of processors). This layout design is also restricted by MTF's splitting rules, e.g. two dimensions of a tensor are not allowed to be split across the same mesh dimension. Some available features of MTF are only tested on accelerators, in particular Tensor Processing Units (TPUs). In some cases, although GPUs are detected, most computations are run on CPUs. MTF is not compatible with the newer versions of TensorFlow (eager execution).
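The layout restriction can be illustrated with a small helper that derives the per-worker chunk shape from a layout mapping tensor dimensions to mesh dimensions; the dimension names and the validation are a sketch of the rule, not MTF's actual API.

```python
# A layout maps tensor dimensions to mesh dimensions; the chunk shape
# follows by dividing each split tensor dimension by the size of its mesh
# dimension. The rule that two tensor dimensions may not split across the
# same mesh dimension is checked explicitly.

def chunk_shape(tensor_dims, mesh_dims, layout):
    # layout: {tensor_dim_name: mesh_dim_name}
    if len(set(layout.values())) != len(layout):
        raise ValueError("two tensor dims split across the same mesh dim")
    return tuple(
        size // mesh_dims[layout[name]] if name in layout else size
        for name, size in tensor_dims.items()
    )

tensor = {"batch": 512, "hidden": 1024}
mesh = {"rows": 4, "cols": 2}
print(chunk_shape(tensor, mesh, {"batch": "rows", "hidden": "cols"}))  # (128, 512)
```

Mapping both tensor dimensions to "rows" would violate the splitting rule and raises an error in this sketch.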

3.1.2 GPipe

GPipe is a pipeline-parallelism library implemented under the Lingvo [73] framework, a TensorFlow framework focusing on sequence-to-sequence models. GPipe partitions operations in the forward and backward passes and allows data transfer between neighboring partitions, resulting in a lower bubble overhead compared to naive pipeline parallelization. GPipe does not memorize the activations of layers. As such, it can scale to large models as the number of accelerators increases, but it needs to re-compute the activations of layers for the backward pass. In other words, GPipe is a scalable solution that attempts to achieve higher utilization in a pipeline, although a portion of this utilization is spent on re-computation. GPipe can be combined with data parallelism for better scaling.
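The recomputation trade-off can be summarized with toy accounting, counting unit memory per stored activation and unit compute per layer forward; these constants are illustrative, not GPipe's real costs.

```python
# Without recomputation, every layer's activation is kept for the backward
# pass. With recomputation, only partition-boundary activations are kept
# and each partition reruns its forward pass during backward, roughly
# doubling forward compute.

def pipeline_cost(n_layers, n_partitions, recompute):
    if recompute:
        mem = n_partitions        # one boundary activation per partition
        fwd_flops = 2 * n_layers  # forward pass effectively runs twice
    else:
        mem = n_layers            # every layer's activation is stored
        fwd_flops = n_layers
    return mem, fwd_flops

print(pipeline_cost(64, 8, recompute=False))  # (64, 64)
print(pipeline_cost(64, 8, recompute=True))   # (8, 128)
```

The memory drops from O(L) to O(K) for K partitions, which is what lets GPipe grow the model with the number of accelerators at the cost of extra forward compute.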

Few examples of GPipe are available, and it is unknown whether GPipe is compatible with eager TensorFlow. GPipe works only on accelerators and assumes that a single layer fits in the memory of one accelerator.

3.1.3 HyPar-Flow

HyPar-Flow is an implementation of data, model, and hybrid (a combination of data and model) parallelization on eager TensorFlow, using MPI for communication and Keras as the interface [6]. HyPar-Flow uses Horovod (Section 3.4) for pure data parallelism. It implements hybrid parallelism by generating a distributed representation of the Keras code and an MPI communicator for each model partition to orchestrate the overlap of communication and computation. HyPar-Flow is a non-unified framework: it treats TensorFlow as a separate, unmodifiable framework. Since TensorFlow provides no access to the gradients of other layers, HyPar-Flow has to add a layer-like structure before each TensorFlow layer.

HyPar-Flow only requires the user to specify the strategy, the number of model partitions, and the number of model replicas in order to exploit every possible intra-iteration parallelization. Its debuggability depends on its backend, TensorFlow. HyPar-Flow is platform-agnostic and is also optimized for many-core CPU clusters.

3.2 Caffe

Berkeley AI Research created the Convolutional Architecture for Fast Feature Embedding (Caffe) [43] DL framework, which does not support distributed training out-of-the-box. Caffe is a define-and-run (declarative) framework that has three blocking phases for training a DNN. There are many extensions of Caffe that support centralized or decentralized distributed training; each focuses on a specific platform or communication library. FireCaffe [41] and MPI-Caffe [57] use MPI to deploy data and model parallelism, respectively, on multi-GPU clusters. FeCaffe [30] and Caffe Barista [85] are FPGA specializations of Caffe for Convolutional Neural Networks (CNNs). Intel-Caffe (https://github.com/intel/caffe) supports data-parallel training on CPU-based clusters. NVIDIA-Caffe (https://ngc.nvidia.com/catalog/containers/nvidia:caffe) is not a distributed framework. Here we discuss S-Caffe [5] and NUMA-Caffe [68] further.

3.2.1 S-Caffe

S-Caffe, or OSU-Caffe, is the product of co-designing a CUDA-Aware MPI runtime and Caffe for data parallelism on GPU clusters. To overlap the three phases of training in Caffe, S-Caffe uses on-demand communication and multi-stage non-blocking collectives. Except for a helper thread for gradient aggregation, S-Caffe does not use multi-threading.

3.2.2 NUMA-Caffe

NUMA-Caffe is a distributed DL framework that complements the BVLC-Caffe and Intel-Caffe implementations. BVLC-Caffe and Intel-Caffe apply BLAS-level and batch-level parallelism, respectively. NUMA-Caffe adds two additional levels of parallelism: Non-Uniform Memory Access (NUMA) node-level parallelism and thread-level parallelism. It also eliminates many remote memory accesses by providing automatic data localization. NUMA-Caffe is platform-specific, builds around Caffe's implementation, and focuses on Convolutional Neural Networks (CNNs).

3.3 PyTorch DDP

PyTorch, a successor of Caffe2 (https://caffe2.ai), is an imperative DL framework developed by Facebook using dynamic computation graphs and automatic differentiation [64]. PyTorch is easy to use, debug, and customize; it deliberately trades a small percentage of speed for a considerably more intuitive model. Some PyTorch designs are admittedly susceptible to certain corner cases. PyTorch Distributed Data Parallel (DDP) [58] is an extra feature added to PyTorch to perform distributed data-parallel computations and is available as of v1.5. PyTorch RPC is being developed to support model parallelism but, as of now, is a work in progress.

PyTorch DDP utilizes several techniques engineered to increase performance in practice: gradient bucketing (small tensors are grouped into buckets, each launched as one all_reduce operation; the bucket size is a hyperparameter), overlapping communication with computation (which depends on when the first bucket becomes ready and on the backward computation order), and skipping synchronization. As these techniques are tuned to practice, the PyTorch developers acknowledge that some of the solutions are not perfect but are reliable approximations with minimal overhead. These techniques and their shortcomings are discussed in [58]. PyTorch mainly focuses on ease of use and gives users options when training their models. For instance, for training at larger scales, developers can explore enabling the no_sync mode (skipping synchronization) if it attains acceptable convergence speed.
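The skipped-synchronization idea can be checked numerically: accumulating local gradients and averaging once yields the same total update as averaging every iteration, with fewer collectives. The random worker gradients below are stand-ins, not an actual DDP run.

```python
import numpy as np

# Instead of an all_reduce per backward pass, each worker accumulates
# local gradients for several iterations and synchronizes once, trading
# gradient freshness for fewer collectives.

rng = np.random.default_rng(1)
grads = rng.normal(size=(4, 3, 5))   # 4 workers x 3 iterations x 5 params

# sync every iteration: 3 all_reduce calls
every_step = sum(grads[:, t].mean(axis=0) for t in range(3))

# skip sync for 2 iterations, then sync once: 1 all_reduce of local sums
accumulated = grads.sum(axis=1)          # local accumulation per worker
deferred = accumulated.mean(axis=0)      # single all_reduce at the end

print(np.allclose(every_step, deferred))  # same total update, 3x fewer collectives
```

The equivalence holds because averaging over workers and summing over iterations commute; what changes in real training is that intermediate parameter updates see staler gradients, which is why convergence speed must be checked.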

3.4 Horovod

Horovod [70] is a stand-alone Python library for data parallelism that works on top of another DL framework, using an optimized ring_allreduce collective and a tensor fusion algorithm. At first, Horovod only supported TensorFlow as the DL worker, but it now supports PyTorch and MXNET as well. Horovod completely replaces TensorFlow's parameter-server-based optimizer, which underutilizes resources because of its communication overhead [4], with its own synchronous optimizer. It can achieve nearly linear performance gains if the dense layers hold a small portion of all parameters; e.g., unlike VGG-16 [74], the dense layer of ResNet-50 [29] contains only a small portion of all parameters. Horovod supports model partitioning but not model or pipeline parallelism, so it can train only models that fit into a single device (possibly with multiple GPUs). Although it has one of the most optimized asynchronous collectives, in the absence of sufficient granularity, the communication overhead grows significantly with the number of nodes [93].
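The ring_allreduce that Horovod builds on can be simulated in a single process; the chunk bookkeeping below mirrors the reduce-scatter and allgather phases but is not Horovod's actual NCCL/MPI implementation.

```python
import numpy as np

# Each of N workers splits its tensor into N chunks; N-1 reduce-scatter
# steps leave each worker owning one fully reduced chunk, and N-1
# allgather steps circulate the reduced chunks around the ring.

def ring_allreduce(tensors):
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(float), n)) for t in tensors]
    # reduce-scatter: after n-1 steps, worker i owns reduced chunk (i+1) % n
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]                  # snapshot all sends first
        for i, c, val in sends:
            chunks[(i + 1) % n][c] += val            # receive and accumulate
    # allgather: circulate each fully reduced chunk around the ring
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] = val             # receive and overwrite
    return [np.concatenate(c) for c in chunks]

ts = [np.arange(6) * (w + 1) for w in range(3)]      # three worker tensors
out = ring_allreduce(ts)
print(all(np.array_equal(o, np.arange(6) * 6.0) for o in out))
```

Each worker sends and receives only 2(N-1)/N of the tensor in total, which is why this pattern is bandwidth-optimal regardless of the number of workers.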

Horovod has a simple API and its own profiling tool, Horovod Timeline. Transitioning existing TensorFlow code to distributed execution is easier with Horovod than with native TensorFlow. Horovod also has a Keras interface, which is popular in the DL community. It is optimized for GPUs but can work without them as well.

3.5 FlexFlow

FlexFlow is a DL framework with an execution optimizer that can find a strategy for any intra-iteration hybrid parallelization [44]. The execution simulator is initialized with data parallelism as well as another parallelization strategy selected at random. It can find the optimal strategy unless the search exceeds its time budget. While this strategy is the best for intra-iteration parallelization, communication between partitions is not necessarily optimal. FlexFlow is built on an AMT, the Legion [8] runtime, and is able to parallelize its task graph at the granularity of a single operation.

FlexFlow has a Keras-like interface as well as a native one. Using FlexFlow's execution optimizer is rather simple; it only takes the cluster topology and the graph corresponding to the DNN. However, FlexFlow works only on GPUs, with which many of the world's top clusters are currently not equipped.

3.6 Chainer

Chainer is the pioneering DL framework with a Define-by-Run (imperative) paradigm [83]. PyTorch is inspired by Chainer but focuses more on performance: PyTorch DDP delegates most of the intensive computations to its C++ core, while Chainer is developed purely in Python. Chainer only supports data parallelism. It has a synchronous decentralized design realized through all_reduce communication; communication scheduling is not supported.

Chainer is simple to use and debug, and this simplicity has been the focal point of its implementation: basic knowledge of Python and neural networks is sufficient to use it. GPU runs are supported through CuPy, allowing users to write CPU/GPU-agnostic code [84]. ChainerCV is an add-on package specialized for computer vision tasks, aimed at prototyping ideas by non-experts.

3.7 CNTK

The Computational Network ToolKit (CNTK) [69] is Microsoft's open-source declarative library for DL. CNTK provides four algorithms for its solver: Data-Parallel SGD, Block Momentum SGD, Model Averaging SGD, and Data-Parallel ASGD (https://docs.microsoft.com/en-us/cognitive-toolkit/multiple-gpus-and-machines). Data-Parallel SGD uses a trick called 1-bit SGD to reduce the communication volume: gradient values are compressed to a single bit each, and the difference between the original gradient and its quantized value is added to the next minibatch. Block Momentum SGD uses Blockwise Model Update and Filtering (BMUF), which requires resetting momentum to zero while maintaining frequent model synchronization to converge. Microsoft no longer recommends Model Averaging SGD since it falls behind Data-Parallel SGD and Block Momentum SGD. Data-Parallel ASGD can be useful for models that are less sensitive to noise. CNTK does not support model parallelism.
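The error-feedback mechanism behind 1-bit SGD can be sketched in a few lines; the per-tensor mean-magnitude scaling used here is one common choice and not necessarily CNTK's exact encoding.

```python
import numpy as np

# Each gradient entry is transmitted as one bit (its sign, times a shared
# scale), and the quantization error is carried over into the next
# minibatch's gradient so that errors do not accumulate over time.

def one_bit_quantize(grad, residual):
    adjusted = grad + residual               # fold in last step's error
    scale = np.mean(np.abs(adjusted))        # one shared magnitude per tensor
    q = scale * np.sign(adjusted)            # 1 bit per entry + one float
    return q, adjusted - q                   # transmitted value, new residual

rng = np.random.default_rng(2)
residual = np.zeros(4)
total_sent, total_true = np.zeros(4), np.zeros(4)
for _ in range(500):
    g = rng.normal(size=4)
    q, residual = one_bit_quantize(g, residual)
    total_sent += q
    total_true += g

# the cumulative quantization drift telescopes to the current residual,
# so the transmitted sum tracks the true gradient sum
print(np.allclose(total_true - total_sent, residual))
```

Without the residual feedback, the per-step quantization errors would add up and bias the training; with it, the drift stays bounded by a single step's error.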

CNTK has APIs in Python, C#, C++, and BrainScript, its domain-specific language for defining networks. It provides a performance profiler in Python that generates a detailed profile log, and users can obtain more debugging information by plotting the underlying graph with the logging information (https://cntk.ai/pythondocs/Manual_How_to_debug.html). CNTK has been one of the official Keras backends as of Keras v2.0. By changing an argument in the device setting, CNTK can easily switch between its CPU and GPU implementations.

3.8 BigDL

BigDL is a distributed DL framework for data parallelism built on top of Spark [16]. It does not support model parallelism. Like TensorFlow, BigDL has a parameter-server-style architecture, but unlike TensorFlow, it favors coarse-grained operations in which data transformations are immutable. BigDL processes its calculations across the nodes through the InvokeAndWait function. On a CPU cluster, BigDL computes faster than TensorFlow, benefiting from its CPU optimizations, but it suffers from long communication delays due to its dependency on the MapReduce framework [20].

BigDL is integrated into the Spark functional compute model and does not suffer from the overheads of adapting separate frameworks. It has been developed by Intel Analytics to handle streams of dynamic and messy data. TensorFlowOnSpark [76] and CaffeOnSpark [12] use a connector approach to connect to Spark, and both have an execution model that is very different from Spark's. BigDL runs a series of short (a few seconds each) Spark jobs that are scheduled by Spark. On large clusters, scheduling may become a bottleneck for Spark; therefore, on each server, BigDL is set to launch no more than one task, with multiple threads, in each iteration. A benefit of using Spark is that it comes with fault tolerance and a fair load-balancing mechanism. BigDL is modeled after Torch (http://torch.ch/) and is easy to use and debug. It works best on a single Intel Xeon CPU node.

3.9 SINGA

SINGA [63] is an imperative distributed DL framework and one of the few primarily designed with scaling in mind. SINGA supports different synchronous (Sandblaster and AllReduce) and asynchronous (Downpour and Distributed Hogwild) solvers and is able to run with different worker/server configurations. It can have multiple worker/server groups running asynchronously while each group of workers runs synchronously; therefore, SINGA is not limited to medium-sized clusters. SINGA is not well optimized, however: it assigns only one thread to each processor. SINGA exhibits good scale-out results but lags in scaling up and in accuracy [71, 91]. It supports both data and model parallelism.

Users of SINGA need a clear understanding of its layer-based programming model. They can choose to run their program on multiple nodes, but they need to configure this first. SINGA has several built-in layers, and layers are customizable as long as they are consistent with the TrainOnBatch algorithm. SINGA is one of the few frameworks that allow the user to manually yet transparently partition a layer using concatenation and slice layers, one aspect of its scalability-oriented design. SINGA has APIs in C++ and Python and runs on clusters with or without accelerators.

3.10 MXNET-MPI

MXNET-MPI [61] is the extension of MXNET that replaces each worker in a parameter server architecture with a group of workers. Workers of each group are synced together using an MPI collective operation where the solver is synchronous SGD. Therefore, for data parallelism, MXNET-MPI performs better than a fully centralized parameter server dependent architecture or a fully decentralized architecture that synchronizes parameters in a blocking fashion. Thus, overall execution time is reduced, even though there is no explicit scheduling for overlapping communication and computation. Putting workers into groups and having both synchronous and asynchronous updates is similar to SINGA’s design. MXNET does not support multi-node model parallelism.

The key-value store is the critical component for training a DNN with MXNET on multiple nodes. MXNET v1.5.1 does not support training on more than 4 GPUs, and its set-up for distributed training is not user-friendly [60].
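The grouping scheme can be sketched in plain Python/numpy. Here group_allreduce and server_update are hypothetical stand-ins for the MPI collective and the parameter-server update, not MXNET-MPI APIs:

```python
import numpy as np

def group_allreduce(gradients):
    """Stand-in for the MPI allreduce syncing workers inside one group:
    every worker ends up with the group-averaged gradient."""
    avg = np.mean(gradients, axis=0)
    return [avg.copy() for _ in gradients]

def server_update(params, group_grads, lr=0.1):
    """Parameter-server step: each group contributes one averaged
    gradient instead of one per worker, cutting server traffic."""
    for g in group_grads:
        params = params - lr * g
    return params

# Two groups of two workers each, each worker with its own gradient.
g1 = group_allreduce([np.array([1., 2., 3.]), np.array([3., 2., 1.])])
g2 = group_allreduce([np.array([2., 2., 2.]), np.array([0., 0., 0.])])
# Workers within a group are in sync after the allreduce.
assert np.allclose(g1[0], g1[1])
# The server sees one update per group, not one per worker.
params = server_update(np.zeros(3), [g1[0], g2[0]])
assert np.allclose(params, [-0.3, -0.3, -0.3])
```

The server-side traffic scales with the number of groups rather than the number of workers, which is the point of the hierarchy.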

3.11 DeepSpeed / ZeRO

ZeRO [66] approaches training large models by focusing on the memory limitation problem while attempting to minimize overhead. To train models where the memory of a single GPU is the limiting factor, ZeRO partitions activations, optimizer states, gradients, and parameters and distributes them equally over all available nodes. It then employs overlapping collective operations to reconstruct the tensors as needed. This gives DeepSpeed the memory advantages of model parallelism and pipelining while retaining the flexibility and ease of use of data parallelism.

The published DeepSpeed [67] implementation can be used as a drop-in replacement for PyTorch’s DDP (Section 3.3) and can optimize any operation derived from torch.nn.Module. Due to this fine granularity, DeepSpeed can make tailored decisions about which tensors to distribute, whether to off-load memory to the CPU, and where to limit buffer sizes to prevent out-of-memory issues in the allocator. While the current release does not include model parallelism or pipelining techniques, DeepSpeed is compatible with approaches that do.
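A minimal sketch of the ZeRO partitioning idea, using numpy arrays in place of GPU tensors and collectives. The partition and allgather functions below are illustrative helpers, not the DeepSpeed API:

```python
import numpy as np

def partition(tensor, n_ranks):
    """ZeRO-style partitioning: each rank stores only its 1/N shard of a
    (flattened) tensor such as optimizer state or parameters."""
    return np.array_split(tensor, n_ranks)

def allgather(shards):
    """Stand-in for the overlapped collective that reconstructs the full
    tensor from the shards only when a layer actually needs it."""
    return np.concatenate(shards)

params = np.arange(10, dtype=float)
shards = partition(params, 4)
# Per-rank memory drops to roughly 1/N of the full tensor ...
assert [len(s) for s in shards] == [3, 3, 2, 2]
# ... while the full tensor is recoverable on demand.
assert np.array_equal(allgather(shards), params)
```

In the real system the allgather is overlapped with computation, so the reconstruction cost is largely hidden.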

Framework | Easy to Use | License | Reference
Mesh-TensorFlow | (-) | Apache-2.0 | [72]
GPipe (compatible) | (-) | Apache-2.0 | [38]
HyPar-Flow | (++) | N/A | [6]
S-Caffe/OSU-Caffe | N/A | N/A | [5]
NUMA-Caffe | (+) | Public Domain | [68]
PyTorch DDP | (++) | BSD-style | [58]
Horovod (compatible) | (++) | Apache-2.0 | [70]
FlexFlow | (+) | Apache-2.0 | [58]
Chainer | (++) | MIT | [83]
CNTK | (++) | MIT | [69]
BigDL | (+) | Apache-2.0 | [16]
SINGA | (+) | Apache-2.0 | [63]
MXNET-MPI | (++) | Apache-2.0 | [61]
DeepSpeed (compatible) | (++) | MIT | [67]
Phylanx | (++) | Boost-1.0 | [82]
(Only available in binary form)
Data Par indicates whether the framework supports data parallelism on multiple nodes. Model Par/Pipeline is checkmarked if the framework supports any intra-iteration or inter-iteration network parallelism other than data parallelism. Comm overlaps Comp highlights whether there is any explicit attempt to prevent a sequential run of computation and communication. Sufficient Granularity represents whether parallelism is applicable at the granularity of an individual operation. Unified is checkmarked if the infrastructure that makes the training distributed is integrated into the implementation of its DL components. Easy to Use covers two measures: a simple interface for coding and for debugging. Archt. Agnostic means no modification is needed to run the code in distributed mode on different architectures or platforms. License and Reference are self-explanatory.

Table 1: Comparison between Distributed Deep Learning Frameworks

4 Phylanx as a Deep Learning Framework

Phylanx [27, 82] is an asynchronous distributed array processing framework that combines the benefits of exposing a high-productivity Python-based development environment with a high-performance C++-based execution environment targeting computer systems of any size, including cloud and HPC-cluster technologies. The framework automatically transforms the user-provided Python code into an intermediate representation that is efficiently executed and distributed across all available compute resources as specified by the user.

4.1 How Phylanx tackles design challenges of DL frameworks

Phylanx is a software framework based on the HPX runtime system, which in turn was designed from first principles to address well-known key challenges of high-performance computing applications. As such, Phylanx implicitly and naturally inherits the advantages of fine-grained parallelism, message-driven computation, constraint-based synchronization (never synchronize more than needed for the local progress of execution), implicit overlapping of computation with communication, and runtime-adaptive granularity control. In this section we describe these capabilities in more detail and demonstrate that Phylanx addresses some of the identified challenges of DL frameworks (see Table 1). We are aware that Phylanx is no “jack of all trades” solution, but it tackles many of these challenges.

DL Parallelization Approaches

In Phylanx, data is distributed by tiling the data arrays involved such that each locality the application runs on is responsible for one (or more) tiles of the data. Phylanx execution is strictly SPMD-style, i.e., each locality executes a structurally equivalent expression graph, while each part of that graph performs communication between the localities depending on the operations to be performed. The execution of this expression graph is fully asynchronous, and the necessary communication relies on asynchronous collectives, which allows communication to fully overlap with the ongoing computations. This approach seamlessly distributes the processed data arrays across a possibly large number of localities (i.e., nodes in a cluster) while maintaining the best possible scalability. Each tile of a data array handled by a locality is internally represented exactly like a fully local data array, except that it carries additional meta-information describing the whole (distributed) array. This simplifies the implementation.
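The tiling scheme can be sketched as follows. TiledArray is a hypothetical class used purely for illustration, not a Phylanx type; it pairs a locally owned tile with the meta-information describing the global array:

```python
import numpy as np

class TiledArray:
    """Sketch of the tiling idea: each locality holds one tile of a
    global array plus metadata describing the whole distributed array,
    so local code can operate on the tile like any local array."""
    def __init__(self, tile, global_shape, row_offset):
        self.tile = tile                  # the locally owned data
        self.global_shape = global_shape  # meta-information: full shape
        self.row_offset = row_offset      # where this tile sits globally

def tile_rows(data, n_localities):
    """Row-wise tiling of an array across localities."""
    tiles, offset = [], 0
    for part in np.array_split(data, n_localities, axis=0):
        tiles.append(TiledArray(part, data.shape, offset))
        offset += part.shape[0]
    return tiles

data = np.arange(12).reshape(6, 2)
tiles = tile_rows(data, 3)
# Every tile knows the global shape, and the tiles cover the array.
assert all(t.global_shape == (6, 2) for t in tiles)
assert np.array_equal(np.vstack([t.tile for t in tiles]), data)
```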

Phylanx supports overlapped tiling, which is beneficial in spatial parallelization: a halo exchange is needed in both the forward and the backward pass. When the kernel size is small compared to the data size, the user can use overlapped tiling to avoid extra communication.
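A halo exchange for a 1-D convolution can be simulated with numpy as follows; this is an illustration of the idea, not Phylanx code. Each tile borrows a halo of boundary elements from its neighbors so that tile boundaries produce no artifacts:

```python
import numpy as np

def conv1d_with_halo(tiles, kernel):
    """Spatial parallelism with halos: each tile exchanges a halo of
    (len(kernel) - 1) // 2 elements with its neighbors before a 'valid'
    convolution, so the tiled result matches the untiled one."""
    h = (len(kernel) - 1) // 2
    outputs = []
    for i, t in enumerate(tiles):
        left = tiles[i - 1][-h:] if i > 0 else np.zeros(h)
        right = tiles[i + 1][:h] if i < len(tiles) - 1 else np.zeros(h)
        padded = np.concatenate([left, t, right])      # halo exchange
        outputs.append(np.convolve(padded, kernel, mode="valid"))
    return np.concatenate(outputs)

signal = np.arange(8, dtype=float)
kernel = np.array([1., 2., 1.])   # symmetric, so flipping is harmless
tiled = conv1d_with_halo(np.array_split(signal, 2), kernel)
full = np.convolve(np.pad(signal, 1), kernel, mode="valid")
assert np.allclose(tiled, full)
```

With overlapped tiling, the halo region is stored locally from the start, so this exchange is avoided at the cost of some duplicated data.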

Communication overlaps Computation

Phylanx uses the HPX runtime system as its underlying execution platform. HPX is the C++ Standard Library for Parallelism and Concurrency [46, 32, 48]. It provides the necessary abstractions to build efficient codes that are oblivious to local and remote operations while maintaining efficient data locality. HPX has been described in detail in other publications [31, 34, 49, 51, 33]. In the context of Phylanx, we use HPX because of its dynamic scheduling and global data addressing capabilities as well as its ISO C++ standards-conforming API. The constructs it exposes are well aligned with the existing C++ standards [77, 78, 79, 80]. HPX is fundamental for Phylanx: its shared-memory abstractions have already been adopted in the most recent ISO C++ standards, and its distributed-memory abstractions are standards-conforming extensions. Using HPX’s futurization technique, the execution of Phylanx code is expressed as complex dataflow execution graphs that can generate a large number of fine-grained parallel tasks, which are scheduled to execute only when their dependencies are satisfied (see for instance [35]). This naturally and intrinsically enables overlapping computation with communication, thus hiding communication latencies.
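This futurized, dependency-driven execution can be illustrated in miniature with Python's standard concurrent.futures module; the sketch below is a plain-Python analogy, not the Phylanx or HPX API:

```python
from concurrent.futures import ThreadPoolExecutor

# Futurized dataflow in miniature: every operation returns a future
# immediately; a dependent task consumes its inputs' futures, so
# execution is driven by data availability rather than program order.
with ThreadPoolExecutor(max_workers=4) as pool:
    a = pool.submit(lambda: 2 + 3)    # runs asynchronously
    b = pool.submit(lambda: 4 * 5)    # may run concurrently with a
    # c depends on a and b; the .result() calls express dataflow edges.
    c = pool.submit(lambda x, y: x.result() + y.result(), a, b)
    total = c.result()

assert total == 25
```

In HPX the analogous futures are far lighter-weight and extend across node boundaries, which is what makes this style viable at scale.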

Phylanx achieves overlapping computation and communication using its asynchronous active-messaging communication platform (which also includes asynchronous collectives). Phylanx prefers moving work to data over moving data to work and provides means to runtime-adaptively coalesce messages into larger units (tensor fusion) [88, 87], which further reduces the latencies and overheads caused by the necessary communication operations (see Section 4.1 for more information).
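The message-coalescing (tensor fusion) idea amounts to packing many small gradient tensors into one flat buffer so that a single collective replaces many small messages. The fuse and unfuse helpers below are illustrative, not the Phylanx API:

```python
import numpy as np

def fuse(tensors):
    """Coalesce many small tensors into one flat buffer, remembering the
    original shapes so the buffer can be split apart again."""
    shapes = [t.shape for t in tensors]
    buffer = np.concatenate([t.ravel() for t in tensors])
    return buffer, shapes

def unfuse(buffer, shapes):
    """Split the fused buffer back into the original tensors."""
    out, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(buffer[start:start + size].reshape(shape))
        start += size
    return out

grads = [np.ones((2, 2)), np.arange(3.0), np.full((2,), 7.0)]
buf, shapes = fuse(grads)
assert buf.shape == (9,)            # one message instead of three
restored = unfuse(buf, shapes)
assert all(np.array_equal(a, b) for a, b in zip(grads, restored))
```

Fewer, larger messages amortize the fixed per-message latency, which is where the savings come from.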

Sufficient Granularity

Most of the analyzed DL frameworks mention fine-grained parallelism; the purpose is to have a synchronous optimizer while maintaining acceptable performance. However, with fine-grained parallelism, meaning many small computational tasks are executed, the overhead of context switching becomes significant when system threads are used.

Phylanx utilizes HPX’s light-weight user-level threading system to reduce the context-switching and synchronization overheads of small computational tasks to a minimum [47, 56]. HPX’s asynchronous execution model, in combination with its active-message based communication model, intrinsically hides the latencies exhibited by communication operations, naturally overlaps them with useful computation, and reduces synchronization overheads across nodes to a minimum. HPX’s asynchronous collectives break the strict lock-stepping between ranks and enable them to perform other work while the collective operation is in flight.

To prevent threads from becoming too short-lived (which increases the associated overheads), Phylanx employs runtime-adaptive techniques for controlling the granularity of tasks and the size of networking packages, both with the goal of reducing overheads and maximizing system utilization [88].
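A toy model of such granularity control: choose the chunk (task) size so that a fixed per-task overhead stays below a fraction of each task's useful work. The cost parameters are abstract units and chunked_sum is purely illustrative, not how Phylanx measures or adapts granularity:

```python
def chunked_sum(data, task_overhead, per_item_cost, overhead_budget=0.5):
    """Toy runtime-adaptive granularity: pick a chunk (task) size so the
    fixed per-task overhead is at most `overhead_budget` times the
    useful per-task work, then process one chunk per 'task'."""
    # overhead <= overhead_budget * (chunk * per_item_cost)
    chunk = max(1, int(task_overhead / (overhead_budget * per_item_cost)))
    total = 0
    for i in range(0, len(data), chunk):
        total += sum(data[i:i + chunk])   # one "task" per chunk
    return total, chunk

total, chunk = chunked_sum(list(range(1000)),
                           task_overhead=5.0, per_item_cost=1.0)
assert chunk == 10      # tasks coarse enough to amortize the overhead
assert total == 499500  # correctness is unaffected by the chunking
```

Halving the per-task overhead halves the chunk size the model picks, exposing more parallelism without letting overheads dominate.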

Unified

Phylanx exploits every opportunity for parallelism since it implements every required DL operation using HPX. Thus, the infrastructure that makes the training scale out is integrated into the DL framework. In contrast, we observe that not considering scaling-out requirements when designing a DL framework leaves users stitching different libraries together (the connector approach) or re-implementing components to achieve functionality.

Easy to Use

Phylanx provides a high-productivity, debuggable, Python-based interactive interface, JetLag (https://github.com/STEllAR-GROUP/JetLag) [11]. JetLag comprises a container-based Jupyter notebook interface, a performance visualization platform (Traveler) [92], and a performance measurement library (APEX [39]). The user can code in a Jupyter notebook and easily plot the node-link diagram of the execution tree as well as Gantt and utilization charts using APEX performance counters [89].

Architecture Agnostic

Whether Phylanx runs on shared memory or distributed memory is hidden from the user by the high-level abstractions provided by HPX, which offers a unified syntax and semantics for local and remote operations through its Adaptive Global Address Space (AGAS) [3, 50]. Thus, the user does not need to modify the Python code for local versus remote computations or when moving to another platform.

In addition to addressing the above challenges, Phylanx supports fault tolerance. Long-running, non-dedicated resources can be used to train a large DNN; therefore, a means of fault tolerance is essential in the event of a failure [1]. TensorFlow provides a checkpoint-based fault tolerance system, while BigDL incorporates fault tolerance using Spark’s RDD [94] on every operation. HPX provides software resilience [24] to implement fault tolerance within Phylanx as well. HPX can detect silent data corruptions, e.g., memory bit flips or CPU floating-point errors. After a corrupted computation on a node, the user can do one of the following: 1) replay the computation and use the new result if the silent data corruption has vanished, or 2) start replicates of the computation, which are executed independently. There are three possibilities to compare the replicates: a) use checksums; b) use a user-defined consensus function that returns the replicate passing the tests; or c) if multiple replicates pass the consensus function, the user provides a validate function to decide which one to use. HPX aims to resolve the problems of scalability, resiliency, power efficiency, and runtime-adaptive resource management, which continue to grow in importance as the industry faces increasing demands in supporting highly distributed heterogeneous systems.
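The replay and replication strategies can be sketched generically as follows; with_replay and with_replicates are illustrative helpers, not the HPX resilience API:

```python
def with_replay(task, validate, max_replays=3):
    """Replay sketch: re-execute a task until its result passes a
    validation check (e.g. a checksum), assuming silent data
    corruptions are transient."""
    for _ in range(max_replays):
        result = task()
        if validate(result):
            return result
    raise RuntimeError("no valid result after replays")

def with_replicates(task, n, consensus):
    """Replication sketch: run n independent replicates and let a
    consensus function pick the result to trust."""
    return consensus([task() for _ in range(n)])

# A flaky task whose first result is corrupted (simulated bit flip).
attempts = iter([999, 42])
replayed = with_replay(lambda: next(attempts), validate=lambda r: r == 42)

# Majority vote as a simple consensus over three replicates.
replies = iter([42, 999, 42])
voted = with_replicates(lambda: next(replies), 3,
                        consensus=lambda rs: max(set(rs), key=rs.count))
assert replayed == 42 and voted == 42
```

Replay trades time for memory (one execution at a time), while replication trades memory and compute for a single round of execution.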

4.2 Primary Scaling Results

We used an Intel Cascade Lake based cluster called Queen Bee, maintained by LONI [59]. This CPU cluster consists of nodes each containing -core Xeon Platinum processors. We measured the execution time of the forward pass of a -layer CNN (derived from Kaggle, https://www.kaggle.com/niyati11/project-convo1d) on a Human Activity Recognition dataset. For a mini-batch size of , we compare the execution time of Phylanx (git hash: 03ad3b8) to Horovod v0.20.0, installed on top of TensorFlow v2.3.0 and using Gloo (git hash: fe2ad9c) [42] as its communication library, in Figure 1. First, the execution time of Horovod does not significantly decrease when using more nodes, while that of Phylanx shows a notable reduction, which demonstrates the scalability of Phylanx. Second, Phylanx’s execution time is smaller than Horovod’s by at least (%) when using 32 or more nodes, which provides insight into running CNNs on larger clusters.

Figure 1: Comparison of Phylanx and Horovod executing forward propagation of a CNN on an HPC cluster up to nodes.

5 Conclusion

With the ever-increasing need to reduce the training time of modern deep neural networks, scaling out has become standard practice. To this end, we revisited the requirements for distributed training of deep neural networks and described the underlying factors for the desired framework: it must a) provide asynchronous collectives, b) offer a fine-grained execution platform, c) be unified, d) be easy to use and debug, and e) be architecture- and platform-agnostic. Most existing frameworks started as single-node computational models and were later adapted to run on GPU clusters, whereas such a system should be designed from first principles to embrace the challenges of platform-agnostic distributed computation. We showed that Phylanx offers great potential even in its current state.

Acknowledgements

The authors are grateful for the support of this work by the LSU Center for Computation & Technology and by the DTIC project: Phylanx Engine Enhancement and Visualizations Development (Contract Number: FA8075-14-D-0002/0007).

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §2.1, §4.1.
  • [2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283. Cited by: §3.1.
  • [3] P. Amini and H. Kaiser (2019) Assessing the performance impact of using an active global address space in hpx: a case for agas. In 2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM), pp. 26–33. Cited by: §4.1.
  • [4] A. A. Awan, J. Bedorf, C. Chu, H. Subramoni, and D. K. Panda (2018) Scalable distributed dnn training using tensorflow and cuda-aware mpi: characterization, designs, and performance evaluation. arXiv preprint arXiv:1810.11112. Cited by: §3.4.
  • [5] A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda (2017) S-caffe: co-designing mpi runtimes and caffe for scalable deep learning on modern gpu clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 193–205. Cited by: §1, §3.2, Table 1.
  • [6] A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and D. K. Panda (2019) HyPar-flow: exploiting mpi and keras for scalable hybrid-parallel dnn training using tensorflow. arXiv preprint arXiv:1911.05146. Cited by: §2.1, §2.1, §3.1.3, Table 1.
  • [7] Y. Bao, Y. Peng, Y. Chen, and C. Wu (2020) Preemptive all-reduce scheduling for expediting distributed dnn training. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pp. 626–635. Cited by: §2.1, §3.
  • [8] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken (2012) Legion: expressing locality and independence with logical regions. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. Cited by: §2.1, §3.5.
  • [9] T. Ben-Nun and T. Hoefler (2019) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys (CSUR) 52 (4), pp. 1–43. Cited by: §2.1, §2.1.
  • [10] G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Hérault, and J. J. Dongarra (2013) Parsec: exploiting heterogeneity to enhance scalability. Computing in Science & Engineering 15 (6), pp. 36–45. Cited by: §2.1.
  • [11] S. R. Brandt, A. Bigelow, S. A. Sakin, K. Williams, K. E. Isaacs, K. Huck, R. Tohid, B. Wagle, S. Shirzad, and H. Kaiser (2020) JetLag: an interactive, asynchronous array computing environment. In Practice and Experience in Advanced Research Computing, pp. 8–12. Cited by: §4.1.
  • [12] (2016) CaffeOnSpark2017. External Links: Link Cited by: §3.8.
  • [13] B. L. Chamberlain, D. Callahan, and H. P. Zima (2007) Parallel programmability and the chapel language. The International Journal of High Performance Computing Applications 21 (3), pp. 291–312. Cited by: §2.1.
  • [14] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz (2016) Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981. Cited by: §2.1.
  • [15] F. Chollet et al. (2015) Keras. Cited by: §2.2.
  • [16] J. J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang, Y. Wang, X. Jia, C. L. Zhang, Y. Wan, Z. Li, et al. (2019) Bigdl: a distributed deep learning framework for big data. In Proceedings of the ACM Symposium on Cloud Computing, pp. 50–60. Cited by: §3.8, Table 1.
  • [17] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, et al. (2012) Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231. Cited by: §2.1.
  • [18] N. Dryden, N. Maruyama, T. Benson, T. Moon, M. Snir, and B. Van Essen (2019) Improving strong-scaling of cnn training by exploiting finer-grained parallelism. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 210–220. Cited by: §2.1, §2.1.
  • [19] N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen (2019) Channel and filter parallelism for large-scale cnn training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–20. Cited by: §2.
  • [20] X. Du, D. Kuang, Y. Ye, X. Li, M. Chen, Y. Du, and W. Wu (2018) Comparative study of distributed deep learning tools on supercomputers. In International Conference on Algorithms and Architectures for Parallel Processing, pp. 122–137. Cited by: §3.8.
  • [21] H. C. Edwards, C. R. Trott, and D. Sunderland (2014) Kokkos: enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74 (12), pp. 3202–3216. Cited by: §2.1.
  • [22] J. D. d. S. Germain, J. McCorquodale, S. G. Parker, and C. R. Johnson (2000) Uintah: a massively parallel problem solving environment. In Proceedings the Ninth International Symposium on High-Performance Distributed Computing, pp. 33–41. Cited by: §2.1.
  • [23] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §2.1.
  • [24] N. Gupta, J. R. Mayo, A. S. Lemoine, and H. Kaiser (2020) Implementing software resiliency in hpx for extreme scale computing. arXiv preprint arXiv:2004.07203. Cited by: §4.1.
  • [25] J. Han, L. Xu, M. M. Rafique, A. R. Butt, and S. Lim (2019) A quantitative study of deep learning training on heterogeneous supercomputers. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–12. Cited by: §3.1.
  • [26] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons (2018) Pipedream: fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377. Cited by: §2.1.
  • [27] H. Kaiser, P. Amini, S. R. Brandt, B. Hasheminezhad, B. Wagle, R. Tohid, N. Wu, and S. Shirzad (2020) MPET - a C++ expression template library for arithmetic operations. Note: https://github.com/STEllAR-GROUP/phylanx External Links: Link Cited by: §4.
  • [28] S. H. Hashemi, S. A. Jyothi, and R. H. Campbell (2018) TicTac: accelerating distributed deep learning with communication scheduling. arXiv preprint arXiv:1803.03288. Cited by: §3.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.4.
  • [30] K. He, B. Liu, Y. Zhang, A. Ling, and D. Gu (2019) FeCaffe: fpga-enabled caffe with opencl for deep learning training and inference on intel stratix 10. arXiv preprint arXiv:1911.08905. Cited by: §3.2.
  • [31] T. Heller, H. Kaiser, and K. Iglberger (2012) Application of the ParalleX Execution Model to Stencil-based Problems. In Proceedings of the International Supercomputing Conference ISC’12, Hamburg, Germany, External Links: Link Cited by: §4.1.
  • [32] T. Heller, P. Diehl, Z. Byerly, J. Biddiscombe, and H. Kaiser (2017) HPX – An open source C++ Standard Library for Parallelism and Concurrency. In Proceedings of OpenSuCo 2017, Denver, Colorado USA, November 2017 (OpenSuCo 17), pp. 5. Cited by: §4.1.
  • [33] T. Heller, H. Kaiser, P. Diehl, D. Fey, and M. A. Schweitzer (2016) Closing the Performance Gap with Modern C++. In High Performance Computing, M. Taufer, B. Mohr, and J. M. Kunkel (Eds.), Lecture Notes in Computer Science, Vol. 9945, pp. 18–31. Cited by: §4.1.
  • [34] T. Heller, H. Kaiser, A. Schäfer, and D. Fey (2013) Using HPX and LibGeoDecomp for Scaling HPC Applications on Heterogeneous Supercomputers. In Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA ’13, New York, NY, USA, pp. 1:1–1:8. External Links: ISBN 978-1-4503-2508-0, Link, Document Cited by: §4.1.
  • [35] T. Heller, B. A. Lelbach, K. A. Huck, J. Biddiscombe, P. Grubel, A. E. Koniges, M. Kretz, D. Marcello, D. Pfander, A. Serio, J. Frank, G. C. Clayton, D. Pflüger, D. Eder, and H. Kaiser (2019) Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars. The International Journal of High Performance Computing Applications 0 (0), pp. 1094342018819744. External Links: Document, Link, https://doi.org/10.1177/1094342018819744 Cited by: §4.1.
  • [36] A. HeydariGorji, S. Rezaei, M. Torabzadehkashi, H. Bobarshad, V. Alves, and P. H. Chou (2020) HyperTune: dynamic hyperparameter tuning for efficient distribution of dnn training over heterogeneous systems. arXiv preprint arXiv:2007.08077. Cited by: §3.
  • [37] A. HeydariGorji, M. Torabzadehkashi, S. Rezaei, H. Bobarshad, V. Alves, and P. H. Chou (2020) STANNIS: low-power acceleration of deep neuralnetwork training using computational storage. arXiv preprint arXiv:2002.07215. Cited by: §3.
  • [38] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. (2019) Gpipe: efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pp. 103–112. Cited by: §2.1, §2.1, §3.1.1, §3.1, Table 1.
  • [39] M. Huck, A. Porterfield, N. Chaimov, H. Kaiser, A. Malony, T. Sterling, and R. Fowler (2015-07) An autonomic performance environment for exascale. Supercomput. Front. Innov.: Int. J. 2 (3), pp. 49–66. External Links: ISSN 2409-6008 Cited by: §4.1.
  • [40] J. D. Hunter (2007) Matplotlib: a 2d graphics environment. Computing in science & engineering 9 (3), pp. 90–95. Cited by: §2.2.
  • [41] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer (2016) Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600. Cited by: §3.2.
  • [42] F. Incubator (2017) Gloo: collective communications library with various primitives for multi-machine training. Cited by: §4.2.
  • [43] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. Cited by: §3.2.
  • [44] Z. Jia, M. Zaharia, and A. Aiken (2018) Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358. Cited by: §2.1, §2, §3.5, §3.
  • [45] W. Jiang, Y. Zhang, P. Liu, J. Peng, L. T. Yang, G. Ye, and H. Jin (2020) Exploiting potential of deep neural networks by layer-wise fine-grained parallelism. Future Generation Computer Systems 102, pp. 210–221. Cited by: §2.1.
  • [46] H. Kaiser, B. A. L. aka wash, T. Heller, M. Simberg, A. Bergé, J. Biddiscombe, aurianer, A. Bikineev, G. Mercer, A. Schäfer, K. Huck, A. S. Lemoine, T. Kwon, J. Habraken, M. Anderson, M. Copik, S. R. Brandt, M. Stumpf, D. Bourgeois, D. Blank, S. Jakobovits, V. Amatya, rstobaugh, L. Viklund, Z. Khatami, P. Diehl, T. Pathak, D. Bacharwar, S. Yang, and E. Schnetter (2020-02) STEllAR-GROUP/hpx: HPX V1.4.1: The C++ Standards Library for Parallelism and Concurrency. Zenodo. External Links: Document, Link Cited by: §4.1.
  • [47] H. Kaiser, M. Brodowicz, and T. Sterling (2009) Parallex an advanced parallel execution model for scaling-impaired applications. In 2009 International Conference on Parallel Processing Workshops, pp. 394–401. Cited by: §4.1.
  • [48] H. Kaiser, P. Diehl, A. S. Lemoine, B. A. Lelbach, P. Amini, A. Berge, J. Biddiscombe, S. R. Brandt, N. Gupta, T. Heller, K. Huck, Z. Khatami, A. Kheirkhahan, A. Reverdell, S. Shirzad, M. Simberg, B. Wagle, W. Wei, and T. Zhang (2020) HPX - the c++ standard library for parallelism and concurrency. Journal of Open Source Software 5 (53), pp. 2352. External Links: Document, Link Cited by: §2.1, §4.1.
  • [49] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey (2014) HPX: A Task Based Programming Model in a Global Address Space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS ’14, New York, NY, USA, pp. 6:1–6:11. External Links: ISBN 978-1-4503-3247-7, Link, Document Cited by: §4.1.
  • [50] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio, and D. Fey (2014) Hpx: a task based programming model in a global address space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, pp. 1–11. Cited by: §4.1.
  • [51] H. Kaiser, T. Heller, D. Bourgeois, and D. Fey (2015) Higher-level parallelization for local and distributed asynchronous task-based programming. In Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, ESPM ’15, New York, NY, USA, pp. 29–37. External Links: ISBN 978-1-4503-3996-4, Link, Document Cited by: §4.1.
  • [52] L. V. Kale and S. Krishnan (1993) Charm++ a portable concurrent object oriented system based on c++. In Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications, pp. 91–108. Cited by: §2.1.
  • [53] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §2.1.
  • [54] A. Krizhevsky (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997. Cited by: §2.1.
  • [55] A. Kulkarni and A. Lumsdaine (2019) A comparative study of asynchronous many-tasking runtimes: cilk, charm++, parallex and am++. arXiv preprint arXiv:1904.00518. Cited by: §2.1.
  • [56] G. Laberge, S. Shirzad, P. Diehl, H. Kaiser, S. Prudhomme, and A. Lemoine (2019) Scheduling optimization of parallel linear algebra algorithms using supervised learning. arXiv preprint arXiv:1909.03947. Cited by: §4.1.
  • [57] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra (2015) Why m heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314. Cited by: §3.2.
  • [58] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, et al. (2020) PyTorch distributed: experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704. Cited by: §3.3, §3.3, Table 1.
  • [59] Louisiana optical network initiative. External Links: Link Cited by: §4.2.
  • [60] S. Mahon, S. Varrette, V. Plugaru, F. Pinel, and P. Bouvry (2020) Performance analysis of distributed and scalable deep learning. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 760–766. Cited by: §3.10.
  • [61] A. R. Mamidala, G. Kollias, C. Ward, and F. Artico (2018) MXNET-mpi: embedding mpi parallelism in parameter server task model for scaling deep learning. arXiv preprint arXiv:1801.03855. Cited by: §3.10, Table 1.
  • [62] W. McKinney et al. (2011) Pandas: a foundational python library for data analysis and statistics. Python for High Performance and Scientific Computing 14 (9). Cited by: §2.2.
  • [63] B. C. Ooi, K. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang, et al. (2015) SINGA: a distributed deep learning platform. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 685–688. Cited by: §3.9, Table 1.
  • [64] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §2, §3.3.
  • [65] Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo (2019) A generic communication scheduler for distributed dnn training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 16–29. Cited by: §1, §2.1, §3.
  • [66] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2019) Zero: memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054. Cited by: §3.11.
  • [67] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506. Cited by: §2.1, §3.11, Table 1.
  • [68] P. Roy, S. L. Song, S. Krishnamoorthy, A. Vishnu, D. Sengupta, and X. Liu (2018) NUMA-caffe: numa-aware deep learning neural networks. ACM Transactions on Architecture and Code Optimization (TACO) 15 (2), pp. 1–26. Cited by: §1, §3.2, Table 1.
  • [69] F. Seide and A. Agarwal (2016) CNTK: microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. Cited by: §3.7, Table 1.
  • [70] A. Sergeev and M. Del Balso (2018) Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799. Cited by: §2, §3.4, Table 1.
  • [71] S. Shams, R. Platania, K. Lee, and S. Park (2017) Evaluation of deep learning frameworks over different hpc architectures. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 1389–1396. Cited by: §3.9.
  • [72] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. (2018) Mesh-tensorflow: deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414–10423. Cited by: §3.1.1, §3.1, Table 1.
  • [73] J. Shen, P. Nguyen, Y. Wu, Z. Chen, M. X. Chen, Y. Jia, A. Kannan, T. Sainath, Y. Cao, C. Chiu, et al. (2019) Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295. Cited by: §3.1.2.
  • [74] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.4.
  • [75] K. Siu, D. M. Stuart, M. Mahmoud, and A. Moshovos (2018) Memory requirements for convolutional neural network hardware accelerators. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 111–121. Cited by: §2.1.
  • [76] (2017) TensorFlowOnSpark. Cited by: §3.8.
  • [77] The C++ Standards Committee (2011) ISO International Standard ISO/IEC 14882:2011, Programming Language C++. Technical report, Geneva, Switzerland: International Organization for Standardization (ISO). Note: http://www.open-std.org/jtc1/sc22/wg21 Cited by: §4.1.
  • [78] The C++ Standards Committee (2014) ISO International Standard ISO/IEC 14882:2014, Programming Language C++. Technical report, Geneva, Switzerland: International Organization for Standardization (ISO). Note: http://www.open-std.org/jtc1/sc22/wg21 Cited by: §4.1.
  • [79] The C++ Standards Committee (2017) ISO International Standard ISO/IEC 14882:2017, Programming Language C++. Technical report, Geneva, Switzerland: International Organization for Standardization (ISO). Note: http://www.open-std.org/jtc1/sc22/wg21 Cited by: §4.1.
  • [80] The C++ Standards Committee (2020) ISO International Standard ISO/IEC 14882:2020, Programming Language C++. Technical report, Geneva, Switzerland: International Organization for Standardization (ISO). Note: http://www.open-std.org/jtc1/sc22/wg21 Cited by: §4.1.
  • [81] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan, et al. (2018) A taxonomy of task-based parallel programming technologies for high-performance computing. The Journal of Supercomputing 74 (4), pp. 1422–1434. Cited by: §2.1.
  • [82] R. Tohid, B. Wagle, S. Shirzad, P. Diehl, A. Serio, A. Kheirkhahan, P. Amini, K. Williams, K. Isaacs, K. Huck, S. Brandt, and H. Kaiser (2018) Asynchronous execution of python code on task-based runtime systems. In 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), Vol. , pp. 37–45. Cited by: Table 1, §4.
  • [83] S. Tokui, R. Okuta, T. Akiba, Y. Niitani, T. Ogawa, S. Saito, S. Suzuki, K. Uenishi, B. Vogel, and H. Yamazaki Vincent (2019) Chainer: a deep learning framework for accelerating the research cycle. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2002–2011. Cited by: §3.6, Table 1.
  • [84] S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), Vol. 5, pp. 1–6. Cited by: §3.6.
  • [85] D. A. Vink, A. Rajagopal, S. I. Venieris, and C. Bouganis (2020) Caffe Barista: brewing Caffe with FPGAs in the training loop. arXiv preprint arXiv:2006.13829. Cited by: §3.2.
  • [86] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al. (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17 (3), pp. 261–272. Cited by: §2.2.
  • [87] B. Wagle, S. Kellar, A. Serio, and H. Kaiser (2018) Methodology for adaptive active message coalescing in task based runtime systems. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1133–1140. Cited by: §4.1.
  • [88] B. Wagle, M. A. H. Monil, K. Huck, A. D. Malony, A. Serio, and H. Kaiser (2019) Runtime adaptive task inlining on asynchronous multitasking runtime systems. In Proceedings of the 48th International Conference on Parallel Processing, pp. 1–10. Cited by: §4.1, §4.1.
  • [89] B. Wagle, M. A. H. Monil, K. Huck, A. D. Malony, A. Serio, and H. Kaiser (2019) Runtime adaptive task inlining on asynchronous multitasking runtime systems. In Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019, New York, NY, USA. ISBN 9781450362955. Cited by: §4.1.
  • [90] S. van der Walt, S. C. Colbert, and G. Varoquaux (2011) The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13 (2), pp. 22–30. Cited by: §2.2.
  • [91] W. Wang, G. Chen, A. T. T. Dinh, J. Gao, B. C. Ooi, K. Tan, and S. Wang (2015) SINGA: putting deep learning in the hands of multimedia users. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 25–34. Cited by: §3.9.
  • [92] K. Williams, A. Bigelow, and K. Isaacs (2019) Visualizing a moving target: a design study on task parallel programs in the presence of evolving data and concerns. IEEE transactions on visualization and computer graphics 26 (1), pp. 1118–1128. Cited by: §4.1.
  • [93] X. Wu, V. Taylor, J. M. Wozniak, R. Stevens, T. Brettin, and F. Xia (2018) Performance, power, and scalability analysis of the Horovod implementation of the CANDLE NT3 benchmark on the Cray XC40 Theta. In SC18 Workshop on Python for High-Performance and Scientific Computing, Dallas, USA. Cited by: §3.4.
  • [94] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15–28. Cited by: §4.1.
  • [95] F. Zhou and G. Cong (2017) On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012. Cited by: §2.1.