1 Introduction
Over the past few years, deep neural networks (DNNs) have driven advances in many practical problems, such as image classification [28, 38], speech recognition [20, 8], machine translation [42, 9], and game playing [37]. Because sophisticated DNN models [23, 40] and larger training datasets [16, 11] have increased the computational requirements to train DNN models, it is now standard practice to parallelize training across distributed heterogeneous clusters [7, 15].
Although DNN applications and the clusters used to parallelize them are increasingly complex, the strategies used by today’s deep learning systems (e.g., TensorFlow [7], PyTorch [6], Caffe2 [2], MXNet [12]) to parallelize training remain simple. The most common parallelization technique is data parallelism [28], which places a replica of the entire neural network on each device, so that each device processes a subset of the training data and synchronizes network parameters in different replicas at the end of an iteration. Data parallelism is efficient for compute-intensive DNN operations with a few trainable parameters (e.g., convolution) but achieves suboptimal parallelization performance for operations with a large number of parameters (e.g., matrix multiplication). Another common parallelization strategy is model parallelism [15], which assigns disjoint subsets of a neural network each to a dedicated device. Model parallelism eliminates parameter synchronization between devices but requires data transfers between operations and disallows parallelism within an operation.
Previous work [27, 42] has proposed expert-designed strategies that manually optimize parallelization based on human experts’ domain knowledge and intuitions. For example, [27] uses data parallelism for convolutional and pooling layers and switches to model parallelism for fully-connected layers to accelerate training convolutional neural networks. Expert-designed strategies achieve improved performance compared to data and model parallelism but still result in suboptimal behaviors. Section 8 shows that we are able to find parallelization strategies that are up to 2.3× faster than expert-designed strategies.
In addition to these manually designed parallelization strategies, recent work has proposed automated frameworks [33, 25] for finding efficient parallelization strategies in a limited search space. For example, REINFORCE [33] uses a reinforcement learning model to learn efficient operation assignments for model parallelism by running diverse strategies on real devices. As another example, OptCNN [25] is designed for parallelizing DNNs with linear computation graphs (e.g., AlexNet [28], VGG [38]) and automatically finds strategies that exploit parallelism within each DNN operation. Existing automated frameworks only explore either parallelism across different operations (e.g., REINFORCE) or parallelism within a single operation (e.g., OptCNN) and therefore miss faster strategies that use parallelism in both dimensions. We show that exploring a broader search space discovers parallelization strategies 1.2-3.8× faster than existing automated frameworks (see Section 8).
In this paper, we present FlexFlow, a deep learning framework that automatically finds fast parallelization strategies over a significantly broader search space than previous systems. To formalize the problem, we first define the SOAP (Sample-Operation-Attribute-Parameter) search space of parallelization strategies for DNNs. The operation dimension describes how different operations in a DNN are parallelized. In addition, for a single DNN operation, the sample and parameter dimensions indicate how training samples and model parameters are distributed across devices. Finally, the attribute dimension defines how different attributes within a sample are partitioned. Compared to existing systems that parallelize DNNs in a subset of SOAP dimensions, FlexFlow considers parallelizing DNNs in all these dimensions and therefore defines a more comprehensive search space that includes existing approaches as special cases.
A key challenge with the much larger SOAP search space is effectively evaluating candidate parallelization strategies to find an efficient one. Prior work such as REINFORCE [33] relies on executing each parallelization strategy on the hardware for one iteration to measure its execution time. Unfortunately, this approach becomes prohibitively expensive with the multiple orders of magnitude larger SOAP search space.
To address this problem, FlexFlow introduces a novel execution simulator that is accurate for predicting the performance of a parallelization strategy and is three orders of magnitude faster than profiling real executions. The challenge in designing the simulator is how to accurately estimate the execution time of different DNN operators (e.g., convolution and matrix multiplication), which scale non-linearly in a hardware-dependent way with the data. The FlexFlow simulator relies on the following two facts: (1) many DNN models use a small number of distinct operators (e.g., a neural machine translation model [42] with hundreds of operators only uses four distinct operators); and (2) the execution time of each DNN operator is typically low-variance and largely independent of the contents of the input data.
The FlexFlow simulator measures the execution time of an operation once for each input size and uses the measured time to predict all operations with the same type, which only takes tens of milliseconds. These estimates are then used to predict the performance of a wide variety of parallelization strategies. In addition, the execution simulator uses a delta simulation algorithm that simulates a new strategy using incremental updates to previous simulations. Compared to existing approaches [33, 32] that measure the performance from real executions, our approach has two advantages. First, the FlexFlow simulator is much faster. As a comparison, REINFORCE [33] requires 12-27 hours to find an efficient operation assignment for model parallelism on 4 GPUs, while the FlexFlow simulator enables exploring a more comprehensive search space and finding better parallelization strategies (with 3.4-3.8× higher throughput than REINFORCE) in 14-40 seconds. Second, the simulator requires far fewer resources: REINFORCE uses 160 compute nodes (with 4 GPUs on each node) to find an efficient strategy in tens of hours, while our experiments use only a single compute node for the simulator.
The execution simulator also achieves high accuracy for predicting parallelization performance. We evaluate the simulator with six real-world DNNs on two different GPU clusters and show that, for all the measured executions, the relative difference between the real and simulated execution time is less than 30%. Most importantly for the search, we test different strategies for a given DNN application and show that their simulated execution time preserves real execution time ordering.
Using the execution simulator as an oracle, the FlexFlow execution optimizer uses a general Markov Chain Monte Carlo (MCMC) search algorithm (other search strategies could also be used) to explore the SOAP search space and iteratively propose candidate strategies based on the simulated performance of previous candidates. When the search procedure is finished, the execution optimizer returns the best strategy it has discovered.
We evaluate FlexFlow on a variety of real-world DNN benchmarks including image classification [28, 22, 40], text classification [26], language modeling [43], and neural machine translation [42]. Compared to data/model parallelism and expert-designed parallelization strategies [27, 42], FlexFlow increases training throughput by up to 3.3×, reduces communication costs by up to 5×, and achieves significantly better scaling. In addition, FlexFlow outperforms the strategies found by REINFORCE by 3.4-3.8× on the same hardware configuration evaluated in REINFORCE, and outperforms OptCNN by 1.2-1.6×, by supporting a broader search space.
To summarize, our contributions are:

We define the SOAP search space for parallelizing DNN applications, which includes strategies that parallelize in any combination of the sample, operation, attribute, and parameter dimensions.

We show that under reasonable assumptions it is possible to reliably predict the execution time of parallelized DNNs using a simulator that is three orders of magnitude faster than actually running the DNNs directly on the hardware.

We describe FlexFlow, a deep learning framework that can search for and execute strategies from the entire SOAP space to accelerate DNN training.

We show that FlexFlow can increase training throughput by up to 3.8× over state-of-the-art parallelization approaches while improving scalability.
2 Related Work
Data and model parallelism have been widely used by existing deep learning systems (e.g., TensorFlow [7], Caffe2 [2], and PyTorch [6]) to distribute the training process across devices. Data parallelism [28] keeps a copy of an entire DNN on each device, which is inefficient for operations with a large number of parameters (e.g., densely-connected layers) and becomes a scalability bottleneck in large-scale distributed training. Model parallelism [9, 15] splits a DNN into disjoint subsets and trains each subset on a dedicated device, which reduces communication costs for synchronizing network parameters in a DNN but exposes limited parallelism.
Expert-designed parallelization strategies manually optimize parallelization for specific DNNs by using experts’ domain knowledge and experience. For example, [27] introduces “one weird trick” that uses data parallelism for convolutional and pooling layers and switches to model parallelism for densely-connected layers to accelerate convolutional neural networks. To parallelize recurrent neural networks, [42] uses data parallelism that replicates the entire DNN on each compute node and switches to model parallelism for intra-node parallelization. Although these expert-designed parallelization strategies achieve performance improvement over data and model parallelism, they are suboptimal. We use these expert-designed strategies as baselines in our experiments and show that FlexFlow can further improve training performance by up to 2.3×.
Automated frameworks have been proposed for finding efficient parallelization strategies in a limited search space. REINFORCE [33] uses reinforcement learning to find efficient device placement for model parallelism. OptCNN [25] is designed for parallelizing DNNs with linear computation graphs and automatically finds efficient strategies that exploit parallelism within an operation.
Figure 1 summarizes the parallelism dimensions explored by existing approaches. Data parallelism uses the sample dimension to parallelize the training process, while model parallelism exploits the parameter and operation dimensions. Expert-designed strategies [27, 42] exploit parallelism in the sample or parameter dimension to parallelize an operation but do not support hybrid parallelism that uses a combination of the sample, attribute, and parameter dimensions to parallelize an operation (see Figure 3). Compared to these manually designed strategies, FlexFlow considers more sophisticated, and often more efficient, strategies to parallelize a single operation. In addition, compared to existing automated frameworks [33, 25], FlexFlow explores a significantly broader search space and is able to find strategies that are up to 3.8× faster.
Graph-based cluster schedulers. Previous work [24, 18] has proposed cluster schedulers that schedule cluster-wide tasks by using graph-based algorithms. For example, Quincy [24] maps task scheduling to a flow network and uses a min-cost max-flow (MCMF) algorithm to find efficient task placement. Firmament [18] generalizes Quincy by employing multiple MCMF optimization algorithms to reduce task placement latencies. Existing graph-based schedulers optimize task placement by assuming a fixed task graph. However, FlexFlow solves a different problem that requires jointly optimizing how to partition an operation into tasks by exploiting parallelism in the SOAP dimensions and how to assign tasks to devices.
3 Overview
In this section, we compare the FlexFlow programming interface with other frameworks in Section 3.1, provide a general overview of FlexFlow in Section 3.2, and discuss the limitations of our approach in Section 3.3.
3.1 Programming Interface
Similar to existing deep learning systems [7, 6, 2], FlexFlow uses an operator graph $\mathcal{G}$ to describe all operations and state in a DNN. Each node $o_i \in \mathcal{G}$ is an operation (e.g., matrix multiplication or convolution), and each edge $(o_i, o_j) \in \mathcal{G}$ is a tensor (i.e., an $n$-dimensional array) that is an output of $o_i$ and an input of $o_j$.
As far as we know, most deep learning systems (e.g., TensorFlow [7], PyTorch [6], and Caffe2 [2]) use data parallelism as the default parallelization strategy and support model parallelism as an alternative by allowing users to manually specify the device placement for each operation.
In contrast, FlexFlow takes a device topology $\mathcal{D}$ describing all available hardware devices and their interconnections, as shown in Figure 2. Each node $d_i \in \mathcal{D}$ represents a device (e.g., a CPU or a GPU), and each edge $(d_i, d_j) \in \mathcal{D}$ is a hardware connection (e.g., an NVLink, a PCIe, or a network link) between devices $d_i$ and $d_j$. The edges are labeled with the bandwidth and latency of the connection.
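As an illustration, a device topology of this kind could be represented as a small labeled graph. The sketch below is not FlexFlow's actual interface: the `DeviceTopology` and `Link` names, and all bandwidth and latency figures, are hypothetical placeholders chosen only to show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    bandwidth: float  # bytes per second (illustrative value supplied by the user)
    latency: float    # seconds


class DeviceTopology:
    """Undirected graph: nodes are devices, edges are hardware connections."""

    def __init__(self):
        self.devices = {}  # device name -> kind, e.g. "GPU" or "CPU"
        self.links = {}    # frozenset({a, b}) -> Link

    def add_device(self, name, kind):
        self.devices[name] = kind

    def connect(self, a, b, bandwidth, latency):
        if a not in self.devices or b not in self.devices:
            raise KeyError("both endpoints must be registered devices")
        self.links[frozenset((a, b))] = Link(bandwidth, latency)

    def link(self, a, b):
        """Look up the connection between two devices, in either order."""
        return self.links[frozenset((a, b))]


# One node with a CPU and two GPUs; all numbers are made up for the example.
topo = DeviceTopology()
topo.add_device("cpu0", "CPU")
topo.add_device("gpu0", "GPU")
topo.add_device("gpu1", "GPU")
topo.connect("gpu0", "gpu1", bandwidth=20e9, latency=5e-6)   # e.g. an NVLink
topo.connect("cpu0", "gpu0", bandwidth=12e9, latency=10e-6)  # e.g. PCIe
topo.connect("cpu0", "gpu1", bandwidth=12e9, latency=10e-6)
```

Keying links by an unordered pair keeps the graph undirected, which matches the description of edges as connections between two devices rather than one-way channels.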
FlexFlow automatically finds a parallelization strategy for an operator graph and a device topology. Compared to existing frameworks, FlexFlow has two advantages:
Programmability. For DNN applications with complex operator graphs running on clusters with deep device topologies, it is difficult for application developers, even domain experts, to manually design efficient operation assignments. FlexFlow takes the responsibility for finding efficient parallelization strategies and provides a more productive programming interface.
Portability. A parallelization strategy fine-tuned for one cluster may behave poorly on other clusters. FlexFlow’s search method automatically selects an efficient strategy for each hardware configuration, without requiring application changes.
3.2 FlexFlow Architecture
The main components of FlexFlow are shown in Figure 2. The FlexFlow execution optimizer takes an operator graph and a device topology as inputs and automatically generates an efficient parallelization strategy. The optimizer uses an MCMC search algorithm to explore the space of possible parallelization strategies and iteratively proposes candidate strategies that are evaluated by an execution simulator. The execution simulator uses a delta simulation algorithm that simulates a new strategy using incremental updates to previous simulations. The simulated execution time guides the search in generating future candidates. When the search time budget is exhausted, the execution optimizer sends the best discovered strategy to a distributed runtime for parallelizing the actual executions.
3.3 Limitations
The main limitation of our approach is that the execution simulator assumes the execution time of each operation is predictable and independent of the contents of input tensors, as we discuss in Section 5. Therefore, our approach may not be applicable to applications whose execution time is data dependent. However, for the DNN applications that are the subject of study here, which are based on dense matrix operations, execution time is highly predictable and independent of the contents of the matrices.
4 The SOAP Search Space
This section introduces the SOAP search space of parallelization strategies for DNNs. To parallelize a DNN operation across devices, we require each device to compute a disjoint subset of the operation’s output tensors. Therefore, we model the parallelization of an operation $o_i$ by defining how the output tensor of $o_i$ is partitioned.
Table 1: Parallelizable dimensions for different operations. The sample and channel dimensions index different samples and neurons in a tensor, respectively. For 1D and 2D images, the length and the combination of height and width dimensions specify a position in an image.
Operation | (S)ample | (A)ttribute | (P)arameter
1D pooling | sample | length, channel | -
1D convolution | sample | length | channel
2D convolution | sample | height, width | channel
Matrix multiplication | sample | - | channel
For an operation $o_i$, we define its parallelizable dimensions $\mathcal{P}_i$ as the set of all divisible dimensions in its output tensor. $\mathcal{P}_i$ always includes a sample dimension. For every other dimension in $\mathcal{P}_i$, we call it a parameter dimension if partitioning over that dimension requires splitting the model parameters and call it an attribute dimension otherwise. Table 1 shows the parallelizable dimensions of some example operations. Finally, we also consider parallelism across different operations in the operation dimension.
A parallelization configuration $c_i$ of an operation $o_i$ defines how the operation is parallelized across multiple devices. Figure 3 shows some example configurations for parallelizing a 1D convolution operation in a single dimension as well as combinations of multiple dimensions.
For each dimension in the set $\mathcal{P}_i$ of parallelizable dimensions of $o_i$, $c_i$ includes a positive integer that is the degree of parallelism in that dimension. $|c_i|$ is the product of the parallelism degrees for all parallelizable dimensions of $c_i$. We use equal-size partitions in each dimension to guarantee well-balanced workload distributions. A parallelization configuration $c_i$ partitions the operation $o_i$ into $|c_i|$ independent tasks, denoted as $t_{i:1}, \ldots, t_{i:|c_i|}$; $c_i$ also includes the device assignment for each task $t_{i:k}$ ($1 \le k \le |c_i|$). Given the output tensor of a task and its operation type, we can infer the necessary input tensors to execute each task.
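To make the relationship between parallelism degrees and tasks concrete, the following sketch partitions an output tensor into equal-size blocks, one per task. It is only an illustration under stated assumptions: the function names, the dictionary-based representation, the tensor shape, and the device names are all hypothetical and are not taken from FlexFlow.

```python
from itertools import product
from math import prod


def num_tasks(degrees):
    """Number of tasks: product of the parallelism degrees over all dimensions."""
    return prod(degrees.values())


def partition_output(shape, degrees):
    """Split an output tensor into equal-size blocks, one block per task.

    shape:   dict mapping dimension name -> extent of the output tensor
    degrees: dict mapping dimension name -> degree of parallelism (default 1)
    Returns a list of tasks; each task is a tuple of (dim, start, stop) ranges.
    """
    per_dim = []
    for dim, extent in shape.items():
        k = degrees.get(dim, 1)
        if extent % k != 0:
            raise ValueError(f"{dim} extent {extent} is not divisible by {k}")
        step = extent // k
        per_dim.append([(dim, i * step, (i + 1) * step) for i in range(k)])
    return list(product(*per_dim))


# Matrix multiplication output with 64 samples and 1024 channels, split
# 2 ways in the sample dimension and 2 ways in the channel (parameter) dimension.
shape = {"sample": 64, "channel": 1024}
degrees = {"sample": 2, "channel": 2}
tasks = partition_output(shape, degrees)
devices = ["gpu0", "gpu1", "gpu2", "gpu3"]   # hypothetical device assignment
assignment = dict(zip(tasks, devices))
```

With these inputs the configuration yields four tasks, each owning a 32 × 512 block of the output, and `assignment` pairs every block with one device.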
Figure 4 shows an example parallelization configuration for a matrix multiplication operation. The operation is partitioned into four independent tasks assigned to dedicated GPU devices. The input and output tensors of the tasks are shown in the figure.
A parallelization strategy $\mathcal{S}$ describes one possible parallelization of an application. $\mathcal{S}$ includes a parallelization configuration $c_i$ for each operation $o_i$, and each $o_i$’s configuration can be chosen independently from among all possible configurations for $o_i$.
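Because each operation's configuration is chosen independently, the number of strategies is the product of the per-operation configuration counts. The short sketch below illustrates this; the operation names and the candidate configurations (written as hypothetical `(sample degree, parameter degree)` pairs) are invented for the example.

```python
import random
from math import prod


def num_strategies(configs_per_op):
    """Size of the strategy space when configurations are chosen independently."""
    return prod(len(configs) for configs in configs_per_op.values())


def random_strategy(configs_per_op, rng):
    """Pick one configuration per operation, independently of the others."""
    return {op: rng.choice(configs) for op, configs in configs_per_op.items()}


# Hypothetical candidate configurations: (sample degree, parameter degree).
configs_per_op = {
    "conv1": [(1, 1), (2, 1), (4, 1)],
    "conv2": [(1, 1), (2, 1), (4, 1)],
    "fc":    [(1, 1), (1, 2), (2, 2), (1, 4)],
}
size = num_strategies(configs_per_op)          # 3 * 3 * 4 = 36
strategy = random_strategy(configs_per_op, random.Random(0))
```

With $n$ operations and $k$ candidate configurations each, the same product gives $k^n$ strategies, which is why exhaustive enumeration quickly becomes impractical.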
5 Execution Simulator
In this section, we describe the execution simulator, which takes an operator graph $\mathcal{G}$, a device topology $\mathcal{D}$, and a parallelization strategy $\mathcal{S}$ as inputs and predicts the execution time to run $\mathcal{G}$ on $\mathcal{D}$ using strategy $\mathcal{S}$. FlexFlow simulates the execution process instead of measuring the elapsed time from real executions for two reasons. First, processing one iteration of a DNN application can take seconds even on modern GPUs [19, 7]. The simulator runs up to three orders of magnitude faster than real executions and allows the execution optimizer to explore many more candidates in a given time budget. Second, the execution simulator requires fewer computation resources. A large-scale execution on thousands of devices can be simulated on a single node.
The simulator depends on the following assumptions:

A1. The execution time of each task is predictable with low variance and is independent of the contents of input tensors.

A2. For each connection $(d_i, d_j)$ between devices $d_i$ and $d_j$ with bandwidth $b$, transferring a tensor of size $s$ from $d_i$ to $d_j$ takes $s/b$ time (i.e., the communication bandwidth can be fully utilized).

A3. Each device processes the assigned tasks with a FIFO (first-in-first-out) scheduling policy. This is the policy used by modern devices such as GPUs.

A4. The runtime has negligible overhead. A device begins processing a task as soon as its input tensors are available and the device has finished previous tasks.
To simulate an execution, the simulator first builds a task graph, which includes all tasks derived from operations and dependencies between tasks, and then runs a simulation algorithm to generate an execution timeline. Section 5.1 describes task graph construction. Section 5.2 introduces a full simulation algorithm that builds timelines from scratch. Finally, Section 5.3 introduces an alternative delta simulation algorithm that generates a new timeline using incremental updates to a previous one.
5.1 Task Graph
A task graph models dependencies between individual tasks derived from operations and can also represent task execution timelines on individual devices. To unify the abstraction, we treat each hardware connection between devices as a communication device, and each data transfer as a communication task. Note that devices and hardware connections are modeled as separate devices. This allows computation (i.e., normal tasks) and communication (i.e., communication tasks) to be overlapped if possible.
Given an operator graph $\mathcal{G}$, a device topology $\mathcal{D}$, and a parallelization strategy $\mathcal{S}$, we use the following steps to construct a task graph $\mathcal{T} = (\mathcal{T}_N, \mathcal{T}_E)$, where each node $t \in \mathcal{T}_N$ is a task (i.e., a normal task or a communication task) and each edge $(t_i, t_j) \in \mathcal{T}_E$ is a dependency that task $t_j$ cannot start until task $t_i$ is completed. Note that the edges in the task graph are simply ordering constraints; they do not indicate data flow, as all data flow is included in the task graph as communication tasks.

For each operation $o_i \in \mathcal{G}$ with parallelization configuration $c_i$, we add tasks $t_{i:1}, \ldots, t_{i:|c_i|}$ into $\mathcal{T}_N$.

For each tensor $(o_i, o_j) \in \mathcal{G}$, which is an output of operation $o_i$ and an input of $o_j$, we compute the output sub-tensors written by tasks $t_{i:k_i}$ ($1 \le k_i \le |c_i|$) and the input sub-tensors read by tasks $t_{j:k_j}$ ($1 \le k_j \le |c_j|$). For every task pair $t_{i:k_i}$ and $t_{j:k_j}$ with shared tensors, if the two tasks are assigned to the same device, we add an edge $(t_{i:k_i}, t_{j:k_j})$ into $\mathcal{T}_E$, indicating a dependency between the two tasks, and no communication task is needed. If $t_{i:k_i}$ and $t_{j:k_j}$ with shared tensors are assigned to different devices, we add a communication task $t_c$ to $\mathcal{T}_N$ and two edges $(t_{i:k_i}, t_c)$ and $(t_c, t_{j:k_j})$ to $\mathcal{T}_E$. The new task $t_c$ is assigned to the communication device between the devices that perform $t_{i:k_i}$ and $t_{j:k_j}$.
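The second construction step can be sketched as follows. This is a simplified illustration rather than FlexFlow's implementation: sub-tensors are reduced to one-dimensional half-open index ranges, and the `Task` class, the `connect` helper, and the `link(...)` device naming are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    device: str
    kind: str = "normal"          # "normal" or "comm"


def overlaps(a, b):
    """True if half-open index ranges a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def connect(producers, consumers, nodes, edges):
    """Add dependencies for one tensor flowing from producers to consumers.

    producers: list of (Task, range written); consumers: list of (Task, range read).
    """
    for p, written in producers:
        for c, read in consumers:
            if not overlaps(written, read):
                continue                      # no shared sub-tensor
            if p.device == c.device:
                edges.append((p, c))          # same device: plain dependency
            else:
                link = f"link({p.device},{c.device})"   # communication device
                comm = Task(f"{p.name}->{c.name}", link, "comm")
                nodes.append(comm)
                edges.append((p, comm))
                edges.append((comm, c))


# Operation o1 is split in two along 8 samples; o2 runs as a single task.
t11, t12 = Task("t1:1", "gpu0"), Task("t1:2", "gpu1")
t21 = Task("t2:1", "gpu0")
nodes, edges = [t11, t12, t21], []
connect([(t11, (0, 4)), (t12, (4, 8))], [(t21, (0, 8))], nodes, edges)
```

In this example the consumer shares a device with the first producer, so that pair gets a direct edge, while the second producer reaches it through one communication task placed on the link between the two GPUs.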
Table 2: Properties for each task in the task graph.
Property | Description
Properties set in graph construction
exeTime | The elapsed time to execute the task.
device | The assigned device of the task.
$\mathcal{I}(t)$ | $\{t_{in} \mid (t_{in}, t) \in \mathcal{T}_E\}$, the tasks with an edge into $t$.
$\mathcal{O}(t)$ | $\{t_{out} \mid (t, t_{out}) \in \mathcal{T}_E\}$, the tasks with an edge from $t$.
Properties set in simulation
readyTime | The time when the task is ready to run.
startTime | The time when the task starts to run.
endTime | The time when the task is completed.
preTask | The previous task performed on device.
nextTask | The next task performed on device.
Internal properties used by the full simulation algorithm
state | Current state of the task, which is one of NOTREADY, READY, and COMPLETE.
Figure 5(a) shows an example parallelization strategy for a standard 3-layer recurrent neural network consisting of an embedding layer, a recurrent layer, and a linear layer. The parallelization strategy represents commonly used model parallelism that assigns operations in each layer to a dedicated GPU. Figure 5(b) shows the corresponding task graph. Each square and hexagon indicate a normal task and a communication task, respectively, and each directed edge represents a dependency between tasks.
Table 2 lists the properties for each task in the task graph. The exeTime property is set during the graph construction. For a normal task derived from an operation, its exeTime is the time to execute the task on the given device and is estimated by running the task multiple times on the device and measuring the average execution time (assumption A1). A task’s exeTime is cached, and all future tasks with the same operation type and output size will use the cached value without rerunning the task. For a communication task, its exeTime is the time to transfer a tensor (of size $s$) between devices with bandwidth $b$ and is estimated as $s/b$ (assumption A2).
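The two estimation rules can be sketched as below. This is a minimal illustration, not FlexFlow's profiler: the `ExeTimeEstimator` class is hypothetical, including the device kind in the cache key is an assumption, and the `work` function is a stand-in for a real DNN kernel.

```python
import time


class ExeTimeEstimator:
    def __init__(self, repeats=3):
        self.repeats = repeats
        self.cache = {}        # (op type, output shape, device kind) -> seconds
        self.measurements = 0  # how many times a task was actually profiled

    def normal_task(self, op_type, output_shape, device_kind, run_once):
        """Average several runs once, then reuse the cached value (assumption A1)."""
        key = (op_type, tuple(output_shape), device_kind)
        if key not in self.cache:
            start = time.perf_counter()
            for _ in range(self.repeats):
                run_once()                     # execute the task on the device
            self.cache[key] = (time.perf_counter() - start) / self.repeats
            self.measurements += 1
        return self.cache[key]

    @staticmethod
    def comm_task(size, bandwidth):
        """Assumption A2: the link is fully utilized, so time = size / bandwidth."""
        return size / bandwidth


est = ExeTimeEstimator()
work = lambda: sum(range(10_000))              # stand-in for a real DNN kernel
first = est.normal_task("matmul", (32, 512), "P100", work)
second = est.normal_task("matmul", (32, 512), "P100", work)   # cache hit
transfer = est.comm_task(size=64e6, bandwidth=16e9)           # 0.004 seconds
```

The second `normal_task` call returns the cached value without running `work` again, which is what lets a graph with many identical tasks be costed after only a few measurements.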
5.2 Full Simulation Algorithm
We now describe a full simulation algorithm that we will use as a baseline for comparisons with our delta simulation algorithm. Algorithm 1 shows the pseudocode. It first builds a task graph using the method described in Section 5.1 and then sets the properties for each task using a variant of Dijkstra’s shortest-path algorithm [14]. Tasks are enqueued into a global priority queue when ready (i.e., all predecessor tasks are completed) and are dequeued in increasing order by their readyTime. Therefore, when a task $t$ is dequeued, all tasks with an earlier readyTime have been scheduled, and we can set the properties for task $t$ while maintaining the FIFO scheduling order (assumption A3). Figure 5(c) shows the execution timeline of the example parallelization strategy.
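The procedure described above can be sketched in a few lines. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the `Task` class, `add_edge`, and the four-task example with its execution times are hypothetical, and the task graph is assumed to be built already.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(eq=False)
class Task:
    name: str
    device: str
    exe_time: float
    preds: list = field(default_factory=list, repr=False)   # tasks t depends on
    succs: list = field(default_factory=list, repr=False)   # tasks depending on t
    ready_time: float = 0.0
    start_time: float = 0.0
    end_time: float = 0.0


def add_edge(src, dst):
    src.succs.append(dst)
    dst.preds.append(src)


def full_simulate(tasks):
    """Fill in ready/start/end times and return the simulated makespan."""
    device_free = {}                               # device -> end time of its last task
    pending = {t: len(t.preds) for t in tasks}     # unfinished predecessors
    queue, tie = [], 0                             # heap ordered by readyTime
    for t in tasks:
        t.ready_time = 0.0
        if not t.preds:
            heapq.heappush(queue, (t.ready_time, tie, t))
            tie += 1
    while queue:
        _, _, t = heapq.heappop(queue)
        t.start_time = max(t.ready_time, device_free.get(t.device, 0.0))
        t.end_time = t.start_time + t.exe_time
        device_free[t.device] = t.end_time
        for n in t.succs:
            n.ready_time = max(n.ready_time, t.end_time)
            pending[n] -= 1
            if pending[n] == 0:                    # all predecessors complete
                heapq.heappush(queue, (n.ready_time, tie, n))
                tie += 1
    return max(t.end_time for t in tasks)


# a and d run on gpu0; a's output is sent over a link to b on gpu1.
a = Task("a", "gpu0", 2.0)
d = Task("d", "gpu0", 1.5)
c = Task("a->b", "link(gpu0,gpu1)", 1.0)
b = Task("b", "gpu1", 3.0)
add_edge(a, d); add_edge(a, c); add_edge(c, b)
makespan = full_simulate([a, b, c, d])             # 6.0
```

Because the link is modeled as its own device, the transfer `a->b` overlaps with `d` running on `gpu0`, and the makespan is set by the chain a, transfer, b.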
5.3 Delta Simulation Algorithm
FlexFlow uses an MCMC search algorithm that proposes a new parallelization strategy by changing the parallelization configuration of a single operation in the previous strategy (see Section 6.2). As a result, in the common case, most of the execution timeline does not change from one simulated strategy to the next. Based on this observation, we introduce a delta simulation algorithm that starts from a previous task graph and only re-simulates tasks involved in the portion of the execution timeline that changes, an optimization that dramatically speeds up the simulator, especially for strategies for large distributed machines. The full and delta simulation algorithms always produce the same timeline for a given task graph.
Algorithm 2 shows the pseudocode for the delta simulation algorithm. It first updates tasks and dependencies in the task graph and enqueues all modified tasks into a global priority queue (lines 4-5). Similar to the Bellman-Ford shortest-path algorithm [14], the delta simulation algorithm iteratively dequeues updated tasks and propagates the updates to subsequent tasks (lines 6-14).
For the example in Figure 5, consider a new parallelization strategy derived from the original strategy (Figure 5(a)) by only reducing the parallelism of one operation to 1 (i.e., setting $|c_i| = 1$ for that operation $o_i$). Figure 5(d) shows the task graph for the new parallelization strategy, which can be generated from the original task graph (in Figure 5(c)) by updating the simulation properties of tasks in the grey area.
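The idea of propagating only the changed portion of a timeline can be sketched as follows. This is a deliberately simplified stand-in for Algorithm 2, not a reproduction of it: it assumes that only task execution times change, that the set of tasks and edges is unchanged, and that the order of tasks on each device stays fixed. All names and numbers are hypothetical.

```python
import heapq


class Task:
    def __init__(self, name, device, exe_time):
        self.name, self.device, self.exe_time = name, device, exe_time
        self.preds, self.succs = [], []            # dependencies in both directions
        self.pre_task = self.next_task = None      # neighbours on the same device
        self.ready_time = self.start_time = self.end_time = 0.0


def add_edge(src, dst):
    src.succs.append(dst)
    dst.preds.append(src)


def full_simulate(tasks):
    """Baseline: build the timeline from scratch, recording per-device order."""
    last, pending, queue, tie = {}, {}, [], 0
    for t in tasks:
        t.ready_time, t.pre_task, t.next_task = 0.0, None, None
        pending[t] = len(t.preds)
        if not t.preds:
            heapq.heappush(queue, (0.0, tie, t))
            tie += 1
    while queue:
        _, _, t = heapq.heappop(queue)
        prev = last.get(t.device)
        if prev is not None:
            t.pre_task, prev.next_task = prev, t
        t.start_time = max(t.ready_time, prev.end_time if prev else 0.0)
        t.end_time = t.start_time + t.exe_time
        last[t.device] = t
        for n in t.succs:
            n.ready_time = max(n.ready_time, t.end_time)
            pending[n] -= 1
            if pending[n] == 0:
                heapq.heappush(queue, (n.ready_time, tie, n))
                tie += 1
    return max(t.end_time for t in tasks)


def delta_simulate(tasks, modified):
    """Re-simulate only tasks reachable from `modified` (device order kept fixed)."""
    queue = [(t.start_time, i, t) for i, t in enumerate(modified)]
    heapq.heapify(queue)
    tie, touched = len(queue), set()
    while queue:
        _, _, t = heapq.heappop(queue)
        touched.add(t.name)
        t.ready_time = max((p.end_time for p in t.preds), default=0.0)
        prev_end = t.pre_task.end_time if t.pre_task else 0.0
        t.start_time = max(t.ready_time, prev_end)
        new_end = t.start_time + t.exe_time
        if new_end != t.end_time:                  # propagate only real changes
            t.end_time = new_end
            for n in t.succs + ([t.next_task] if t.next_task else []):
                heapq.heappush(queue, (n.start_time, tie, n))
                tie += 1
    return max(t.end_time for t in tasks), touched


def example():
    a, d = Task("a", "gpu0", 2.0), Task("d", "gpu0", 1.5)
    c, b = Task("a->b", "link(gpu0,gpu1)", 1.0), Task("b", "gpu1", 3.0)
    e = Task("e", "gpu2", 4.0)                     # unaffected by the change below
    add_edge(a, d); add_edge(a, c); add_edge(c, b)
    return [a, b, c, d, e]


tasks = example()
before = full_simulate(tasks)                      # 6.0
tasks[0].exe_time = 1.0                            # the new strategy makes task a faster
after, touched = delta_simulate(tasks, [tasks[0]])  # 5.0; task e is never revisited
```

In this example the incremental pass reaches the same makespan as a fresh full simulation with the new execution time, while leaving the independent task `e` untouched. A full delta simulator would also have to handle added or removed tasks and changes in per-device ordering, which this sketch does not attempt.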
6 Execution Optimizer
Table 3: DNNs used in our experiments.
DNN | Description | Dataset | Reported Acc. | Our Acc.
Convolutional Neural Networks (CNNs)
AlexNet [28] | A 12-layer CNN | Synthetic data | - | -
Inception-v3 [40] | A 102-layer CNN with Inception modules [39] | ImageNet [36] | 78.0%^{a} | 78.0%^{a}
ResNet-101 [22] | A 101-layer residual CNN with shortcut connections | ImageNet [36] | 76.4%^{a} | 76.5%^{a}
Recurrent Neural Networks (RNNs)
RNNTC [26] | 4 recurrent layers followed by a softmax layer | Movie Reviews [1] | 79.8% | 80.3%
RNNLM [43] | 2 recurrent layers followed by a softmax layer | Penn Treebank [31] | 78.4^{b} | 76.1^{b}
NMT [42] | 4 recurrent layers followed by an attention and a softmax layer | WMT English-German [3] | 19.67^{c} | 19.85^{c}
^{a} top-1 accuracy for single crop on the validation dataset (higher is better).
^{b} word-level test perplexities on the Penn Treebank dataset (lower is better).
^{c} BLEU scores [34] on the test dataset (higher is better).
This section describes the execution optimizer that takes an operator graph and a device topology as inputs and automatically finds an efficient parallelization strategy. Using the simulator as an oracle, FlexFlow transforms the parallelization optimization problem into a cost minimization problem, namely minimizing the predicted execution time. The primary advantage of this approach is that it avoids explicitly encoding the trade-offs between interdependent optimizations (e.g., reducing data transfers vs. balancing workload distributions) and simply focuses on minimizing the application’s overall execution time.
Finding the optimal parallelization strategy is NP-hard, by an easy reduction from minimum makespan [29]. In addition, as described in Section 4, the number of possible strategies is exponential in the number of operations in the operator graph, which makes it intractable to exhaustively enumerate the search space. To find a low-cost strategy, FlexFlow uses a cost minimization search procedure to heuristically explore the space and returns the best strategy discovered.
6.1 MCMC Sampling
This section briefly introduces the MCMC sampling method used by the execution optimizer. MCMC sampling is a technique for obtaining samples from a probability distribution so that higher-probability samples are visited proportionately more often than low-probability samples. A common method (described in [17]) to transform a cost function $cost(\cdot)$ into a probability distribution $p(\cdot)$ is the following, where $\beta$ is a constant that can be chosen:
$$p(\mathcal{S}) \propto \exp\left(-\beta \cdot cost(\mathcal{S})\right) \tag{1}$$
MCMC works by starting at any point in the search space (a random point, or perhaps a well-known starting point) and then generating a sequence of points with the guarantee that in the limit the set of points visited approaches the distribution given by $p(\cdot)$. In our setting, we begin with some parallelization strategy $\mathcal{S}_0$ and then generate a sequence of strategies $\mathcal{S}_0, \mathcal{S}_1, \ldots$.
We use the Metropolis-Hastings algorithm [21] for generating Markov chains, which maintains a current strategy $\mathcal{S}$ and proposes a modified strategy $\mathcal{S}^*$ from a proposal distribution $q(\mathcal{S}^* \mid \mathcal{S})$. If the proposal is accepted, $\mathcal{S}^*$ becomes the new current strategy; otherwise another strategy based on $\mathcal{S}$ is proposed. This process is repeated indefinitely (e.g., until a time budget is exhausted). If the proposal distribution is symmetric, $q(\mathcal{S}^* \mid \mathcal{S}) = q(\mathcal{S} \mid \mathcal{S}^*)$, the acceptance criterion for a new strategy is the following:
$$\alpha(\mathcal{S} \rightarrow \mathcal{S}^*) = \min\left(1, \frac{p(\mathcal{S}^*)}{p(\mathcal{S})}\right) = \min\left(1, \exp\left(\beta \cdot \left(cost(\mathcal{S}) - cost(\mathcal{S}^*)\right)\right)\right) \tag{2}$$
The acceptance criterion has several important properties. If $\mathcal{S}^*$ has a lower cost than $\mathcal{S}$, then $\mathcal{S}^*$ is always accepted. If $\mathcal{S}^*$ has a higher cost than $\mathcal{S}$, then $\mathcal{S}^*$ may still be accepted with a probability that decreases as a function of the difference between $cost(\mathcal{S}^*)$ and $cost(\mathcal{S})$. Intuitively, MCMC tends to behave as a greedy search algorithm, preferring to move towards lower cost whenever that is readily available, but can also escape local minima.
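The acceptance rule for a symmetric proposal distribution can be written directly as code. The sketch below is illustrative only: the function names are hypothetical, and the cost values and the choice of the constant (called `beta` here) are arbitrary.

```python
import math
import random


def acceptance_probability(cost_current, cost_proposed, beta):
    """Metropolis-Hastings acceptance probability for a symmetric proposal."""
    if cost_proposed <= cost_current:
        return 1.0                      # improvements are always accepted
    return math.exp(beta * (cost_current - cost_proposed))


def accept(cost_current, cost_proposed, beta, rng):
    """Draw a uniform number and accept with the probability above."""
    return rng.random() < acceptance_probability(cost_current, cost_proposed, beta)


rng = random.Random(0)
better = accept(10.0, 8.0, beta=1.0, rng=rng)           # always True
p_worse = acceptance_probability(10.0, 11.0, beta=1.0)  # exp(-1), about 0.37
```

Handling the lower-cost case before calling `math.exp` also avoids an overflow when the proposed cost is far below the current one. A larger constant makes the search greedier, since the probability of accepting a worse strategy shrinks faster with the cost difference.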
6.2 Search Algorithm
Our method for generating proposals is simple: an operation in the current parallelization strategy is selected at random, and its parallelization configuration is replaced by a random configuration. Our definition of the proposal distribution satisfies the symmetry property, $q(\mathcal{S}^* \mid \mathcal{S}) = q(\mathcal{S} \mid \mathcal{S}^*)$, since, for any operation, its configurations are selected with the same probability.
We use existing strategies (e.g., data parallelism, expert-designed strategies) as well as randomly generated strategies as the initial candidates for the search algorithm. For each initial strategy, the search algorithm iteratively proposes new candidates until one of the following two criteria is satisfied: (1) the search time budget for the current initial strategy is exhausted; or (2) the search procedure cannot further improve the best discovered strategy for half of the search time.
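Putting the proposal method and the two stopping criteria together gives a loop like the one sketched below. Several things here are assumptions made for the example: the budget is counted in iterations rather than wall-clock time, a strategy is a plain dictionary from operation name to configuration, and `toy_cost` is an invented stand-in for the execution simulator.

```python
import math
import random


def mcmc_search(initial, configs_per_op, cost, beta, budget, rng):
    """Search from one initial strategy; returns the best strategy and its cost."""
    current, current_cost = dict(initial), cost(initial)
    best, best_cost = dict(current), current_cost
    last_improvement = 0
    ops = sorted(configs_per_op)
    for step in range(1, budget + 1):               # criterion (1): budget exhausted
        # Proposal: re-draw the configuration of one randomly chosen operation.
        proposal = dict(current)
        op = rng.choice(ops)
        proposal[op] = rng.choice(configs_per_op[op])
        proposal_cost = cost(proposal)
        delta = current_cost - proposal_cost
        if delta >= 0 or rng.random() < math.exp(beta * delta):
            current, current_cost = proposal, proposal_cost
        if current_cost < best_cost:
            best, best_cost = dict(current), current_cost
            last_improvement = step
        if step - last_improvement >= budget / 2:   # criterion (2): no recent gain
            break
    return best, best_cost


# Hypothetical search space: one parallelism degree per operation.
configs_per_op = {"conv": [1, 2, 4], "fc": [1, 2, 4]}


def toy_cost(strategy):
    # Compute time shrinks with the degree; a made-up sync cost grows with it.
    return sum(8.0 / k + 1.5 * (k - 1) for k in strategy.values())


initial = {"conv": 1, "fc": 1}
best, best_cost = mcmc_search(initial, configs_per_op, toy_cost,
                              beta=1.0, budget=200, rng=random.Random(0))
```

The best strategy is tracked separately from the current one because the chain may accept a worse strategy after visiting a good one. Running the function once per initial candidate and keeping the overall best mirrors the use of several starting strategies.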
7 FlexFlow Runtime
We found that existing deep learning systems (e.g., TensorFlow [7], PyTorch [6], Caffe2 [2], and MXNet [12]) only support parallelizing an operation in the batch dimension through data parallelism, and it is non-trivial to parallelize an operation in other dimensions or combinations of several dimensions in these systems. In addition, we are not aware of any existing system that supports parallelization at the granularity of individual operations.
To support parallelizing DNN models using any strategy defined in our parallelization space (see Section 4), we implemented the FlexFlow distributed runtime in Legion [10], a high-performance parallel runtime for distributed heterogeneous architectures, and use cuDNN [13] and cuBLAS [4] as the underlying libraries for processing DNN operations. We use the Legion high-dimensional partitioning interface [41] to support parallelizing an operation in any combination of the parallelizable dimensions and use Legion’s fine-grain control mechanism to control parallelization at the granularity of each operation.
The key difference between the FlexFlow runtime and existing systems is that FlexFlow supports parallelizing an operation in any combination of the parallelizable dimensions and controls parallelization at the granularity of individual operations.
8 Evaluation
This section evaluates the performance of FlexFlow on six real-world DNN benchmarks and two GPU clusters. Section 8.1 describes the experimental setup for the evaluation. Section 8.2 compares FlexFlow with state-of-the-art parallelization approaches. Section 8.3 evaluates the accuracy and efficiency of the execution simulator. Sections 8.4 and 8.5 evaluate the quality of the best strategies discovered by the execution optimizer and discuss two of the best discovered strategies.
8.1 Experimental Setup
Table 3 summarizes the DNNs used in our experiments. AlexNet, Inception-v3, and ResNet-101 are three CNNs that achieved the best accuracy in the ILSVRC competitions [35]. For AlexNet, the per-iteration training time is smaller than the time to load training data from disk. We follow the suggestions in [5] and use synthetic data to benchmark the performance of AlexNet. For all other experiments, the training data is loaded from disk in the training procedure.
RNNTC, RNNLM, and NMT are sequence-to-sequence RNN models for text classification, language modeling, and neural machine translation, respectively. RNNTC uses four LSTM layers with a hidden size of 1024. RNNLM uses two LSTM layers with a hidden size of 2048. Both RNN models include a softmax linear layer after the last LSTM layer. NMT includes an encoder and a decoder, both of which consist of 2 LSTM layers with a hidden size of 1024. To improve model accuracy, we also use an attention layer [9] on top of the last decoder LSTM layer. Figure 14 illustrates the structure of the NMT model. For all three RNN models, we set the number of unrolling steps for each recurrent layer to 40.
We follow prior work [28, 40, 22, 26, 43, 42] to construct operator graphs and set hyperparameters (e.g., learning rates, weight decays). We use synchronous training and a batch size of 64 for all DNN benchmarks, except for AlexNet, which uses a batch size of 256.
To evaluate the performance of FlexFlow with different device topologies, we performed the experiments on two GPU clusters, as shown in Figure 6. The first cluster contains 4 compute nodes, each of which is equipped with two Intel 10-core E5-2600 CPUs, 256GB main memory, and four NVIDIA Tesla P100 GPUs. GPUs on the same node are connected by NVLink, and nodes are connected over 100 GB/s EDR Infiniband. The second cluster consists of 16 nodes, each of which is equipped with two Intel 10-core E5-2680 CPUs, 256GB main memory, and four NVIDIA Tesla K80 GPUs. Adjacent GPUs are connected by a separate PCIe switch, and all GPUs are connected to CPUs through a shared PCIe switch. Compute nodes in the cluster are connected over 56 GB/s EDR Infiniband.
Unless otherwise stated, we set 30 minutes as the time budget for the execution optimizer and use data parallelism and a randomly generated parallelization strategy as the initial candidates for the search algorithm. As shown in Section 8.3.2, the search procedure terminates in a few minutes for most executions.
8.2 Parallelization Performance
8.2.1 Per-iteration Performance
We compare the per-iteration training performance of FlexFlow with the following baselines. Data parallelism is commonly used in existing deep learning systems [7, 2, 6]. To control for implementation differences, we ran data parallelism experiments in TensorFlow r1.7, PyTorch v0.3, and our implementation and compared the performance numbers. Compared to TensorFlow and PyTorch, FlexFlow achieves the same or better performance numbers on all six DNN benchmarks, and therefore we report the data parallelism performance achieved by FlexFlow in the experiments.
Expert-designed strategies optimize parallelization based on domain experts’ knowledge and experience. For CNNs, [27] uses data parallelism for parallelizing convolutional and pooling layers and switches to model parallelism for densely-connected layers. For RNNs, [42] uses data parallelism that replicates the entire operator graph on each compute node and uses model parallelism that assigns operations with the same depth to the same GPU on each node. These expert-designed strategies are used as a baseline in our experiments. Model parallelism only exposes limited parallelism by itself, and we compare against model parallelism as a part of these expert-designed strategies.
Figure 7 shows the per-iteration training performance on all six DNN benchmarks. For ResNet-101, FlexFlow finds strategies similar to data parallelism (except using model parallelism on a single node for the last fully-connected layer) and therefore achieves similar parallelization performance. For other DNN benchmarks, FlexFlow finds more efficient strategies than the baselines and achieves 1.3-3.3× speedup. Note that FlexFlow performs the same operations as data parallelism and expert-designed strategies, and the performance improvement is achieved by using faster parallelization strategies. We found that the parallelization strategies discovered by FlexFlow have two advantages over data parallelism and expert-designed strategies.
Reducing overall communication costs. Similar to existing deep learning systems, the FlexFlow distributed runtime supports overlapping data transfers with computation to hide communication overheads. However, as we scale the number of devices, the communication overheads increase, but the computation time used to hide communication remains constant. Therefore, reducing overall communication costs is beneficial for large-scale distributed training. Figure 8b shows that, to parallelize the NMT model on 64 K80 GPUs (16 nodes), FlexFlow reduces the per-iteration data transfers by 2-5.5× compared to other parallelization approaches.
Reducing overall task computation time. Data parallelism always parallelizes an operation in the batch dimension. However, as reported in [25], parallelizing an operation through different dimensions can result in different task computation time. For the matrix multiplication operation in the NMT model, parallelizing it in the channel dimension reduces the operation’s overall computation time by 38% compared to parallelizing the operation in the batch dimension. Figure 8c shows that FlexFlow reduces the overall task computation time by 20% compared to data parallelism for the NMT model. The expert-designed strategy achieves slightly better total task computation time than FlexFlow. However, this is achieved by using model parallelism on each node, which disables any parallelism within each operation and results in imbalanced workloads. As a result, the expert-designed strategy achieves even worse execution performance than data parallelism (see Figure 8a). FlexFlow reduces the overall task computation time while enabling parallelism within an operation and maintaining load balance.
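To make the two partitioning choices for a matrix multiplication concrete, the following sketch splits Y = XW either in the batch dimension (rows of X) or in the channel dimension (columns of W). It is a plain-Python illustration with hypothetical helper names and toy matrices; it shows only how the work and the parameters are divided and does not model the timing differences reported above.

```python
def matmul(x, w):
    """Dense product of an (n x k) and a (k x m) list-of-lists matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_batch(x, w, parts):
    """Batch dimension: each task gets a slice of rows of x and all of w."""
    step = len(x) // parts  # assumes parts divides the batch size
    return [(x[i:i + step], w) for i in range(0, len(x), step)]


def split_channel(x, w, parts):
    """Channel dimension: each task gets all of x and a column slice of w."""
    step = len(w[0]) // parts  # assumes parts divides the channel count
    return [(x, [row[j:j + step] for row in w])
            for j in range(0, len(w[0]), step)]


def merge_batch(outputs):
    """Stack the per-task row blocks back into one matrix."""
    return [row for out in outputs for row in out]


def merge_channel(outputs):
    """Concatenate the per-task column blocks row by row."""
    return [sum((out[i] for out in outputs), [])
            for i in range(len(outputs[0]))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # toy batch of 4 samples
    w = [[1, 0, 2, 0], [0, 1, 0, 2]]       # toy weights, 4 output channels
    by_batch = [matmul(a, b) for a, b in split_batch(x, w, 2)]
    by_channel = [matmul(a, b) for a, b in split_channel(x, w, 2)]
    print(merge_batch(by_batch) == merge_channel(by_channel) == matmul(x, w))
```

Both partitionings reproduce the same output, but under the batch split every task holds the full weight matrix, whereas under the channel split each task holds only its slice of the weights.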
8.2.2 End-to-end Performance
FlexFlow performs the same computation as other deep learning systems for a DNN model and therefore achieves the same model accuracy. Table 3 verifies that FlexFlow achieves the state-of-the-art accuracies on the DNN benchmarks used in the experiments.
In this experiment, we compare the end-to-end training performance between FlexFlow and TensorFlow on Inception-v3. We train Inception-v3 on the ImageNet dataset until the model reaches the single-crop top-1 accuracy of 72% on the validation set. The training processes in both frameworks use stochastic gradient descent (SGD) with a learning rate of 0.045 and a weight decay of 0.0001. Figure 9 illustrates the training curves of the two systems on Inception-v3 and shows that FlexFlow reduces the end-to-end training time by 38% compared to TensorFlow.
Table 4: End-to-end search time (in seconds) of the execution optimizer with the full and delta simulation algorithms.

Num.  |        AlexNet        |        ResNet         |       Inception       |         RNNTC         |         RNNLM         |          NMT
GPUs  |  Full  Delta  Speedup |  Full  Delta  Speedup |  Full  Delta  Speedup |  Full  Delta  Speedup |  Full  Delta  Speedup |  Full  Delta  Speedup
4     |  0.11   0.04      2.9 |   1.4    0.4      3.2 |    14    4.1      3.4 |    16    7.5      2.2 |    21    9.2      2.3 |    40     16      2.5
8     |  0.40   0.13      3.0 |   4.5    1.4      3.2 |    66     17      3.9 |    91     39      2.3 |    76     31      2.5 |   178     65      2.7
16    |   1.4   0.48      2.9 |    22    7.3      3.1 |   388     77      5.0 |   404    170      2.4 |   327    121      2.7 |   998    328      3.0
32    |   5.3    1.8      3.0 |   107     33      3.2 |  1746    298      5.9 |  1358    516      2.6 |  1102    342      3.2 |  2698    701      3.8
64    |    18    5.9      3.0 |   515    158      3.3 |  8817   1278      6.9 |  4404   1489      3.0 |  3406    969      3.6 |  8982   2190      4.1
8.2.3 Automated Parallelization Optimizer
We compare against two automated frameworks that find parallelization strategies in a limited search space.
REINFORCE [33] uses reinforcement learning to learn device placement for model parallelism. We are not aware of any publicly available implementation of REINFORCE, so we compare against the learned device placement for Inception-v3 and NMT, as reported in [33].
Figure 10a compares the training throughput of the strategies found by FlexFlow and REINFORCE for four K80 GPUs on a single node. The parallelization strategies found by FlexFlow achieve 3.4-3.8× speedup compared to REINFORCE. We attribute the performance improvement to the larger search space explored by FlexFlow.
Besides improving training performance, FlexFlow has two additional advantages over REINFORCE. First, REINFORCE requires executing each strategy in the hardware environment to get reward signals and takes 12-27 hours to find the best placement [33], while the FlexFlow execution optimizer finds efficient parallelization strategies for these executions in 14-40 seconds. Second, REINFORCE uses up to 160 compute nodes (with 4 GPUs on each node) to find the placement in time, while FlexFlow uses a single compute node to run the execution optimizer.
OptCNN [25] optimizes parallelization for DNNs with linear operator graphs. OptCNN assumes that different operations in an operator graph cannot be performed in parallel and estimates a DNN’s execution time as the sum of the operations’ computation time and synchronization time and the tensors’ data transfer time. This assumption allows OptCNN to use a dynamic programming algorithm to find an efficient parallelization strategy.
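The additive estimate described above can be written down directly. The sketch below is an illustration of the sum-of-parts idea only, not OptCNN's code; the function name, the dictionary layout, and all numbers are hypothetical.

```python
def additive_estimate(op_costs, transfer_costs):
    """Sum-of-parts estimate: operations are assumed to run one after another.

    op_costs:       {op_name: (compute_time, sync_time)}
    transfer_costs: {(producer, consumer): transfer_time}
    """
    compute_and_sync = sum(c + s for c, s in op_costs.values())
    return compute_and_sync + sum(transfer_costs.values())


if __name__ == "__main__":
    # Hypothetical costs for a graph with two independent branches.
    op_costs = {"split": (1, 0), "b1": (4, 1), "b2": (4, 1), "concat": (1, 0)}
    transfer_costs = {
        ("split", "b1"): 1, ("split", "b2"): 1,
        ("b1", "concat"): 1, ("b2", "concat"): 1,
    }
    print(additive_estimate(op_costs, transfer_costs))
```

Because every term is simply added, the estimate is the same whether or not `b1` and `b2` could run concurrently on different devices, which is why this model cannot reward parallelism across operations.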
We compare the strategies found by FlexFlow and OptCNN for different DNNs on 16 P100 GPUs. The frameworks found the same parallelization strategies for AlexNet and ResNet with linear operator graphs and found different strategies for the other DNNs as shown in Figure 10b. For these DNNs with nonlinear operator graphs, FlexFlow achieves 1.2-1.6× speedup compared to OptCNN by using parallelization strategies that exploit parallelism across different operations. We show two examples in Section 8.5.
8.3 Execution Simulator
We evaluate the performance of the simulator using two metrics: simulator accuracy and simulator execution time.
8.3.1 Simulator Accuracy
In this experiment, we compare the estimated execution time predicted by the execution simulator with the real execution time measured by actual executions. Figure 11 shows the results for different DNNs and different available devices. The dashed lines indicate a relative difference of 0% and 30%, respectively, which encompasses the variance between actual and predicted execution time. In addition, for different parallelization strategies with the same operator graph and device topology (i.e., points of the same shape in the figure), their simulated execution time preserves actual execution time ordering, which shows that simulated execution time is an appropriate metric to evaluate the performance of different strategies.
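The two properties checked in this experiment, a bounded relative difference and a preserved ordering, can be expressed as two small functions. This is a sketch with hypothetical names and made-up measurements; it assumes the relative difference is taken with respect to the actual execution time.

```python
def relative_difference(simulated, actual):
    """Relative gap between simulated and actual time (relative to actual)."""
    return abs(simulated - actual) / actual


def preserves_ordering(points):
    """True if ranking strategies by simulated time matches ranking by actual time.

    points: list of (simulated_time, actual_time), one pair per strategy.
    Ties are not treated specially in this sketch.
    """
    indices = range(len(points))
    by_simulated = sorted(indices, key=lambda i: points[i][0])
    by_actual = sorted(indices, key=lambda i: points[i][1])
    return by_simulated == by_actual


if __name__ == "__main__":
    # Made-up (simulated, actual) times for three strategies of one model.
    points = [(1.0, 1.2), (2.0, 2.1), (0.5, 0.6)]
    print(all(relative_difference(s, a) <= 0.30 for s, a in points))
    print(preserves_ordering(points))
```

Order preservation is the property the search relies on: the optimizer only needs the simulator to rank candidate strategies correctly, not to predict their absolute run times exactly.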
8.3.2 Simulator Execution Time
Figure 12 shows the search performance with different simulation algorithms for finding a strategy for the NMT model on 16 P100 GPUs on 4 nodes. The full and delta simulation algorithms terminate in 16 and 6 minutes, respectively. If the allowed time budget is less than 8 minutes, the full simulation algorithm will find a worse strategy than the delta simulation algorithm.
We compare the end-to-end search time of the execution optimizer with different simulation algorithms. For a given DNN model and device topology, we measure the average execution time of the optimizer using 10 random initial strategies. The results are shown in Table 4. The delta simulation algorithm is 2.2-6.9× faster than the full simulation algorithm. Moreover, the speedup over the full simulation algorithm increases as we scale the number of devices.
8.4 Search Algorithm
This section evaluates the quality of the best parallelization strategies discovered by the search algorithm.
First, we compare the best discovered strategies with the global optimal strategies for small executions. To obtain a search space of reasonable size, we limit the number of devices to 4 and consider the following two DNNs. LeNet [30] is a 6-layer CNN for image classification. The second DNN is a variant of RNNLM where the number of unrolling steps for each recurrent layer is restricted to 2. The search space for both DNNs contains approximately strategies. We use depth-first search to explore the search space and use A* [14] to prune the search space. Finding the optimal strategies for LeNet and RNNLM took 0.8 and 18 hours, respectively. For both DNNs, FlexFlow finds the global optimal strategy.
Second, we test if the search algorithm returns at least a locally optimal strategy in larger search spaces by comparing the best discovered strategy with all of its neighbors. For this experiment, we consider all six DNNs on 2, 4, and 8 devices, where the number of neighbors remains small enough to exhaustively enumerate them all. All the strategies returned by FlexFlow were locally optimal.
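A local-optimality check of this kind can be sketched as follows. The neighborhood used here, all strategies that differ from the given one in the configuration of exactly one operation, is an assumption made for illustration, since this section does not define it; `configs` and `cost` are likewise hypothetical stand-ins for the per-operation configuration sets and the simulated execution time.

```python
def neighbors(strategy, configs):
    """Yield every strategy that differs from `strategy` in exactly one operation.

    strategy: {op_name: chosen_config}
    configs:  {op_name: list of allowed configs}  (hypothetical search space)
    """
    for op, current in strategy.items():
        for alternative in configs[op]:
            if alternative != current:
                neighbor = dict(strategy)
                neighbor[op] = alternative
                yield neighbor


def is_locally_optimal(strategy, configs, cost):
    """True if no single-operation change yields a strictly lower cost."""
    base = cost(strategy)
    return all(cost(n) >= base for n in neighbors(strategy, configs))


if __name__ == "__main__":
    # Toy search space: the config is just a degree of parallelism.
    configs = {"conv": [1, 2, 4], "fc": [1, 2, 4]}
    toy_cost = lambda s: abs(s["conv"] - 4) + abs(s["fc"] - 1)
    print(is_locally_optimal({"conv": 4, "fc": 1}, configs, toy_cost))
```

With this neighborhood the number of neighbors grows with the sum of the per-operation configuration counts rather than their product, which is what keeps exhaustive enumeration feasible on a small number of devices.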
8.5 Case Studies
We discuss the best strategies discovered by FlexFlow and how they improve parallelization performance.
Inception-v3. Figure 13 shows the best discovered strategy for parallelizing Inception-v3 on four P100 GPUs on a single node, which exploits intra-operation parallelism for operations on the critical path and uses a combination of intra- and inter-operation parallelism for operations on different branches. This results in a well-balanced workload and reduces data transfers for parameter synchronization. Compared to data parallelism, this strategy reduces the parameter synchronization costs by 75% and the per-iteration execution time by 12%.
For parallelizing the same Inception-v3 model on four K80 GPUs with asymmetric connections between GPUs (see Figure 6b), we observe that the best discovered strategy tends to parallelize operations on adjacent GPUs with a direct connection to reduce the communication costs.
NMT. Figure 14 shows the best discovered strategy for parallelizing NMT on four P100 GPUs, which uses various strategies for parallelizing different layers. We briefly discuss the insights from this strategy. First, for a layer with a large number of network parameters and little computation (e.g., the embed layer), it is beneficial to perform the computation on a small number of GPU devices to reduce parameter synchronization costs. Second, for a layer with a large number of network parameters and a heavy computation workload (e.g., the softmax layer), FlexFlow uses parallelism in the channel dimension and assigns the computation for a subset of channels to each task. This allows each device to use a subset of the network parameters, which reduces parameter synchronization costs while maintaining load balance. Third, for multiple recurrent layers (e.g., the LSTM and attention layers), FlexFlow uses concurrency among different layers as well as parallelism within each operation to cooperatively reduce parameter synchronization costs while balancing load.
9 Conclusion
This paper presents FlexFlow, a deep learning system that automatically finds efficient parallelization strategies for DNN applications. FlexFlow uses a guided randomized search procedure to explore the space of possible strategies and includes an execution simulator that is an efficient and accurate predictor of DNN performance. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show FlexFlow significantly outperforms state-of-the-art parallelization approaches.
References
 [1] Movie review data. https://www.cs.cornell.edu/people/pabo/moviereviewdata/, 2005.
 [2] A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai, 2016.
 [3] Conference on machine translation. http://www.statmt.org/wmt16, 2016.
 [4] Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas, 2016.
 [5] TensorFlow Benchmarks. https://www.tensorflow.org/performance/benchmarks, 2017.
 [6] Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org, 2017.

 [7] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.
 [8] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, ICML’16.
 [9] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [10] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
 [11] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.
 [12] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
 [13] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
 [14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
 [15] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.

 [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2009.
 [17] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. CRC press, 1995.
 [18] I. Gog, M. Schwarzkopf, A. Gleave, R. N. M. Watson, and S. Hand. Firmament: Fast, centralized cluster scheduling at scale. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 99–115, Savannah, GA, 2016. USENIX Association.
 [19] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
 [20] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML’14, 2014.
 [21] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
 [23] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [24] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 261–276. ACM, 2009.
 [25] Z. Jia, S. Lin, C. R. Qi, and A. Aiken. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR, abs/1802.04924, 2018.
 [26] Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
 [27] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
 [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.
 [29] S. Lam and R. Sethi. Worst case analysis of two scheduling algorithms. SIAM Journal on Computing, 6, 1977.
 [30] Y. LeCun. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 2015.
 [31] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19.
 [32] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean. A hierarchical model for device placement. In International Conference on Learning Representations, 2018.
 [33] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. Device placement optimization with reinforcement learning. 2017.
 [34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 2002.
 [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
 [37] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
 [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
 [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [41] S. Treichler, M. Bauer, R. Sharma, E. Slaughter, and A. Aiken. Dependent partitioning. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA ’16. ACM, 2016.
 [42] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
 [43] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.