Over the past few years, deep neural networks (DNNs) have driven advances in many practical problems, such as image classification [28, 38], speech recognition [20, 8], machine translation [42, 9], and game playing. Because sophisticated DNN models [23, 40] and larger training datasets [16, 11] have increased the computational requirements to train DNN models, it is now standard practice to parallelize training across distributed heterogeneous clusters [7, 15].
Although DNN applications and the clusters used to parallelize them are increasingly complex, the strategies used by today’s deep learning systems (e.g., TensorFlow, PyTorch, Caffe2, MXNet) to parallelize training remain simple. The most common parallelization technique is data parallelism, which places a replica of the entire neural network on each device, so that each device processes a subset of the training data and synchronizes network parameters across replicas at the end of each iteration. Data parallelism is efficient for compute-intensive DNN operations with few trainable parameters (e.g., convolution) but achieves suboptimal parallelization performance for operations with a large number of parameters (e.g., matrix multiplication). Another common parallelization strategy is model parallelism, which assigns disjoint subsets of a neural network each to a dedicated device. Model parallelism eliminates parameter synchronization between devices but requires data transfers between operations and disallows parallelism within an operation.
Expert-designed strategies combine the two techniques for specific DNNs. For example, one such strategy uses data parallelism for convolutional and pooling layers and switches to model parallelism for fully-connected layers to accelerate the training of convolutional neural networks. Expert-designed strategies improve on pure data and model parallelism but are still suboptimal. Section 8 shows that we are able to find parallelization strategies that are up to 2.3× faster than expert-designed strategies.
In addition to these manually designed parallelization strategies, recent work has proposed automated frameworks [33, 25] for finding efficient parallelization strategies in a limited search space. For example, REINFORCE uses a reinforcement learning model to learn efficient operation assignments for model parallelism by running diverse strategies on real devices. As another example, OptCNN is designed for parallelizing DNNs with linear computation graphs (e.g., AlexNet, VGG) and automatically finds strategies that exploit parallelism within each DNN operation. Existing automated frameworks explore either parallelism across different operations (e.g., REINFORCE) or parallelism within a single operation (e.g., OptCNN), but not both, and therefore miss faster strategies that use parallelism in both dimensions. We show that exploring a broader search space discovers parallelization strategies 1.2-3.8× faster than existing automated frameworks (see Section 8).
In this paper, we present FlexFlow, a deep learning framework that automatically finds fast parallelization strategies over a significantly broader search space than previous systems. To formalize the problem, we first define the SOAP (Sample-Operation-Attribute-Parameter) search space of parallelization strategies for DNNs. The operation dimension describes how different operations in a DNN are parallelized. In addition, for a single DNN operation, the sample and parameter dimensions indicate how training samples and model parameters are distributed across devices. Finally, the attribute dimension defines how different attributes within a sample are partitioned. Compared to existing systems that parallelize DNNs in a subset of SOAP dimensions, FlexFlow considers parallelizing DNNs in all these dimensions and therefore defines a more comprehensive search space that includes existing approaches as special cases.
A key challenge with the much larger SOAP search space is effectively evaluating candidate parallelization strategies to find an efficient one. Prior work such as REINFORCE relies on executing each parallelization strategy on the hardware for one iteration to measure its execution time. Unfortunately, this approach becomes prohibitively expensive because the SOAP search space is multiple orders of magnitude larger.
To address this problem, FlexFlow introduces a novel execution simulator that accurately predicts the performance of a parallelization strategy and is three orders of magnitude faster than profiling real executions. The challenge in designing the simulator is how to accurately estimate the execution time of different DNN operators (e.g., convolution and matrix multiplication), which scale non-linearly in a hardware-dependent way with the data. The FlexFlow simulator relies on the following two facts: (1) many DNN models use a small number of distinct operators (e.g., a neural machine translation model with hundreds of operators only uses four distinct operators); and (2) the execution time of each DNN operator is typically low-variance and largely independent of the contents of the input data.
The FlexFlow simulator measures the execution time of an operation once for each input size and uses the measured time to predict all operations with the same type, which only takes tens of milliseconds. These estimates are then used to predict the performance of a wide variety of parallelization strategies. In addition, the execution simulator uses a delta simulation algorithm that simulates a new strategy using incremental updates to previous simulations. Compared to existing approaches [33, 32] that measure the performance from real executions, our approach has two advantages. First, the FlexFlow simulator is much faster. As a comparison, REINFORCE  requires 12-27 hours to find an efficient operation assignment for model parallelism on 4 GPUs, while the FlexFlow simulator enables exploring a more comprehensive search space and finding better parallelization strategies (with 3.4-3.8x higher throughput than REINFORCE) in 14-40 seconds. Furthermore, REINFORCE uses 160 compute nodes (with 4 GPUs on each node) to find an efficient strategy in tens of hours, while our experiments use only a single compute node for the simulator.
The execution simulator also achieves high accuracy for predicting parallelization performance. We evaluate the simulator with six real-world DNNs on two different GPU clusters and show that, for all the measured executions, the relative difference between the real and simulated execution time is less than 30%. Most importantly for the search, we test different strategies for a given DNN application and show that their simulated execution time preserves real execution time ordering.
Using the execution simulator as an oracle, the FlexFlow execution optimizer uses a general Markov Chain Monte Carlo (MCMC) search algorithm (other search strategies could also be used) to explore the SOAP search space and iteratively propose candidate strategies based on the simulated performance of previous candidates. When the search procedure is finished, the execution optimizer returns the best strategy it has discovered.
We evaluate FlexFlow on a variety of real-world DNN benchmarks including image classification [28, 22, 40], text classification, language modeling, and neural machine translation. Compared to data/model parallelism and expert-designed parallelization strategies [27, 42], FlexFlow increases training throughput by up to 3.3×, reduces communication costs by up to 5×, and achieves significantly better scaling. In addition, FlexFlow outperforms the strategies found by REINFORCE by 3.4-3.8× on the same hardware configuration evaluated in REINFORCE, and outperforms OptCNN by 1.2-1.6×, by supporting a broader search space.
To summarize, our contributions are:
We define the SOAP search space for parallelizing DNN applications, which includes strategies that parallelize in any combination of the sample, operation, attribute, and parameter dimensions.
We show that under reasonable assumptions it is possible to reliably predict the execution time of parallelized DNNs using a simulator that is three orders of magnitude faster than actually running the DNNs directly on the hardware.
We describe FlexFlow, a deep learning framework that can search for and execute strategies from the entire SOAP space to accelerate DNN training.
We show that FlexFlow can increase training throughput by up to 3.8× over state-of-the-art parallelization approaches while improving scalability.
2 Related Work
Data and model parallelism have been widely used by existing deep learning systems (e.g., TensorFlow, Caffe2, and PyTorch) to distribute the training process across devices. Data parallelism keeps a copy of an entire DNN on each device, which is inefficient for operations with a large number of parameters (e.g., densely-connected layers) and becomes a scalability bottleneck in large-scale distributed training. Model parallelism [9, 15] splits a DNN into disjoint subsets and trains each subset on a dedicated device, which reduces communication costs for synchronizing network parameters in a DNN but exposes limited parallelism.
Expert-designed parallelization strategies manually optimize parallelization for specific DNNs by using experts’ domain knowledge and experience. For example, prior work introduces “one weird trick” that uses data parallelism for convolutional and pooling layers and switches to model parallelism for densely-connected layers to accelerate convolutional neural networks. To parallelize recurrent neural networks, another expert-designed strategy uses data parallelism that replicates the entire DNN on each compute node and switches to model parallelism for intra-node parallelization. Although these expert-designed parallelization strategies achieve performance improvement over data and model parallelism, they are suboptimal. We use these expert-designed strategies as baselines in our experiments and show that FlexFlow can further improve training performance by up to 2.3×.
Automated frameworks have been proposed for finding efficient parallelization strategies in a limited search space. REINFORCE  uses reinforcement learning to find efficient device placement for model parallelism. OptCNN  is designed for parallelizing DNNs with linear computation graphs and automatically finds efficient strategies that exploit parallelism within an operation.
Figure 1 summarizes the parallelism dimensions explored by existing approaches. Data parallelism uses the sample dimension to parallelize the training process, while model parallelism exploits the parameter and operation dimensions. Expert-designed strategies [27, 42] exploit parallelism in the sample or parameter dimension to parallelize an operation but do not support hybrid parallelism that uses a combination of the sample, attribute, and parameter dimensions to parallelize an operation (see Figure 3). Compared to these manually designed strategies, FlexFlow considers more sophisticated, and often more efficient, strategies to parallelize a single operation. In addition, compared to existing automated frameworks [33, 25], FlexFlow explores a significantly broader search space and is able to find strategies that are up to 3.8× faster.
Graph-based cluster schedulers. Previous work [24, 18] has proposed cluster schedulers that schedule cluster-wide tasks by using graph-based algorithms. For example, Quincy  maps task scheduling to a flow network and uses a min-cost max-flow (MCMF) algorithm to find efficient task placement. Firmament  generalizes Quincy by employing multiple MCMF optimization algorithms to reduce task placement latencies. Existing graph-based schedulers optimize task placement by assuming a fixed task graph. However, FlexFlow solves a different problem that requires jointly optimizing how to partition an operation into tasks by exploiting parallelism in the SOAP dimensions and how to assign tasks to devices.
3 Overview
In this section, we compare the FlexFlow programming interface with other frameworks in Section 3.1, provide a general overview of FlexFlow in Section 3.2, and discuss the limitations of our approach in Section 3.3.
3.1 Programming Interface
Similar to existing deep learning systems [7, 6, 2], FlexFlow uses an operator graph $\mathcal{G}$ to describe all operations and state in a DNN. Each node $o_i \in \mathcal{G}$ is an operation (e.g., matrix multiplication, convolution, etc.), and each edge $(o_i, o_j) \in \mathcal{G}$ is a tensor (i.e., an $n$-dimensional array) that is an output of $o_i$ and an input of $o_j$.
As far as we know, most deep learning systems (e.g., TensorFlow, PyTorch, and Caffe2) use data parallelism as the default parallelization strategy and support model parallelism as an alternative by allowing users to manually specify the device placement for each operation.
In contrast, FlexFlow takes a device topology $\mathcal{D} = (\mathcal{D}_N, \mathcal{D}_E)$ describing all available hardware devices and their interconnections, as shown in Figure 2. Each node $d_i \in \mathcal{D}_N$ represents a device (e.g., a CPU or a GPU), and each edge $(d_i, d_j) \in \mathcal{D}_E$ is a hardware connection (e.g., an NVLink, a PCI-e, or a network link) between devices $d_i$ and $d_j$. The edges are labeled with the bandwidth and latency of the connection.
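The two inputs described above can be sketched as plain data structures. This is a minimal illustration, not FlexFlow's actual interface: the class names `OperatorGraph` and `DeviceTopology`, their methods, and the example bandwidth and latency values are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class OperatorGraph:
    """Nodes are operations; each edge is a tensor from a producer to a consumer."""
    ops: dict = field(default_factory=dict)      # operation name -> operation type
    tensors: list = field(default_factory=list)  # (producer, consumer) pairs

    def add_op(self, name, op_type, inputs=()):
        # Every input tensor must come from an operation that already exists.
        for producer in inputs:
            if producer not in self.ops:
                raise KeyError(f"unknown producer operation: {producer}")
        self.ops[name] = op_type
        for producer in inputs:
            self.tensors.append((producer, name))


@dataclass
class DeviceTopology:
    """Nodes are devices; each edge is a connection labeled (bandwidth, latency)."""
    devices: dict = field(default_factory=dict)  # device name -> kind ("CPU", "GPU")
    links: dict = field(default_factory=dict)    # frozenset({a, b}) -> (bandwidth, latency)

    def add_device(self, name, kind):
        self.devices[name] = kind

    def connect(self, a, b, bandwidth, latency):
        # Connections are undirected, so the key ignores endpoint order.
        self.links[frozenset((a, b))] = (bandwidth, latency)

    def link(self, a, b):
        return self.links[frozenset((a, b))]


# Hypothetical 3-layer model on two GPUs (values are illustrative only).
graph = OperatorGraph()
graph.add_op("embed", "embedding")
graph.add_op("lstm", "recurrent", inputs=["embed"])
graph.add_op("linear", "matmul", inputs=["lstm"])

topology = DeviceTopology()
topology.add_device("gpu0", "GPU")
topology.add_device("gpu1", "GPU")
topology.connect("gpu0", "gpu1", bandwidth=20e9, latency=5e-6)
```

Keeping the topology separate from the operator graph is what lets the same application be re-optimized for a different cluster without changing the model description.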
FlexFlow automatically finds a parallelization strategy for an operator graph and a device topology. Compared to existing frameworks, FlexFlow has two advantages:
Programmability. For DNN applications with complex operator graphs running on clusters with deep device topologies, it is difficult for application developers, even domain experts, to manually design efficient operation assignments. FlexFlow takes the responsibility for finding efficient parallelization strategies and provides a more productive programming interface.
Portability. A parallelization strategy fine-tuned for one cluster may behave poorly on other clusters. FlexFlow’s search method automatically selects an efficient strategy for each hardware configuration, without requiring application changes.
3.2 FlexFlow Architecture
The main components of FlexFlow are shown in Figure 2. The FlexFlow execution optimizer takes an operator graph and a device topology as inputs and automatically generates an efficient parallelization strategy. The optimizer uses an MCMC search algorithm to explore the space of possible parallelization strategies and iteratively proposes candidate strategies that are evaluated by an execution simulator. The execution simulator uses a delta simulation algorithm that simulates a new strategy using incremental updates to previous simulations. The simulated execution time guides the search in generating future candidates. When the search time budget is exhausted, the execution optimizer sends the best discovered strategy to a distributed runtime for parallelizing the actual executions.
3.3 Limitations
The main limitation of our approach is that the execution simulator assumes the execution time of each operation is predictable and independent of the contents of input tensors, as we discuss in Section 5. Therefore, our approach may not be applicable to applications whose execution time is data dependent. However, for the DNN applications that are the subject of study here, which are based on dense matrix operations, execution time is highly predictable and independent of the contents of the matrices.
4 The SOAP Search Space
This section introduces the SOAP search space of parallelization strategies for DNNs. To parallelize a DNN operation across devices, we require each device to compute a disjoint subset of the operation’s output tensors. Therefore, we model the parallelization of an operation $o_i$ by defining how the output tensor of $o_i$ is partitioned.
|Operation||Sample||Attribute||Parameter|
|1D pooling||sample||length, channel||-|
|2D convolution||sample||height, width||channel|
Table 1: Parallelizable dimensions of example operations. The sample and channel dimensions index different samples and neurons in a tensor, respectively. For 1D and 2D images, the length and the combination of height and width dimensions specify a position in an image.
For an operation $o_i$, we define its parallelizable dimensions $\mathcal{P}_i$ as the set of all divisible dimensions in its output tensor. $\mathcal{P}_i$ always includes a sample dimension. We call any other dimension in $\mathcal{P}_i$ a parameter dimension if partitioning over that dimension requires splitting the model parameters, and an attribute dimension otherwise. Table 1 shows the parallelizable dimensions of some example operations. Finally, we also consider parallelism across different operations in the operation dimension.
A parallelization configuration $c_i$ of an operation $o_i$ defines how the operation is parallelized across multiple devices. Figure 3 shows some example configurations for parallelizing a 1D convolution operation in a single dimension as well as combinations of multiple dimensions.
For each parallelizable dimension in $\mathcal{P}_i$, $c_i$ includes a positive integer that is the degree of parallelism in that dimension. $|c_i|$ is the product of the parallelism degrees for all parallelizable dimensions of $c_i$. We use equal-size partitions in each dimension to guarantee well-balanced workload distributions. A parallelization configuration $c_i$ partitions the operation $o_i$ into $|c_i|$ independent tasks, denoted as $t_{i:1}, \ldots, t_{i:|c_i|}$; $c_i$ also includes the device assignment for each task $t_{i:k}$ ($1 \le k \le |c_i|$). Given the output tensor of a task and its operation type, we can infer the necessary input tensors to execute each task.
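To make the notion of a configuration concrete, the following sketch splits an output tensor into equal-size tasks, one per combination of partition indices, and pairs each task with a device. The function name `make_tasks`, the dictionary-based tensor description, and the example shapes are assumptions for illustration, not part of FlexFlow.

```python
from itertools import product
from math import prod


def make_tasks(output_shape, degrees, devices):
    """Partition an output tensor into equal-size tasks.

    output_shape: dict mapping dimension name -> size
    degrees:      dict mapping dimension name -> degree of parallelism
                  (dimensions that are not listed are not partitioned)
    devices:      one device name per task, in task order
    Returns a list of (device, {dimension: (start, stop)}) pairs.
    """
    dims = list(output_shape)
    for d in dims:
        if output_shape[d] % degrees.get(d, 1) != 0:
            raise ValueError(f"dimension {d!r} is not evenly divisible")
    # The number of tasks is the product of the per-dimension degrees.
    num_tasks = prod(degrees.get(d, 1) for d in dims)
    if len(devices) != num_tasks:
        raise ValueError(f"expected {num_tasks} devices, got {len(devices)}")

    tasks = []
    index_ranges = (range(degrees.get(d, 1)) for d in dims)
    for k, index in enumerate(product(*index_ranges)):
        region = {}
        for d, i in zip(dims, index):
            step = output_shape[d] // degrees.get(d, 1)
            region[d] = (i * step, (i + 1) * step)
        tasks.append((devices[k], region))
    return tasks


# Hybrid example: degree 2 in the sample dimension and 2 in the channel dimension.
tasks = make_tasks(
    output_shape={"sample": 64, "channel": 128},
    degrees={"sample": 2, "channel": 2},
    devices=["gpu0", "gpu1", "gpu2", "gpu3"],
)
```

With these inputs the operation is split into four tasks, each owning a 32 x 64 block of the output.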
Figure 4 shows an example parallelization configuration for a matrix multiplication operation. The operation is partitioned into four independent tasks assigned to dedicated GPU devices. The input and output tensors of the tasks are shown in the figure.
A parallelization strategy $\mathcal{S}$ describes one possible parallelization of an application. $\mathcal{S}$ includes a parallelization configuration $c_i$ for each operation $o_i$, and each $o_i$’s configuration can be chosen independently from among all possible configurations for $o_i$.
5 Execution Simulator
In this section, we describe the execution simulator, which takes an operator graph $\mathcal{G}$, a device topology $\mathcal{D}$, and a parallelization strategy $\mathcal{S}$ as inputs and predicts the execution time to run $\mathcal{G}$ on $\mathcal{D}$ using strategy $\mathcal{S}$. FlexFlow simulates the execution process instead of measuring the elapsed time from real executions for two reasons. First, processing one iteration of a DNN application can take seconds even on modern GPUs [19, 7]. The simulator runs up to three orders of magnitude faster than real executions and allows the execution optimizer to explore many more candidates in a given time budget. Second, the execution simulator requires fewer computation resources. A large-scale execution on thousands of devices can be simulated on a single node.
The simulator depends on the following assumptions:
A1. The execution time of each task is predictable with low variance and is independent of the contents of input tensors.
A2. For each connection $(d_i, d_j)$ between devices $d_i$ and $d_j$ with bandwidth $b$, transferring a tensor of size $s$ from $d_i$ to $d_j$ takes $s/b$ time (i.e., the communication bandwidth can be fully utilized).
A3. Each device processes the assigned tasks with a FIFO (first-in-first-out) scheduling policy. This is the policy used by modern devices such as GPUs.
A4. The runtime has negligible overhead. A device begins processing a task as soon as its input tensors are available and the device has finished previous tasks.
To simulate an execution, the simulator first builds a task graph, which includes all tasks derived from operations and dependencies between tasks, and then runs a simulation algorithm to generate an execution timeline. Section 5.1 describes task graph construction. Section 5.2 introduces a full simulation algorithm that builds timelines from scratch. Finally, Section 5.3 introduces an alternative delta simulation algorithm that generates a new timeline using incremental updates to a previous one.
5.1 Task Graph
A task graph models dependencies between individual tasks derived from operations and can also represent task execution timelines on individual devices. To unify the abstraction, we treat each hardware connection between devices as a communication device, and each data transfer as a communication task. Note that devices and hardware connections are modeled as separate devices. This allows computation (i.e., normal tasks) and communication (i.e., communication tasks) to be overlapped if possible.
Given an operator graph $\mathcal{G}$, a device topology $\mathcal{D}$, and a parallelization strategy $\mathcal{S}$, we use the following steps to construct a task graph $\mathcal{T} = (\mathcal{T}_N, \mathcal{T}_E)$, where each node $t \in \mathcal{T}_N$ is a task (i.e., a normal task or a communication task) and each edge $(t_i, t_j) \in \mathcal{T}_E$ is a dependency that task $t_j$ cannot start until task $t_i$ is completed. Note that the edges in the task graph are simply ordering constraints; they do not indicate data flow, as all data flow is included in the task graph as communication tasks.
For each operation $o_i \in \mathcal{G}$ with parallelization configuration $c_i$, we add tasks $t_{i:1}, \ldots, t_{i:|c_i|}$ into $\mathcal{T}_N$.
For each tensor $(o_i, o_j) \in \mathcal{G}$, which is an output of operation $o_i$ and an input of $o_j$, we compute the output sub-tensors written by tasks $t_{i:k_i}$ ($1 \le k_i \le |c_i|$) and the input sub-tensors read by tasks $t_{j:k_j}$ ($1 \le k_j \le |c_j|$). For every task pair $t_{i:k_i}$ and $t_{j:k_j}$ with shared tensors, if the two tasks are assigned to the same device, we add an edge $(t_{i:k_i}, t_{j:k_j})$ into $\mathcal{T}_E$, indicating a dependency between the two tasks, and no communication task is needed. If $t_{i:k_i}$ and $t_{j:k_j}$ with shared tensors are assigned to different devices, we add a communication task $t_c$ to $\mathcal{T}_N$ and two edges $(t_{i:k_i}, t_c)$ and $(t_c, t_{j:k_j})$ to $\mathcal{T}_E$. The new task $t_c$ is assigned to the communication device between the devices that perform $t_{i:k_i}$ and $t_{j:k_j}$.
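The two construction steps can be sketched in a few lines. This is an illustrative simplification: the function `build_task_graph`, the task-naming scheme, and the caller-supplied `shared_size` callback (which stands in for the sub-tensor overlap computation) are assumptions of this sketch, not FlexFlow's implementation.

```python
def build_task_graph(op_tasks, tensors, shared_size, exe_time, bandwidth):
    """Build a task graph as (nodes, edges).

    op_tasks:    operation -> list of (task_id, device) pairs
    tensors:     list of (producer_op, consumer_op) edges of the operator graph
    shared_size: hypothetical callback (src_task, dst_task) -> size of the data
                 the consumer task reads from the producer task (0 if none)
    exe_time:    hypothetical callback operation -> execution time of one task
    bandwidth:   frozenset({device_a, device_b}) -> bandwidth of that connection
    """
    nodes, edges = {}, set()

    # Step 1: add one normal task per partition of every operation.
    for op, tasks in op_tasks.items():
        for task_id, device in tasks:
            nodes[task_id] = {"device": device, "exeTime": exe_time(op)}

    # Step 2: add dependencies, inserting communication tasks across devices.
    for producer, consumer in tensors:
        for src, src_dev in op_tasks[producer]:
            for dst, dst_dev in op_tasks[consumer]:
                size = shared_size(src, dst)
                if size == 0:
                    continue  # no shared data, hence no dependency
                if src_dev == dst_dev:
                    edges.add((src, dst))
                else:
                    link = frozenset((src_dev, dst_dev))
                    comm = f"comm:{src}->{dst}"
                    nodes[comm] = {
                        # The connection itself is modeled as a device.
                        "device": "link:" + "-".join(sorted(link)),
                        # Transfer time is size / bandwidth (assumption A2).
                        "exeTime": size / bandwidth[link],
                    }
                    edges.add((src, comm))
                    edges.add((comm, dst))
    return nodes, edges


# Operation A runs on gpu0; operation B is split across gpu0 and gpu1.
nodes, edges = build_task_graph(
    op_tasks={"A": [("A:1", "gpu0")], "B": [("B:1", "gpu0"), ("B:2", "gpu1")]},
    tensors=[("A", "B")],
    shared_size=lambda src, dst: 8.0,
    exe_time=lambda op: 1.0,
    bandwidth={frozenset(("gpu0", "gpu1")): 4.0},
)
```

Only the cross-device pair produces a communication task; the same-device pair becomes a direct dependency edge.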
|Properties set in graph construction|
|exeTime||The elapsed time to execute the task.|
|device||The assigned device of the task.|
|Properties set in simulation|
|readyTime||The time when the task is ready to run.|
|startTime||The time when the task starts to run.|
|endTime||The time when the task is completed.|
|preTask||The previous task performed on device.|
|nextTask||The next task performed on device.|
|Internal properties used by the full simulation algorithm|
|state||Current state of the task, which is one of NOTREADY, READY, and COMPLETE.|
Figure 5a shows an example parallelization strategy for a standard 3-layer recurrent neural network consisting of an embedding layer, a recurrent layer, and a linear layer. The parallelization strategy represents commonly used model parallelism that assigns operations in each layer to a dedicated GPU. Figure 5b shows the corresponding task graph. Squares and hexagons indicate normal tasks and communication tasks, respectively, and each directed edge represents a dependency between tasks.
Table 2 lists the properties for each task in the task graph. The exeTime property is set during the graph construction. For a normal task derived from an operation, its exeTime is the time to execute the task on the given device and is estimated by running the task multiple times on the device and measuring the average execution time (assumption A1). A task’s exeTime is cached, and all future tasks with the same operation type and output size will use the cached value without rerunning the task. For a communication task, its exeTime is the time to transfer a tensor (of size $s$) between devices with bandwidth $b$ and is estimated as $s/b$ (assumption A2).
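The caching behavior described above can be sketched as a small memoizing wrapper. The class `ExeTimeCache`, the `measure` callback, and the `repeats` parameter are hypothetical names introduced for this illustration; a real measurement would run the operator on the target device rather than call a stub.

```python
class ExeTimeCache:
    """Measure each (operation type, output shape) once and reuse the average."""

    def __init__(self, measure, repeats=3):
        self._measure = measure    # hypothetical callback: (op_type, shape) -> seconds
        self._repeats = repeats
        self._cache = {}

    def exe_time(self, op_type, output_shape):
        key = (op_type, tuple(output_shape))
        if key not in self._cache:
            # Run the task several times and keep the average (assumption A1).
            samples = [self._measure(op_type, output_shape)
                       for _ in range(self._repeats)]
            self._cache[key] = sum(samples) / len(samples)
        return self._cache[key]


def comm_time(size, bandwidth):
    """Transfer time of a communication task, size / bandwidth (assumption A2)."""
    return size / bandwidth


# Stub measurement that records how often it is invoked.
calls = []

def fake_measure(op_type, shape):
    calls.append((op_type, tuple(shape)))
    return 0.002

cache = ExeTimeCache(fake_measure, repeats=3)
first = cache.exe_time("matmul", (64, 1024))
second = cache.exe_time("matmul", (64, 1024))   # served from the cache
other = cache.exe_time("matmul", (32, 1024))    # new output size, measured again
```

Because a model with hundreds of operators may contain only a handful of distinct (type, size) pairs, the number of real measurements stays small.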
5.2 Full Simulation Algorithm
We now describe a full simulation algorithm that we will use as a baseline for comparisons with our delta simulation algorithm. Algorithm 1 shows the pseudocode. It first builds a task graph using the method described in Section 5.1 and then sets the properties for each task using a variant of Dijkstra’s shortest-path algorithm. Tasks are enqueued into a global priority queue when ready (i.e., all predecessor tasks are completed) and are dequeued in increasing order by their readyTime. Therefore, when a task $t$ is dequeued, all tasks with an earlier readyTime have been scheduled, and we can set the properties for task $t$ while maintaining the FIFO scheduling order (assumption A3). Figure 5c shows the execution timeline of the example parallelization strategy.
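A minimal sketch of such a full simulation is shown below. It follows the description in the text (a priority queue ordered by readyTime and FIFO execution per device) but is not the paper's Algorithm 1; the function name `full_simulate` and the dictionary-based task representation are assumptions of this sketch.

```python
import heapq


def full_simulate(nodes, edges):
    """Simulate a task graph from scratch.

    nodes: task_id -> {"device": ..., "exeTime": ...}; task ids must be orderable
    edges: iterable of (src, dst) dependencies
    Returns task_id -> (readyTime, startTime, endTime).
    """
    successors = {t: [] for t in nodes}
    pending = {t: 0 for t in nodes}          # number of unfinished predecessors
    for src, dst in edges:
        successors[src].append(dst)
        pending[dst] += 1

    ready_time = {t: 0.0 for t in nodes}
    device_free = {}                         # device -> end time of its last task
    timeline = {}

    # Tasks without predecessors are ready at time 0.
    queue = [(0.0, t) for t in sorted(nodes) if pending[t] == 0]
    heapq.heapify(queue)

    while queue:
        ready, task = heapq.heappop(queue)   # smallest readyTime first
        device = nodes[task]["device"]
        # FIFO per device: start once ready and once the device is free (A3, A4).
        start = max(ready, device_free.get(device, 0.0))
        end = start + nodes[task]["exeTime"]
        device_free[device] = end
        timeline[task] = (ready, start, end)
        for nxt in successors[task]:
            ready_time[nxt] = max(ready_time[nxt], end)
            pending[nxt] -= 1
            if pending[nxt] == 0:            # all predecessors completed
                heapq.heappush(queue, (ready_time[nxt], nxt))
    return timeline


# Tasks a and b share gpu0; task c on gpu1 depends on a.
example_nodes = {
    "a": {"device": "gpu0", "exeTime": 2.0},
    "b": {"device": "gpu0", "exeTime": 3.0},
    "c": {"device": "gpu1", "exeTime": 1.0},
}
timeline = full_simulate(example_nodes, [("a", "c")])
makespan = max(end for _, _, end in timeline.values())
```

The predicted execution time of a strategy is the makespan, i.e., the largest endTime over all tasks; in this example it is 5.0 because b must wait for a on the shared device.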
5.3 Delta Simulation Algorithm
FlexFlow uses an MCMC search algorithm that proposes a new parallelization strategy by changing the parallelization configuration of a single operation in the previous strategy (see Section 6.2). As a result, in the common case, most of the execution timeline does not change from one simulated strategy to the next. Based on this observation, we introduce a delta simulation algorithm that starts from a previous task graph and only re-simulates tasks involved in the portion of the execution timeline that changes. This optimization dramatically speeds up the simulator, especially for strategies on large distributed machines. The full and delta simulation algorithms always produce the same timeline for a given task graph.
Algorithm 2 shows the pseudocode for the delta simulation algorithm. It first updates tasks and dependencies in the task graph and enqueues all modified tasks into a global priority queue (lines 4-5). Similar to the Bellman-Ford shortest-path algorithm, the delta simulation algorithm iteratively dequeues updated tasks and propagates the updates to subsequent tasks (lines 6-14).
For the example in Figure 5, consider a new parallelization strategy derived from the original strategy (Figure 5a) by only reducing the parallelism of one operation $o_i$ to 1 (i.e., $|c_i| = 1$). Figure 5d shows the task graph for the new parallelization strategy, which can be generated from the original task graph (in Figure 5c) by updating the simulation properties of tasks in the grey area.
6 Execution Optimizer
Table 3: DNNs used in our experiments.
|DNN||Description||Dataset||Reported Acc.||Our Acc.|
|Convolutional Neural Networks (CNNs)|
|AlexNet||A 12-layer CNN||Synthetic data||-||-|
|Inception-v3||A 102-layer CNN with Inception modules|
|ResNet-101||A 101-layer residual CNN with shortcut connections||ImageNet||76.4% (a)||76.5% (a)|
|Recurrent Neural Networks (RNNs)|
|RNNTC||4 recurrent layers followed by a softmax layer||Movie Reviews||79.8%||80.3%|
|RNNLM||2 recurrent layers followed by a softmax layer||Penn Treebank||78.4 (b)||76.1 (b)|
|NMT||4 recurrent layers followed by an attention and a softmax layer||WMT English-German||19.67 (c)||19.85 (c)|
(a) top-1 accuracy for single crop on the validation dataset (higher is better).
(b) word-level test perplexities on the Penn Treebank dataset (lower is better).
(c) BLEU scores on the test dataset (higher is better).
This section describes the execution optimizer that takes an operator graph and a device topology as inputs and automatically finds an efficient parallelization strategy. Using the simulator as an oracle, FlexFlow transforms the parallelization optimization problem into a cost minimization problem, namely minimizing the predicted execution time. The primary advantage of this approach is that it avoids explicitly encoding the trade-offs between interdependent optimizations (e.g., reducing data transfers vs. balancing workload distributions) and simply focuses on minimizing the application’s overall execution time.
Because each operation’s configuration can be chosen independently (Section 4), the number of possible strategies is exponential in the number of operations in the operator graph, which makes it intractable to exhaustively enumerate the search space. To find a low-cost strategy, FlexFlow uses a cost minimization search procedure to heuristically explore the space and returns the best strategy discovered.
6.1 MCMC Sampling
This section briefly introduces the MCMC sampling method used by the execution optimizer. MCMC sampling is a technique for obtaining samples from a probability distribution so that higher probability samples are visited proportionately more often than low probability samples. A common method to transform a cost function $\mathrm{cost}(\cdot)$ into a probability distribution is the following, where $\beta$ is a constant that can be chosen:
$$p(\mathcal{S}) \propto \exp\big(-\beta \cdot \mathrm{cost}(\mathcal{S})\big)$$
MCMC works by starting at any point in the search space (a random point, or perhaps a well-known starting point) and then generating a sequence of points with the guarantee that in the limit the set of points visited approaches the distribution given by $p(\cdot)$. In our setting, we begin with some parallelization strategy $\mathcal{S}_0$ and then generate a sequence of strategies $\mathcal{S}_0, \mathcal{S}_1, \ldots$.
We use the Metropolis-Hastings algorithm for generating Markov chains, which maintains a current strategy $\mathcal{S}$ and proposes a modified strategy $\mathcal{S}^*$ from a proposal distribution $q(\mathcal{S}^* \mid \mathcal{S})$. If the proposal is accepted, $\mathcal{S}^*$ becomes the new current strategy; otherwise another strategy based on $\mathcal{S}$ is proposed. This process is repeated indefinitely (e.g., until a time budget is exhausted). If the proposal distribution is symmetric, $q(\mathcal{S}^* \mid \mathcal{S}) = q(\mathcal{S} \mid \mathcal{S}^*)$, the acceptance criterion for a new strategy is the following:
$$\alpha(\mathcal{S} \to \mathcal{S}^*) = \min\big(1,\, p(\mathcal{S}^*)/p(\mathcal{S})\big) = \min\Big(1,\, \exp\big(\beta \cdot (\mathrm{cost}(\mathcal{S}) - \mathrm{cost}(\mathcal{S}^*))\big)\Big)$$
The acceptance criterion has several important properties. If $\mathcal{S}^*$ has a lower cost than $\mathcal{S}$, then $\mathcal{S}^*$ is always accepted. If $\mathcal{S}^*$ has a higher cost than $\mathcal{S}$, then $\mathcal{S}^*$ may still be accepted with a probability that decreases as a function of the difference between $\mathrm{cost}(\mathcal{S})$ and $\mathrm{cost}(\mathcal{S}^*)$. Intuitively, MCMC tends to behave as a greedy search algorithm, preferring to move towards lower cost whenever that is readily available, but can also escape local minima.
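The acceptance rule can be written as a short function. This sketch assumes the standard Metropolis rule for a symmetric proposal, min(1, exp(beta * (cost_current - cost_proposed))); the function names and the example values of beta and the costs are illustrative assumptions.

```python
import math
import random


def acceptance_probability(cost_current, cost_proposed, beta):
    """Probability of accepting a proposal under a symmetric proposal distribution."""
    if cost_proposed <= cost_current:
        return 1.0  # a proposal that is no worse is always accepted
    # A worse proposal is accepted with probability exp(-beta * cost increase).
    return math.exp(beta * (cost_current - cost_proposed))


def accept(cost_current, cost_proposed, beta, rng=random):
    """Draw the accept/reject decision for one proposal."""
    return rng.random() < acceptance_probability(cost_current, cost_proposed, beta)


# A cheaper proposal is always taken; a costlier one only sometimes.
p_better = acceptance_probability(10.0, 8.0, beta=0.5)
p_worse = acceptance_probability(8.0, 10.0, beta=0.5)
```

A larger beta makes the search greedier, because the probability of accepting a cost increase shrinks faster; a smaller beta makes it easier to leave a local minimum.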
6.2 Search Algorithm
Our method for generating proposals is simple: an operation in the current parallelization strategy is selected at random, and its parallelization configuration is replaced by a random configuration. Our definition of the proposal distribution satisfies the symmetry property, $q(\mathcal{S}^* \mid \mathcal{S}) = q(\mathcal{S} \mid \mathcal{S}^*)$, since, for any operation, its configurations are selected with the same probability.
We use existing strategies (e.g., data parallelism, expert-designed strategies) as well as randomly generated strategies as the initial candidates for the search algorithm. For each initial strategy, the search algorithm iteratively proposes new candidates until one of the following two criteria is satisfied: (1) the search time budget for the current initial strategy is exhausted; or (2) the search procedure cannot further improve the best discovered strategy for half of the search time.
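The search loop described above can be sketched as follows. This is a simplified illustration, not FlexFlow's optimizer: it counts proposals instead of wall-clock time, the `cost` callback stands in for the execution simulator, and the function name `mcmc_search`, its parameters, and the toy cost function are all assumptions of this sketch.

```python
import math
import random


def mcmc_search(initial, configs, cost, beta, max_steps, seed=0):
    """Search for a low-cost strategy starting from one initial strategy.

    initial:   dict operation -> configuration (the initial strategy)
    configs:   dict operation -> list of allowed configurations
    cost:      callback strategy -> predicted cost (stand-in for the simulator)
    max_steps: search budget, counted in proposals rather than seconds
    """
    rng = random.Random(seed)
    operations = sorted(configs)
    current, current_cost = dict(initial), cost(initial)
    best, best_cost = dict(current), current_cost
    last_improvement = 0

    for step in range(1, max_steps + 1):
        # Proposal: replace the configuration of one random operation.
        proposal = dict(current)
        op = rng.choice(operations)
        proposal[op] = rng.choice(configs[op])
        proposal_cost = cost(proposal)

        # Metropolis acceptance for a symmetric proposal distribution.
        if (proposal_cost <= current_cost
                or rng.random() < math.exp(beta * (current_cost - proposal_cost))):
            current, current_cost = proposal, proposal_cost

        if current_cost < best_cost:
            best, best_cost = dict(current), current_cost
            last_improvement = step

        # Criterion (2): stop after half the budget without improvement.
        if step - last_improvement >= max_steps / 2:
            break
    return best, best_cost


# Toy problem: each operation's cost is smallest at parallelism degree 2.
toy_configs = {"conv": [1, 2, 4], "lstm": [1, 2, 4], "linear": [1, 2, 4]}
toy_cost = lambda strategy: float(sum(abs(c - 2) for c in strategy.values()))
start = {"conv": 4, "lstm": 4, "linear": 4}
best, best_cost = mcmc_search(start, toy_configs, toy_cost, beta=1.0, max_steps=2000)
```

Running the loop once per initial candidate and keeping the overall best result corresponds to the multi-start procedure described in the text.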
7 FlexFlow Runtime
We found that existing deep learning systems (e.g., TensorFlow, PyTorch, Caffe2, and MXNet) only support parallelizing an operation in the batch dimension through data parallelism, and it is non-trivial to parallelize an operation in other dimensions or combinations of several dimensions in these systems. In addition, we are not aware of any existing system that supports parallelization at the granularity of individual operations.
To support parallelizing DNN models using any strategy defined in our parallelization space (see Section 4), we implemented the FlexFlow distributed runtime in Legion, a high-performance parallel runtime for distributed heterogeneous architectures, and use cuDNN and cuBLAS as the underlying libraries for processing DNN operations. We use the Legion high-dimensional partitioning interface to support parallelizing an operation in any combination of the parallelizable dimensions and use Legion’s fine-grain control mechanism to control parallelization at the granularity of each operation.
The key difference between the FlexFlow runtime and existing systems is that FlexFlow supports parallelizing an operation in any combination of the parallelizable dimensions and controls parallelization at the granularity of individual operations.
8 Evaluation
This section evaluates the performance of FlexFlow on six real-world DNN benchmarks and two GPU clusters. Section 8.1 describes the experimental setup for the evaluation. Section 8.2 compares FlexFlow with state-of-the-art parallelization approaches. Section 8.3 evaluates the accuracy and efficiency of the execution simulator. Sections 8.4 and 8.5 evaluate the quality of the best strategies discovered by the execution optimizer and discuss two of the best discovered strategies.
8.1 Experimental Setup
Table 3 summarizes the DNNs used in our experiments. AlexNet, Inception-v3, and ResNet-101 are three CNNs that achieved the best accuracy in the ILSVRC competitions. For AlexNet, the per-iteration training time is smaller than the time to load training data from disk. Following prior suggestions, we use synthetic data to benchmark the performance of AlexNet. For all other experiments, the training data is loaded from disk in the training procedure.
RNNTC, RNNLM, and NMT are sequence-to-sequence RNN models for text classification, language modeling, and neural machine translation, respectively. RNNTC uses four LSTM layers with a hidden size of 1024. RNNLM uses two LSTM layers with a hidden size of 2048. Both RNN models include a softmax linear layer after the last LSTM layer. NMT includes an encoder and a decoder, both of which consist of 2 LSTM layers with a hidden size of 1024. To improve model accuracy, we also use an attention layer on top of the last decoder LSTM layer. Figure 14 illustrates the structure of the NMT model. For all three RNN models, we set the number of unrolling steps for each recurrent layer to 40.
We follow prior work to construct operator graphs and set hyperparameters (e.g., learning rates, weight decays). We use synchronous training and a batch size of 64 for all DNN benchmarks, except for AlexNet, which uses a batch size of 256.
To evaluate the performance of FlexFlow with different device topologies, we performed the experiments on two GPU clusters, as shown in Figure 6. The first cluster contains 4 compute nodes, each of which is equipped with two Intel 10-core E5-2600 CPUs, 256 GB main memory, and four NVIDIA Tesla P100 GPUs. GPUs on the same node are connected by NVLink, and nodes are connected over 100 GB/s EDR Infiniband. The second cluster consists of 16 nodes, each of which is equipped with two Intel 10-core E5-2680 CPUs, 256 GB main memory, and four NVIDIA Tesla K80 GPUs. Adjacent GPUs are connected by a separate PCI-e switch, and all GPUs are connected to CPUs through a shared PCI-e switch. Compute nodes in the cluster are connected over 56 GB/s EDR Infiniband.
Unless otherwise stated, we set 30 minutes as the time budget for the execution optimizer and use data parallelism and a randomly generated parallelization strategy as the initial candidates for the search algorithm. As shown in Section 8.3.2, the search procedure terminates in a few minutes for most executions.
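As a rough illustration of a budgeted search that starts from several initial candidates, the sketch below runs a Metropolis-style randomized search on a toy strategy space. Everything in it is an assumption made for illustration: the names `budgeted_search`, `propose_one_op`, and `toy_cost`, the step budget standing in for the 30-minute time budget, the `beta` parameter, and the toy cost function. It is not FlexFlow's execution optimizer.

```python
import math
import random


def propose_one_op(strategy, rng, num_configs=3):
    """Hypothetical proposal: re-draw the configuration of one random operation."""
    op = rng.randrange(len(strategy))
    cfg = rng.randrange(num_configs)
    return strategy[:op] + (cfg,) + strategy[op + 1:]


def toy_cost(strategy):
    """Hypothetical cost: stands in for a simulated per-iteration time."""
    return sum((c - 1) ** 2 for c in strategy) + 1.0


def budgeted_search(initial_candidates, propose, cost, max_steps, beta=1.0, seed=0):
    """Search from each initial candidate, splitting a fixed step budget evenly."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    steps_per_start = max_steps // len(initial_candidates)
    for start in initial_candidates:
        current, current_cost = start, cost(start)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        for _ in range(steps_per_start):
            candidate = propose(current, rng)
            candidate_cost = cost(candidate)
            # Always accept improvements; accept regressions with a probability
            # that shrinks as the regression grows (Metropolis-style rule).
            if candidate_cost <= current_cost or rng.random() < math.exp(
                beta * (current_cost - candidate_cost)
            ):
                current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


starts = [(0, 0, 0, 0), (2, 0, 2, 1)]  # e.g. a baseline and a random strategy
best, best_cost = budgeted_search(starts, propose_one_op, toy_cost, max_steps=200)
print(best, best_cost)
```

Because the best strategy seen so far is tracked separately from the current one, the returned strategy is never worse than the better of the initial candidates, whatever the budget.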
8.2 Parallelization Performance
8.2.1 Per-iteration Performance
We compare the per-iteration training performance of FlexFlow with the following baselines. Data parallelism is commonly used in existing deep learning systems [7, 2, 6]. To control for implementation differences, we ran data parallelism experiments in TensorFlow r1.7, PyTorch v0.3, and our implementation and compared the performance numbers. Compared to TensorFlow and PyTorch, FlexFlow achieves the same or better performance numbers on all six DNN benchmarks, and therefore we report the data parallelism performance achieved by FlexFlow in the experiments.
Expert-designed strategies optimize parallelization based on domain experts’ knowledge and experience. For CNNs,  uses data parallelism for parallelizing convolutional and pooling layers and switches to model parallelism for densely-connected layers. For RNNs,  uses data parallelism that replicates the entire operator graph on each compute node and uses model parallelism that assigns operations with the same depth to the same GPU on each node. These expert-designed strategies are used as a baseline in our experiments. Model parallelism only exposes limited parallelism by itself, and we compare against model parallelism as a part of these expert-designed strategies.
Figure 7 shows the per-iteration training performance on all six DNN benchmarks. For ResNet-101, FlexFlow finds strategies similar to data parallelism (except using model parallelism on a single node for the last fully-connected layer) and therefore achieves similar parallelization performance. For other DNN benchmarks, FlexFlow finds more efficient strategies than the baselines and achieves 1.3-3.3× speedup. Note that FlexFlow performs the same operations as data parallelism and expert-designed strategies, and the performance improvement is achieved by using faster parallelization strategies. We found that the parallelization strategies discovered by FlexFlow have two advantages over data parallelism and expert-designed strategies.
Reducing overall communication costs. Similar to existing deep learning systems, the FlexFlow distributed runtime supports overlapping data transfers with computation to hide communication overheads. However, as we scale the number of devices, the communication overheads increase, but the computation time used to hide communication remains constant. Therefore, reducing overall communication costs is beneficial for large-scale distributed training. Figure (b) shows that, to parallelize the NMT model on 64 K80 GPUs (16 nodes), FlexFlow reduces the per-iteration data transfers by 2-5.5× compared to other parallelization approaches.
Reducing overall task computation time. Data parallelism always parallelizes an operation in the batch dimension. However, as reported in , parallelizing an operation through different dimensions can result in different task computation time. For the matrix multiplication operation in the NMT model, parallelizing it in the channel dimension reduces the operation’s overall computation time by 38% compared to parallelizing the operation in the batch dimension. Figure (c) shows that FlexFlow reduces the overall task computation time by 20% compared to data parallelism for the NMT model. The expert-designed strategy achieves slightly better total task computation time than FlexFlow. However, this is achieved by using model parallelism on each node, which disables any parallelism within each operation and results in imbalanced workloads. As a result, the expert-designed strategy achieves even worse execution performance than data parallelism (see Figure (a)). FlexFlow reduces the overall task computation time while enabling parallelism within an operation and maintaining load balance.
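To make the two partitioning choices concrete, the following minimal sketch splits one matrix multiplication either in the batch dimension or in the channel (output) dimension and checks that both reproduce the unpartitioned result. The function names, the nested-list representation, and the assumption that sizes divide evenly by the number of parts are all illustrative; the sketch says nothing about the measured running times reported above.

```python
def matmul(x, w):
    """Multiply a (batch x in) matrix by an (in x out) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_batch(x, w, parts):
    """Batch-dimension split: each task gets a slice of the rows of x and ALL of w."""
    step = len(x) // parts  # assumes len(x) is divisible by parts
    out = []
    for p in range(parts):
        out.extend(matmul(x[p * step:(p + 1) * step], w))
    return out


def split_channel(x, w, parts):
    """Channel-dimension split: each task gets ALL of x and a slice of w's columns."""
    step = len(w[0]) // parts  # assumes the output width is divisible by parts
    pieces = []
    for p in range(parts):
        w_slice = [row[p * step:(p + 1) * step] for row in w]
        pieces.append(matmul(x, w_slice))
    # Concatenate the per-task output columns, row by row.
    return [sum((piece[i] for piece in pieces), []) for i in range(len(x))]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # batch=4, in=3
w = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1]]        # in=3, out=4

assert split_batch(x, w, 2) == matmul(x, w)
assert split_channel(x, w, 2) == matmul(x, w)
```

In the batch split every task needs the full weight matrix, whereas in the channel split every task needs only its slice of the weights but the full input; this difference is what makes the choice of dimension matter for both computation and parameter synchronization.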
8.2.2 End-to-end Performance
FlexFlow performs the same computation as other deep learning systems for a DNN model and therefore achieves the same model accuracy. Table 3 verifies that FlexFlow achieves the state-of-the-art accuracies on the DNN benchmarks used in the experiments.
In this experiment, we compare the end-to-end training performance between FlexFlow and TensorFlow on Inception-v3. We train Inception-v3 on the ImageNet dataset until the model reaches the single-crop top-1 accuracy of 72% on the validation set. The training processes in both frameworks use stochastic gradient descent (SGD) with a learning rate of 0.045 and a weight decay of 0.0001. Figure 9 illustrates the training curves of the two systems on Inception-v3 and shows that FlexFlow reduces the end-to-end training time by 38% compared to TensorFlow.
8.2.3 Automated Parallelization Optimizer
We compare against two automated frameworks that find parallelization strategies in a limited search space.
REINFORCE  uses reinforcement learning to learn device placement for model parallelism. We are not aware of any publicly available implementation of REINFORCE, so we compare against the learned device placement for Inception-v3 and NMT, as reported in .
Figure (a) compares the training throughput of the strategies found by FlexFlow and REINFORCE for four K80 GPUs on a single node. The parallelization strategies found by FlexFlow achieve 3.4-3.8× speedup compared to REINFORCE. We attribute the performance improvement to the larger search space explored by FlexFlow.
Besides improving training performance, FlexFlow has two additional advantages over REINFORCE. First, REINFORCE requires executing each strategy in the hardware environment to get reward signals and takes 12-27 hours to find the best placement , while the FlexFlow execution optimizer finds efficient parallelization strategies for these executions in 14-40 seconds. Second, REINFORCE uses up to 160 compute nodes (with 4 GPUs on each node) to find the placement in time, while FlexFlow uses a single compute node to run the execution optimizer.
OptCNN  optimizes parallelization for DNNs with linear operator graphs. OptCNN assumes that different operations in an operator graph cannot be performed in parallel and estimates a DNN’s execution time as the sum of the operations’ computation time and synchronization time and the tensors’ data transfer time. This assumption allows OptCNN to use a dynamic programming algorithm to find an efficient parallelization strategy.
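The additive cost model described above is what makes dynamic programming applicable on a linear operator graph. The sketch below is a simplified illustration, not OptCNN's implementation: `node_cost[i][c]` stands for the computation plus synchronization time of layer `i` under configuration `c`, `edge_cost[i][p][c]` stands for the data transfer time between layer `i-1` (configuration `p`) and layer `i` (configuration `c`), and all names and numbers are hypothetical.

```python
def chain_dp(node_cost, edge_cost):
    """Return (minimum total time, chosen configuration per layer) for a chain."""
    n = len(node_cost)
    best = list(node_cost[0])  # best[c]: cheapest cost of layers 0..i ending in config c
    back = []                  # back[i-1][c]: best predecessor config for layer i, config c
    for i in range(1, n):
        new_best, choices = [], []
        for c, cost in enumerate(node_cost[i]):
            prev = min(range(len(best)), key=lambda p: best[p] + edge_cost[i][p][c])
            new_best.append(best[prev] + edge_cost[i][prev][c] + cost)
            choices.append(prev)
        best = new_best
        back.append(choices)
    # Recover the configuration sequence by walking the back-pointers.
    last = min(range(len(best)), key=lambda c: best[c])
    total = best[last]
    plan = [last]
    for choices in reversed(back):
        plan.append(choices[plan[-1]])
    plan.reverse()
    return total, plan


# Three layers, two candidate configurations each (hypothetical times).
node_cost = [[4, 6], [5, 2], [3, 3]]
edge_cost = [None, [[0, 3], [3, 0]], [[0, 1], [1, 0]]]  # edge_cost[0] is unused
print(chain_dp(node_cost, edge_cost))  # (11, [1, 1, 1])
```

The recurrence only works because each layer's cost depends on its own configuration and that of its single predecessor; once operations on different branches may run concurrently, the total time is no longer a simple sum and this decomposition does not apply.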
We compare the strategies found by FlexFlow and OptCNN for different DNNs on 16 P100 GPUs. The frameworks found the same parallelization strategies for AlexNet and ResNet with linear operator graphs and found different strategies for the other DNNs as shown in Figure (b). For these DNNs with non-linear operator graphs, FlexFlow achieves 1.2-1.6× speedup compared to OptCNN by using parallelization strategies that exploit parallelism across different operations. We show two examples in Section 8.5.
8.3 Execution Simulator
We evaluate the performance of the simulator using two metrics: simulator accuracy and simulator execution time.
8.3.1 Simulator Accuracy
In this experiment, we compare the estimated execution time predicted by the execution simulator with the real execution time measured by actual executions. Figure 11 shows the results for different DNNs and different available devices. The dashed lines indicate a relative difference of 0% and 30%, respectively, which encompasses the variance between actual and predicted execution time. In addition, for different parallelization strategies with the same operator graph and device topology (i.e., points of the same shape in the figure), their simulated execution time preserves actual execution time ordering, which shows that simulated execution time is an appropriate metric to evaluate the performance of different strategies.
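The two properties checked in this experiment, a bounded relative difference and a preserved ordering, can be expressed in a few lines. The sketch below is illustrative only: the function names are assumptions, the 30% limit mirrors the dashed line mentioned above, and the timing values are made up rather than taken from Figure 11.

```python
def relative_difference(predicted, actual):
    """Relative gap between simulated and measured execution time."""
    return abs(predicted - actual) / actual


def within_band(predicted, actual, limit=0.30):
    """True if every strategy's prediction is within `limit` of its measurement."""
    return all(relative_difference(p, a) <= limit for p, a in zip(predicted, actual))


def preserves_ordering(predicted, actual):
    """True if no pair of strategies is ranked one way by the simulator
    and the opposite way by real execution."""
    n = len(predicted)
    for i in range(n):
        for j in range(i + 1, n):
            if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) < 0:
                return False
    return True


# Hypothetical per-iteration times (seconds) for three strategies
# on the same operator graph and device topology.
predicted = [1.0, 1.4, 2.1]
actual = [1.2, 1.5, 2.5]

print(within_band(predicted, actual), preserves_ordering(predicted, actual))
```

Order preservation is the property the search relies on: the optimizer only needs the simulator to rank candidate strategies correctly, not to predict absolute times exactly.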
8.3.2 Simulator Execution Time
Figure 12 shows the search performance with different simulation algorithms for finding a strategy for the NMT model on 16 P100 GPUs on 4 nodes. The full and delta simulation algorithms terminate in 16 and 6 minutes, respectively. If the allowed time budget is less than 8 minutes, the full simulation algorithm will find a worse strategy than the delta simulation algorithm.
We compare the end-to-end search time of the execution optimizer with different simulation algorithms. For a given DNN model and device topology, we measure the average execution time of the optimizer using 10 random initial strategies. The results are shown in Table 4. The delta simulation algorithm is 2.2-6.9× faster than the full simulation algorithm. Moreover, the speedup over the full simulation algorithm increases as we scale the number of devices.
8.4 Search Algorithm
This section evaluates the quality of the best parallelization strategies discovered by the search algorithm.
First, we compare the best discovered strategies with the global optimal strategies for small executions. To obtain a search space of reasonable size, we limit the number of devices to 4 and consider the following two DNNs. LeNet  is a 6-layer CNN for image classification. The second DNN is a variant of RNNLM where the number of unrolling steps for each recurrent layer is restricted to 2. The search space for both DNNs contains approximately strategies. We use depth-first search to explore the search space and use A* to prune the search space. Finding the optimal strategies for LeNet and RNNLM took 0.8 and 18 hours, respectively. For both DNNs, FlexFlow finds the global optimal strategy.
Second, we test if the search algorithm returns at least a locally optimal strategy in larger search spaces by comparing the best discovered strategy with all of its neighbors. For this experiment, we consider all six DNNs on 2, 4, and 8 devices, where the number of neighbors remains small enough to exhaustively enumerate them all. All the strategies returned by FlexFlow were locally optimal.
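A local-optimality check of this kind can be sketched as follows. The sketch assumes that a strategy is a tuple holding one configuration index per operation and that a neighbor differs from it in the configuration of exactly one operation; these definitions, the function names, and the toy cost function are assumptions for illustration and do not reproduce FlexFlow's code.

```python
def neighbors(strategy, num_configs):
    """Yield every strategy that differs from `strategy` in exactly one operation."""
    for op, current in enumerate(strategy):
        for cfg in range(num_configs[op]):
            if cfg != current:
                yield strategy[:op] + (cfg,) + strategy[op + 1:]


def is_locally_optimal(strategy, num_configs, cost):
    """True if no single-operation change yields a strictly cheaper strategy."""
    base = cost(strategy)
    return all(cost(n) >= base for n in neighbors(strategy, num_configs))


def toy_cost(strategy):
    """Hypothetical stand-in for a simulated or measured execution time."""
    return sum((c - 1) ** 2 for c in strategy)


num_configs = (3, 3)  # two operations, three candidate configurations each
print(is_locally_optimal((1, 1), num_configs, toy_cost))  # True
print(is_locally_optimal((0, 1), num_configs, toy_cost))  # False
```

The number of neighbors grows with the sum of the per-operation configuration counts rather than their product, which is why exhaustive enumeration of neighbors stays feasible when enumerating the full search space does not.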
8.5 Case Studies
We discuss the best strategies discovered by FlexFlow and how they improve parallelization performance.
Inception-v3. Figure 13 shows the best discovered strategy for parallelizing Inception-v3 on four P100 GPUs on a single node, which exploits intra-operation parallelism for operations on the critical path and uses a combination of intra- and inter-operation parallelism for operations on different branches. This results in a well-balanced workload and reduces data transfers for parameter synchronization. Compared to data parallelism, this strategy reduces the parameter synchronization costs by 75% and the per-iteration execution time by 12%.
For parallelizing the same Inception-v3 model on four K80 GPUs with asymmetric connections between GPUs (see Figure (b)), we observe that the best discovered strategy tends to parallelize operations on adjacent GPUs with a direct connection to reduce the communication costs.
NMT. Figure 14 shows the best discovered strategy for parallelizing NMT on four P100 GPUs, which uses various strategies for parallelizing different layers. We briefly discuss the insights from this strategy. First, for a layer with a large number of network parameters and little computation (e.g., the embed layer), it is beneficial to perform the computation on a small number of GPU devices to reduce parameter synchronization costs. Second, for a layer with a large number of network parameters and a heavy computation workload (e.g., the softmax layer), FlexFlow uses parallelism in the channel dimension and assigns the computation for a subset of channels to each task. This allows each device to use a subset of the network parameters, which reduces parameter synchronization costs while maintaining load balance. Third, for multiple recurrent layers (e.g., the LSTM and attention layers), FlexFlow uses concurrency among different layers as well as parallelism within each operation to cooperatively reduce parameter synchronization costs while balancing load.
This paper presents FlexFlow, a deep learning system that automatically finds efficient parallelization strategies for DNN applications. FlexFlow uses a guided randomized search procedure to explore the space of possible strategies and includes an execution simulator that is an efficient and accurate predictor of DNN performance. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show FlexFlow significantly outperforms state-of-the-art parallelization approaches.
-  Movie review data. https://www.cs.cornell.edu/people/pabo/movie-review-data/, 2005.
-  A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai, 2016.
-  Conference on machine translation. http://www.statmt.org/wmt16, 2016.
-  Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas, 2016.
-  TensorFlow Benchmarks. https://www.tensorflow.org/performance/benchmarks, 2017.
-  Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org, 2017.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, ICML’16.
-  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
-  M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
-  C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
-  S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
-  T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
-  J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
-  W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov chain Monte Carlo in practice. CRC press, 1995.
-  I. Gog, M. Schwarzkopf, A. Gleave, R. N. M. Watson, and S. Hand. Firmament: Fast, centralized cluster scheduling at scale. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 99–115, Savannah, GA, 2016. USENIX Association.
-  P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
-  A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML’14, 2014.
-  W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.
-  G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
-  M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 261–276. ACM, 2009.
-  Z. Jia, S. Lin, C. R. Qi, and A. Aiken. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR, abs/1802.04924, 2018.
-  Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
-  A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.
-  S. Lam and R. Sethi. Worst case analysis of two scheduling algorithms. SIAM Journal on Computing, 6, 1977.
-  Y. LeCun. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 2015.
-  M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19.
-  A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean. A hierarchical model for device placement. In International Conference on Learning Representations, 2018.
-  A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. Device placement optimization with reinforcement learning. 2017.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, 2002.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  S. Treichler, M. Bauer, R. Sharma, E. Slaughter, and A. Aiken. Dependent partitioning. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA’ 16. ACM, 2016.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
-  W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014.