Deep learning has undergone rapid development in recent years [resnet, transformer, AmoebaNet], and state-of-the-art deep neural networks (DNNs) are becoming increasingly difficult to train. Vision models [VeryDeepCNN] can take weeks to train on a single GPU [inception], and language models [googleNMT, nmt] can consume hundreds of gigabytes (GBs) of memory [gpipe]. The demands for intensive computation and large memory call for distributed training with multiple devices, which is typically conducted in GPU clusters [dean2012large, Tensorflow].
A fundamental problem of distributed DNN training is finding a good parallelization strategy. A parallelization strategy is a partitioning and assignment of the operators of a DNN model to devices (e.g., GPUs), and each strategy is associated with runtime costs for training the model, including execution time, memory consumption and network communication time. (Execution time refers to the time taken for a mini-batch; memory consumption is the peak memory consumption for training a mini-batch, which includes both the model parameters and the activations.) Two simple and widely used parallelization strategies are data parallelism and model parallelism [dean2012large, kim2017splitnet]. Data parallelism keeps a copy of the entire model on each device and synchronizes the model copies in each mini-batch. Model parallelism assigns a disjoint set of layers to each device and communicates the activations across the devices. However, data parallelism is inefficient for layers with large parameters (e.g., fully connected layers), and model parallelism suffers from high communication cost when the activations are large (e.g., convolution layers).
Recently, more advanced methods [OptCNN, FlexFlow, Tofu] have been proposed to find parallelization strategies that are much more efficient than simple data parallelism and model parallelism. We call these methods auto-parallel or auto-parallelism. Their success lies in searching the large space of possible parallelization strategies using well-designed algorithms. OptCNN [OptCNN] minimizes the execution time for training convolutional neural networks (CNNs) with a dynamic programming algorithm. FlexFlow [FlexFlow] considers a more diverse set of DNN models, e.g., recurrent neural networks (RNNs) [lstm], and minimizes the execution time using a randomized Markov Chain Monte Carlo (MCMC) search algorithm. ToFu [Tofu] focuses on training large models and minimizes the memory consumption. As training large models (those that cannot fit in the memory of a single device) is becoming increasingly important [pipedream, gpipe], TensorFlow provides the Mesh-TensorFlow library [MeshTensorFlow] to allow users to program their own parallelization strategies.
Existing works only optimize a single objective (execution time or memory consumption), which results in limited flexibility to adapt to different scenarios. For example, when training large models using a small number of devices, simply minimizing the execution time could result in memory overflow. We also found that methods that minimize the memory consumption cannot fully utilize additional memory resources to reduce the execution time. In some cases, it is important to track the trade-offs among different objectives. For example, knowing the minimum execution time of a training job under different amounts of resources (e.g., memory and number of devices) can help us make resource allocation decisions in a shared GPU cluster. When training DNNs on the cloud, users need to know the trade-offs between the cost (i.e., resources) and the efficiency (i.e., execution time) to determine the amount of resources to purchase. Therefore, the algorithm should be flexible enough to find parallelization strategies according to specific scenarios and user preferences (on the cost-efficiency trade-off), rather than optimizing a single objective. Moreover, an auto-parallel system should make finding and programming parallelization strategies transparent to users, as both tasks require a deep understanding of distributed DNN training.
In this paper, we make three main contributions. First, we formulate the concept of cost frontier and propose the Frontier-Tracking (FT) algorithm to find the cost frontier efficiently. For a given DNN model and device configuration, the cost frontier is a minimum set of parallelization strategies such that, for any parallelization strategy, there exists a strategy in the frontier that gives a smaller or equal cost in every dimension (e.g., execution time, memory consumption). Thus, parallelization strategies outside the cost frontier are not attractive, as we can always find strategies in the frontier that match or outperform them. The cost frontier also provides a continuum for the trade-offs among different objectives and allows users to flexibly choose a parallelization strategy according to their scenario (e.g., resource availability in a cluster, cloud resource budget). As the complexity of finding the cost frontier by brute-force search is exponential (w.r.t. the number of operators in a given DNN model), the FT algorithm adopts a carefully designed dynamic programming procedure for efficient cost frontier tracking. Our analysis shows that the complexity of the FT algorithm is only quadratic in the number of operators in a given DNN model.
Second, we propose a flexible and user-friendly auto-parallel system called TensorOpt
. TensorOpt uses TensorFlow as the underlying execution engine and its API is almost identical to TensorFlow, so users only need to make a few changes to run their TensorFlow scripts as auto-parallel jobs on TensorOpt. TensorOpt also makes parallelization strategy search and implementation fully transparent to users by using the FT algorithm for strategy search and automatically generating the low-level execution graph according to the chosen parallelization strategy. Users only need to specify their preference for the parallelization strategy via some high-level options. By removing the tensor split restrictions in Mesh-TensorFlow [MeshTensorFlow], TensorOpt allows a larger search space for parallelization strategies and hence better performance.
Third, we conducted extensive experiments to characterize the cost frontier and validate the effectiveness of the FT algorithm and the TensorOpt system. For all the models we experimented with, we found that there exists a sharp turning point in the trade-off between memory consumption and execution time: the execution time increases rapidly when the available memory is below the turning point but drops slowly when more memory is provided. We also found that both inter-machine and intra-machine communication bandwidth play a decisive role in the efficiency of distributed DNN training. Thanks to the FT algorithm, TensorOpt is flexible in adapting to different scenarios, i.e., TensorOpt can choose strategies to minimize memory consumption when the number of devices is limited and fully utilize additional resources to minimize execution time. Moreover, both the FT algorithm and the TensorOpt system have good efficiency.
2 Background and Related Work
We first provide some background. Then we discuss related work and their limitations, which motivate our work.
2.1 Parallelization Strategy and Execution Cost
We first define the notation used in this paper. The computation devices (e.g., GPUs) are modeled as a device graph, with each node being a device and each edge being the network connection between two devices. A DNN is modeled as a computation graph, in which the nodes are operators and a directed edge indicates that the output tensor of one operator is used as the input of another. We focus on synchronous training, although our method can also be extended to asynchronous training (e.g., as in PipeDream [pipedream]) by changing the cost functions.
Parallelization configurations. A parallelization strategy contains a parallelization configuration for each operator in the computation graph and determines how the devices execute the training job. The configuration of each operator is selected from the set of valid parallelization configurations for that operator. More specifically, a parallelization configuration consists of a device mesh and some tensor maps, which jointly describe how the tensors (both input and model parameter) related to an operator are split among the devices. Following Mesh-TensorFlow [MeshTensorFlow], the device mesh is an integer array used to describe the logical organization of the devices. For example, 4 GPUs can be represented as [4] (a one-dimensional array) or [2, 2] (a two-dimensional array). A tensor map is an integer array whose size is the dimension of the tensor; it describes how each dimension of the tensor is split on the device mesh. Consider an operator that computes a matrix-vector product (with the matrix being the model parameter) with an input size of [200, 100], where 200 is the batch size and 100 is the vector length. With a device mesh [2, 2], a tensor map of [0, 1] for the input tensor means that the first dimension of the input is split across the first dimension of the device mesh and the second dimension of the tensor is split across the second dimension of the mesh. As a result, each device will have a slice of the input tensor with shape [100, 50]. If -1 is used in the tensor map, the corresponding tensor dimension is not split across any mesh dimension. More examples of parallelization configurations are shown in Figure 1
. We have developed a complete set of rules to decide the valid parallelization configurations for an operator (e.g., redundant computation of the same tensor on different devices is also allowed for possible memory/communication saving). The details will be released together with the code (we will open-source TensorOpt) and are omitted here for conciseness. As the set of valid configurations contains all feasible combinations of the device mesh and tensor maps, it can be very large when the number of devices and/or the dimension of the tensors is large.
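As a concrete illustration of the tensor-map convention above, the per-device shard shape can be computed mechanically from a tensor map and a device mesh. The helper below is a hedged sketch: the name shard_shape and the even-divisibility assumption are illustrative, not TensorOpt's actual rules.

```python
# Sketch of how a tensor map and a device mesh determine per-device shard
# shapes, following the Mesh-TensorFlow-style convention described above.

def shard_shape(tensor_shape, tensor_map, mesh):
    """tensor_map[d] = mesh axis that splits tensor dim d, or -1 if unsplit."""
    shape = list(tensor_shape)
    for dim, axis in enumerate(tensor_map):
        if axis != -1:
            assert shape[dim] % mesh[axis] == 0, "dim must divide evenly"
            shape[dim] //= mesh[axis]
    return shape

# The example from the text: input [200, 100] on a [2, 2] mesh with map [0, 1]
print(shard_shape([200, 100], [0, 1], [2, 2]))   # -> [100, 50]
# Replicated along mesh axis 1 instead: map [0, -1]
print(shard_shape([200, 100], [0, -1], [2, 2]))  # -> [100, 100]
```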
Execution cost. For an operator under a given parallelization configuration, its memory cost and time cost are defined as follows
where the memory cost is the sum of the memory for storing the (partitioned) model parameter and the memory for storing temporary tensors (e.g., tensors kept for backward propagation), and the time cost is the sum of the time taken to conduct the computation defined by the operator (including both forward and backward pass) and the time taken to synchronize the tensors associated with it (e.g., for model parameter update in data parallelism). (There are some other memory consumptions, e.g., for kernel execution and network communication, but we found them to be relatively much smaller.) The parameter and temporary memory can be derived from the specification of the operator and its parallelization configuration, while the computation and synchronization times are measured by running the operator under the parallelization configuration multiple times. We also call the memory cost and time cost in Eq. (1) the operator costs.
For an edge, its memory cost and time cost are defined as
where the time cost is the time taken to transfer the tensors between the two endpoint operators (including both forward and backward pass), which depends on the parallelization configurations of both operators. We call the costs in Eq. (2) the edge costs.
With the costs of each individual operator and edge, we can define the execution time (or per-iteration time), peak memory consumption, and communication cost of a complete parallelization strategy for the entire computation graph, as given in Eq. (3): the execution time and communication cost aggregate the time costs of the operators and edges, and the peak memory consumption aggregates their memory costs.
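As the display of Eq. (3) is not reproduced here, the aggregation can be sketched under simplifying assumptions: the dictionary layout is illustrative, and summing all memory costs is only a proxy for the true peak-memory definition in Eq. (3).

```python
# Hedged sketch of how per-operator and per-edge costs could aggregate for a
# complete strategy under synchronous training. All names are illustrative.

def strategy_costs(op_costs, edge_costs):
    """op_costs: {op: (mem, time)}, edge_costs: {(u, v): (mem, time)}.
    Returns (execution_time, memory, communication_time), where execution
    time sums compute and communication, and memory naively sums all
    resident tensors as a stand-in for the peak."""
    comm = sum(t for _, t in edge_costs.values())
    exec_time = sum(t for _, t in op_costs.values()) + comm
    memory = sum(m for m, _ in op_costs.values()) + \
             sum(m for m, _ in edge_costs.values())
    return exec_time, memory, comm
```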
2.2 Related Work
Data and model parallelism. Data parallelism [data_parallel] is a common parallelization strategy adopted by deep learning frameworks including TensorFlow [Tensorflow]
, PyTorch [PyTorch] and MXNet [mxnet]. It keeps a copy of the model on each device and partitions the input tensor among the devices along the sample (batch) dimension. Compared with data parallelism, model parallelism [dean2012large, RLDevicePlacement] is more suitable for large models, e.g., those that do not fit in the memory of a single device, as it partitions the model among the devices to alleviate the memory consumption problem. However, the resource utilization of vanilla model parallelism is low, as the devices execute different partitions of the model sequentially. Due to the increasing interest in training large models, recent works improve model parallelism with pipeline parallelism. GPipe [gpipe] splits a mini-batch into several micro-batches and pipelines them to reduce device idle time. PipeDream [pipedream] removes the mini-batch synchronization barrier in GPipe to further improve device utilization, but training becomes asynchronous. A dynamic programming algorithm is also proposed in PipeDream to find the model partitioning that minimizes the per-iteration time. However, asynchronous training often degrades the convergence speed, and some models may even fail to converge [ssp].
Manual strategies. It has long been observed that pure data or model parallelism may not achieve the best performance. One-weird-trick [OneWierdTrick] manually designs a parallelization strategy for CNNs, which uses data parallelism for the convolution layers and model parallelism for the fully connected layers. Mesh-TensorFlow [MeshTensorFlow] provides a flexible parallel training framework that allows users to specify their parallelization strategies. However, users need to find a good parallelization strategy by themselves and manually program it in the code, both of which require a good understanding of parallel training. Moreover, Mesh-TensorFlow imposes some restrictions on parallelization strategies, and we will show that these restrictions lead to sub-optimal performance.
Auto-parallel. Recently, some works propose to search for efficient parallelization strategies for DNN training using tailored algorithms. OptCNN [OptCNN] uses dynamic programming (DP) to find the parallelization strategy that minimizes the per-iteration time. The DP algorithm simplifies the model computation graph into a graph that contains only two nodes by conducting node and edge elimination, and finds the optimal strategy on the simplified graph using brute-force search. As OptCNN considers only execution time, its parallelization strategy may run out of memory for large models or when memory is limited. Moreover, the node and edge elimination of OptCNN is not sufficient for some models (e.g., BERT [bert]). FlexFlow [FlexFlow] works for a wider range of models using a random search algorithm to find the parallelization strategy. However, FlexFlow also considers only execution time, and the parallelization strategy it produces may not be optimal. ToFu [Tofu] minimizes the memory consumption for training large models using DP. The DP algorithm splits a tensor among two (groups of) devices at a time to reduce complexity, and ToFu disallows tensor replication to achieve low memory consumption. However, ToFu cannot leverage additional memory (beyond the minimum requirement) to reduce the execution time.
Memory optimizations. Some works reduce the memory consumption of training large models at the price of extra communication or computation costs [vdnn, chen2016training, memoryopt]. vDNN [vdnn] swaps tensors from GPU to CPU and reloads them for backward propagation to reduce the peak memory consumption. [chen2016training] keeps only some of the tensors in memory and recomputes the others when needed in backward propagation. However, these extra communication or computation costs may significantly degrade training performance. Our methods could be extended by considering reloading and re-computation as possible parallelization configurations.
Compared with the related works, our FT algorithm and TensorOpt system significantly improve both flexibility and usability. By tracking the cost frontier, FT can adapt to different scenarios, e.g., reducing the memory consumption when the model is large and/or memory is limited, while minimizing the execution time when memory is sufficient. FT can also fully utilize available resources and translate additional resources (e.g., memory) into performance improvements. Compared with Mesh-TensorFlow, TensorOpt is much more user-friendly by using the FT algorithm to search for the parallelization strategy and automatically executing the parallelization strategy. Users only need to define the computation graph using the high-level API (as in vanilla TensorFlow) and specify their preferences for the parallelization strategy.
3 The Frontier-Tracking Algorithm
In this section, we first introduce the concept of cost frontier. As brute-force search for parallelization strategies on the cost frontier has very high complexity, we propose an efficient frontier-tracking (FT) algorithm. Finally, we analyze the FT algorithm to validate its low complexity. For simplicity, we present the cost frontier and the FT algorithm for tracking the trade-off between execution time and memory consumption; generalizing our methods to the trade-off between any other pair of costs (e.g., memory consumption and network communication) is straightforward.
3.1 Cost Frontier
The formal definition of cost frontier is given as follows.
Let S be a set of (partial) parallelization strategy tuples, where each tuple records a (partial) parallelization strategy together with its execution time and memory consumption. The cost frontier of S is the minimum subset F of S such that, for any strategy tuple in S, there exists a tuple in F whose execution time and memory consumption are both smaller than or equal to those of that tuple.
We provide an illustration of the cost frontier in Figure 2, in which each point is a strategy tuple with randomly generated costs and the points on the line form the cost frontier. According to Definition 1, for a strategy that is not in the cost frontier, we can find some strategy in the frontier that reduces at least one of the two costs without increasing the other. Therefore, it suffices to find all parallelization strategies in the frontier of execution time and memory consumption. Users can then choose a parallelization strategy in the frontier according to their situation. For example, if memory is sufficient, the strategy that minimizes per-iteration time can be used; when memory is limited, users can choose the strategy that minimizes memory consumption instead.
Given a set of strategy tuples, its cost frontier can be obtained using Algorithm 1. Algorithm 1 first sorts the tuples in ascending order of memory consumption and then checks them in this order, putting a tuple into the frontier if its time consumption is smaller than that of all tuples preceding it in the sorted list; a running minimum records the smallest time consumption seen so far.
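A minimal Python sketch of this reduce procedure follows; the tuple layout (strategy, time, memory) is an assumption for illustration, not Algorithm 1's exact pseudocode.

```python
def reduce_frontier(tuples):
    """Cost frontier of (strategy, time, memory) tuples (Algorithm 1 sketch):
    scan in ascending order of memory and keep a tuple iff its time is
    strictly smaller than that of every tuple scanned before it."""
    frontier, best_time = [], float("inf")
    for tup in sorted(tuples, key=lambda s: (s[2], s[1])):  # by memory, time
        if tup[1] < best_time:
            frontier.append(tup)
            best_time = tup[1]
    return frontier

pts = [("a", 5, 1), ("b", 3, 2), ("c", 4, 3), ("d", 1, 4)]
print([s for s, _, _ in reduce_frontier(pts)])  # -> ['a', 'b', 'd']
```

Here "c" is dropped because "b" has both lower memory and lower time, matching Definition 1.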
A straightforward method to track the cost frontier is to enumerate all possible parallelization strategies for the computation graph, calculate their memory and time consumption according to Eq. (3), and find the cost frontier by applying Algorithm 1. However, this method has exponential complexity: if the computation graph contains n operators and each operator has p parallelization configurations, brute-force search needs to go through all p^n parallelization strategies. As the computation graph usually contains tens or even hundreds of operators for popular DNN models, brute-force search is infeasible. Therefore, we propose the FT algorithm to find the cost frontier efficiently. FT relies on the following basic operations to manipulate cost frontiers.
Given two cost frontiers (or, more generally, two sets of strategy tuples), these operations are:
Product, which is the Cartesian product of the two frontiers: each tuple from the first frontier is combined with each tuple from the second, and their execution times and memory consumptions are summed.
Union, which is the union of the two frontiers, placing the tuples of both into a single set.
Reduce, which is Algorithm 1. As the result of product and union may no longer be a frontier, we assume that reduce is always applied after these two operations.
Intuitively, product constructs composite parallelization strategies by enumerating all possible combinations of two partial strategies, while union simply pools candidate tuples.
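Under the same illustrative (strategy, time, memory) tuple layout, the three operations can be sketched as follows; this is not TensorOpt's implementation, and partial strategies are modeled as plain tuples so that concatenation composes them.

```python
def _reduce(tuples):  # Algorithm 1: keep only non-dominated tuples
    frontier, best_time = [], float("inf")
    for tup in sorted(tuples, key=lambda s: (s[2], s[1])):
        if tup[1] < best_time:
            frontier.append(tup)
            best_time = tup[1]
    return frontier

def product(f1, f2):
    # Combine each pair of partial strategies; both costs are summed.
    return _reduce([(s1 + s2, t1 + t2, m1 + m2)
                    for s1, t1, m1 in f1 for s2, t2, m2 in f2])

def union(f1, f2):
    # Place the tuples of both sets together, then reduce.
    return _reduce(list(f1) + list(f2))
```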
3.2 Frontier Tracking
Overview. The procedure of our FT algorithm is shown in Algorithm 2, which finds all parallelization strategies in the cost frontier of execution time and memory consumption for a given computation graph and device graph. Algorithm 2 can be decomposed into four steps: initialization, elimination, linear dynamic programming (LDP), and unroll. Initialization (Line 3) sets the cost frontier for each operator and edge in the computation graph. Elimination (Lines 4-11) simplifies the graph into a linear graph (as illustrated in Figure 4) and updates the cost frontiers of the operators and edges. LDP (Line 12) finds the cost frontier for the simplified graph, and unroll (Lines 13-14) reconstructs the parallelization strategies in the cost frontier for the original computation graph. In the following, we explain each of the four steps in more detail.
Initialization. FT begins by initializing the costs of the edges and operators by enumerating all their possible parallelization configurations. With a slight abuse of notation, we treat the operator cost tuple in Eq. (1) as the (singleton) cost frontier of an operator under a chosen parallelization configuration. Similarly, the edge cost tuple in Eq. (2) is the cost frontier of an edge under the parallelization configurations of its two endpoint operators. Although these frontiers have a cardinality of 1 when first initialized, their sizes may change when the FT algorithm updates them during elimination and LDP.
Elimination. FT conducts four types of elimination (node, edge, branch, and heuristic elimination) to simplify the computation graph into a linear graph. The first three preserve the exact cost frontier, while heuristic elimination significantly reduces the complexity with only a small loss in accuracy. Compared with the two types of elimination (i.e., node and edge elimination) in OptCNN [OptCNN], the additional types enable FT to handle a more diverse set of DNN models (e.g., BERT). Moreover, for each type of elimination, FT maintains the cost frontier instead of a single execution time. We illustrate the four eliminations in Figure 3 and discuss them as follows.
Node Elimination. FT conducts node elimination when an operator has only one input operator and one output operator. As shown in Figure 3(a), the operator and its two incident edges are replaced by a single edge, whose cost frontier is deduced as follows
Under each combination of the parallelization configurations of the two neighboring operators, the middle operator is eliminated by summing its operator cost with the costs of its two incident edges. Note that we apply reduce to the result of Eq. (4) to ensure that the cost set of the new edge is a frontier, which reduces its size and the complexity of subsequent operations. For each tuple in the new frontier, FT records which parallelization configuration the eliminated operator takes to produce it, in order to provide information for unrolling the elimination.
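The combination in Eq. (4) can be sketched as follows; the data layout is an illustrative assumption (costs are (time, memory) pairs keyed by configurations), and the inner reduce is Algorithm 1.

```python
def _reduce(tuples):  # Algorithm 1 on (config, time, memory) tuples
    frontier, best_time = [], float("inf")
    for tup in sorted(tuples, key=lambda s: (s[2], s[1])):
        if tup[1] < best_time:
            frontier.append(tup)
            best_time = tup[1]
    return frontier

def eliminate_node(op_cost_v, edge_uv, edge_vw, cfgs_u, cfgs_v, cfgs_w):
    """Merge operator v into a new edge (u, w). op_cost_v[cv] and the edge
    dicts map configurations to (time, memory) costs."""
    new_edge = {}
    for cu in cfgs_u:
        for cw in cfgs_w:
            cands = []
            for cv in cfgs_v:  # try every configuration of v, sum the costs
                t = (op_cost_v[cv][0] + edge_uv[(cu, cv)][0]
                     + edge_vw[(cv, cw)][0])
                m = (op_cost_v[cv][1] + edge_uv[(cu, cv)][1]
                     + edge_vw[(cv, cw)][1])
                cands.append((cv, t, m))  # keep cv for unrolling later
            new_edge[(cu, cw)] = _reduce(cands)
    return new_edge
```

Each entry of new_edge is a frontier over v's configurations, so later steps never revisit dominated choices of v.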
Edge Elimination. Edge elimination is conducted when there are multiple edges connecting the same pair of operators. These edges are merged into a single edge, as illustrated in Figure 3(b). The cost frontier of the new edge is calculated as follows
Under the same pair of parallelization configurations of the upstream and downstream operators, the costs of the merged edges are added together for edge elimination. As node and edge elimination cannot simplify some complex computation graphs (e.g., BERT) to simple structures, we introduce branch elimination and heuristic elimination.
Branch Elimination. FT conducts branch elimination when an operator has multiple input operators and these input operators cannot be eliminated by node or edge elimination. As shown in Figure 3(c), an operator receives inputs from two operators that cannot be eliminated because they are not connected by an edge. Branch elimination removes one of the two input operators by merging it into the other. The cost frontier of the remaining operator is updated as follows
where the parallelization configurations of the two merged operators are concatenated, and the costs of the merged operator and its edge to the downstream operator are added to the cost of the remaining operator.
Heuristic Elimination. FT conducts heuristic elimination when the three types of elimination introduced above cannot be applied. For example, the attention mask is used by all the transformer layers in BERT [bert] and thus cannot be eliminated. An illustration is shown in Figure 3(d), in which the computation graph cannot be simplified with the other types of elimination. In this case, heuristic elimination simply decides the parallelization configuration for an operator and removes the operator along with all its outgoing edges. We use multiple heuristics to choose the parallelization configuration, e.g., minimizing the memory consumption of the operator or a weighted combination of different objectives. After the operator is removed with its chosen configuration, each operator that takes input from it updates its frontier as follows
which adds the cost of the removed edge to the downstream operator. Note that heuristic elimination is not guaranteed to preserve the cost frontier. However, we found that it significantly reduces the running time of FT with only marginal loss in accuracy, because heuristic elimination is usually conducted only a very small number of times. For example, it only needs to be used twice for BERT.
LDP. One can apply the aforementioned four types of elimination to simplify the computation graph into a graph that contains only two nodes and then find the cost frontier of the simplified graph by brute-force search. This method is similar to the algorithm in OptCNN [OptCNN] and we call it FT-Elimination. However, we found that if the computation graph has a linear structure (as shown in Figure 4), its cost frontier can be found much more efficiently than by conducting eliminations. Moreover, popular DNN models can easily be organized into a linear structure. For example, if we treat each residual block of ResNet [resnet] as a group, then the groups form a linear structure. For BERT [bert], each transformer block can also be regarded as a group and the blocks form a linear structure.
Therefore, FT conducts elimination such that the resultant graph has a linear structure. For this purpose, Algorithm 2 uses a simple heuristic when choosing the nodes and edges to eliminate. Before elimination starts, we mark the first operator (in topological order, with ties broken randomly) in the computation graph. During elimination, we do not eliminate the marked operators, and check whether the last operator we marked has only one downstream operator. If so, we mark that downstream operator, as it is also on the linear structure. After obtaining a linear graph, Algorithm 3 (LDP) is used to compute the cost frontier.
For Algorithm 3, we assume that the cost frontiers of the operators and edges in the linear graph are properly initialized by the elimination procedures. The algorithm computes the cumulative cost frontier from the operator that receives the initial input (the first operator) to the operator that generates the final model output (the last operator). For the first operator, its cumulative frontier is initialized with its own operator frontier. For each subsequent operator, we take the product of the cumulative frontier of the previous operator, the frontier of the connecting edge, and the operator's own frontier to derive its cumulative frontier, which represents the cumulative cost frontier from the first operator up to the current one under each choice of the current operator's parallelization configuration. We only need to consider the partial strategy tuples in the previous operator's cumulative frontier when choosing the parallelization configuration of the current operator. This is because a tuple that does not belong to the previous frontier is dominated by at least one tuple in it with lower time and memory consumption, and the dominated tuple cannot enter the cost frontier after we add the costs of the current operator and the connecting edge, which are common to both tuples. Finally, LDP reduces the cumulative frontier at the last operator to find the cost frontier for the entire graph (Line 10).
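The LDP recurrence can be sketched as follows. The dict-based frontiers and the (strategy, time, memory) tuple layout are illustrative assumptions, not Algorithm 3's exact pseudocode; here op_frontier[i][c] gives the cost of operator i under configuration c, and edge_cost[i][(cp, c)] gives the cost of the edge between operators i-1 and i.

```python
def _reduce(tuples):  # Algorithm 1: keep only non-dominated tuples
    frontier, best_time = [], float("inf")
    for tup in sorted(tuples, key=lambda s: (s[2], s[1])):
        if tup[1] < best_time:
            frontier.append(tup)
            best_time = tup[1]
    return frontier

def ldp(n, configs, op_frontier, edge_cost):
    # cum[c]: frontier of (partial_strategy, time, mem) tuples ending at the
    # current operator when it selects configuration c
    cum = {c: [((c,), *op_frontier[0][c])] for c in configs[0]}
    for i in range(1, n):
        new_cum = {}
        for c in configs[i]:
            cands = []
            for cp in configs[i - 1]:
                et, em = edge_cost[i][(cp, c)]
                ot, om = op_frontier[i][c]
                for strat, t, m in cum[cp]:
                    cands.append((strat + (c,), t + et + ot, m + em + om))
            new_cum[c] = _reduce(cands)  # drop dominated partial strategies
        cum = new_cum
    # final frontier over all configurations of the last operator
    return _reduce([tup for f in cum.values() for tup in f])
```

For clarity this sketch carries full partial strategies in each tuple; Algorithm 3 instead records back-pointers and reconstructs the strategies in the unroll step.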
We denote the method that uses LDP to solve the cost frontier as FT-LDP, to contrast with FT-Elimination. As we will show in Section 3.3, for a linear graph, the complexity of FT-LDP in Algorithm 3 is quadratic in the number of operators, whereas tracking the cost frontier with FT-Elimination is much more costly due to the large number of parallelization configurations per operator. We will also show in the experiments that FT-LDP has a much shorter running time than FT-Elimination.
Unroll LDP and elimination. FT unrolls the strategy tuples in the final cost frontier produced by LDP in Algorithm 3 to reconstruct the parallelization strategies for the entire computation graph. To provide information for unrolling, in each step of LDP and for each strategy tuple in a cumulative frontier, FT records the parallelization configuration of the current operator and the strategy tuple in the previous cumulative frontier that produced it. The final strategy tuples are then unrolled by tracing back each step of LDP recursively. For unrolling elimination, FT records the parallelization configuration taken by the eliminated operator for each tuple in the cost frontier produced by the elimination. Once the selected partial strategy is known, the parallelization configuration of the eliminated operator can be reconstructed.
Multi-threading for efficiency. FT can be easily parallelized with multi-threading. For LDP, the cumulative frontiers under different parallelization configurations of an operator can be computed in parallel, as these computations only read the cumulative frontiers of the previous operator. Similarly, for the eliminations, the frontier updates for different parallelization configuration choices are also independent. For example, in node elimination, the new edge frontiers under different configurations of the two neighboring operators can be solved in parallel. Therefore, we spawn multiple threads to accelerate LDP and the eliminations.
Improving cost estimation accuracy. The memory consumption and execution time of the operators are relatively easy to predict [OptCNN, FlexFlow], so the accuracy of cost estimation depends mainly on the quality of the communication time estimates. FlexFlow and OptCNN calculate the communication time as the amount of data to be transferred divided by the speed of the network connection between the devices. We found that this estimation method can lead to very large errors (e.g., more than 70%) for two main reasons. First, latency can dominate the communication time when transferring small tensors. Second, several communication operations may be executed by different devices simultaneously and contend for PCIe or InfiniBand bandwidth, which makes the communication time difficult to estimate.
We use collective operations for all network communication and adopt a profile-based method to estimate the communication time. For collective communication operations, a parallelization configuration of an operator divides the devices into disjoint groups (called a device partitioning), and each group has the same amount of data to transfer. Although there is no communication between groups, different groups may still contend for bandwidth. Therefore, we profile the actual bandwidth under different device partitioning schemes and data sizes. Specifically, under each device partitioning scheme, we measure the actual bandwidth for collective communication with data sizes of 2^k for k = 1, 2, ..., K, where K is sufficiently large to cover all possible data sizes. When predicting the communication time for data of size d, we find the integer k satisfying 2^k <= d < 2^(k+1)
and use the interpolation of the actual bandwidths at 2^k and 2^(k+1). Our measurement shows that this method yields only a small error in communication time estimation.
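A hedged sketch of the resulting estimator is given below; the bandwidth table (one measured value per power-of-two message size under a fixed device partitioning) and the linear interpolation scheme are assumptions consistent with the description above, not TensorOpt's exact code.

```python
import math

# measured_bw[k] is the profiled bandwidth (bytes/s) for a message of 2**k
# bytes under a fixed device partitioning; values are supplied by profiling.

def estimate_comm_time(size, measured_bw):
    """Predict transfer time for `size` bytes by interpolating between the
    profiled bandwidths at the two nearest power-of-two message sizes."""
    k = min(max(int(math.floor(math.log2(size))), 1), len(measured_bw) - 2)
    lo, hi = 2 ** k, 2 ** (k + 1)
    frac = (size - lo) / (hi - lo)              # linear interpolation weight
    bw = (1 - frac) * measured_bw[k] + frac * measured_bw[k + 1]
    return size / bw
```

Clamping k keeps the lookup inside the profiled table, so very small or very large messages fall back to the nearest measured bucket.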
3.3 Complexity Analysis
In this part, we analyze the complexity of FT-LDP in Algorithm 3. The results show that FT-LDP has a complexity that is quadratic in terms of the number of operators in the computation graph.
Lemma 1. For a set containing $n$ parallelization strategy tuples, its cost frontier can be obtained with a complexity of $O(n \log n)$ using Algorithm 1.
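For intuition, a cost frontier over (memory, time) tuples can be computed by sorting on memory and keeping tuples with strictly decreasing time. This is a generic Pareto-frontier sketch, not the paper's Algorithm 1 verbatim:

```python
def cost_frontier(tuples):
    """Compute the cost frontier (Pareto-minimal set) of (memory, time)
    strategy tuples: after sorting by memory, a tuple survives only if
    its time is strictly below that of every tuple with smaller memory.
    Sorting dominates, so the complexity is O(n log n)."""
    frontier = []
    best_time = float("inf")
    for mem, t in sorted(tuples):
        if t < best_time:
            frontier.append((mem, t))
            best_time = t
    return frontier

strategies = [(8, 5.0), (10, 3.0), (12, 3.5), (16, 2.0), (9, 4.0)]
front = cost_frontier(strategies)
# (12, 3.5) is dominated by (10, 3.0): more memory and more time.
```

Every surviving point represents a parallelization strategy that is not dominated in both memory and time by any other strategy.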
Definition 1 (Random order). For a set $S$ containing $n$ strategy tuples, let $r_m(s)$ and $r_t(s)$ be the rank of tuple $s \in S$ when sorting $S$ in ascending order of memory and time consumption, respectively. $S$ is said to have random order if $\Pr[r_m(s)=i] = 1/n$ and $\Pr[r_t(s)=j] = 1/n$ for $1 \le i, j \le n$, and $r_m(s)$ and $r_t(s)$ are independent.
In the following analysis, we always assume that a set has random order when solving its cost frontier. As we will see soon, the random order assumption implies that the cost frontier of a large set only has a small cardinality, which matches practice as most of the parallelization strategies are not favorable (i.e., both execution time and memory consumption are large).
Lemma 2. For a tuple set $S$ having random order and containing $n$ tuples, the expected size of its cost frontier $F(S)$ is $O(\log n)$.
Proof. Denote the expected size of $F(S)$ as $h(n)$, where $n$ is the cardinality of the tuple set $S$. Consider the tuple having the minimum time consumption in $S$ (denoted as $s^\star$): it is obvious that $s^\star \in F(S)$ and that tuples having larger memory consumption than $s^\star$ do not belong to $F(S)$. The cost frontier of the tuples having smaller memory consumption than $s^\star$ also belongs to $F(S)$, and the number of these tuples follows a discrete uniform distribution on $\{0, 1, \dots, n-1\}$ due to the random order assumption. Therefore, we can get the following recursive function:
$$h(n) = 1 + \frac{1}{n} \sum_{k=0}^{n-1} h(k), \qquad h(0) = 0.$$
Solving the recursion gives $h(n) = \sum_{k=1}^{n} \frac{1}{k} = O(\log n)$.
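The recursion and its harmonic-number solution can be checked numerically; a small sketch using exact rational arithmetic to avoid rounding (the names `h` and `H` are ours):

```python
from fractions import Fraction

# Expected frontier size under the random-order assumption:
# h(n) = 1 + (1/n) * sum_{k=0}^{n-1} h(k), with h(0) = 0.
h = [Fraction(0)]
for n in range(1, 200):
    h.append(1 + sum(h) / n)

# Claimed closed form: the n-th harmonic number
# H_n = 1 + 1/2 + ... + 1/n, which is O(log n).
H = [Fraction(0)]
for n in range(1, 200):
    H.append(H[-1] + Fraction(1, n))

assert h == H  # the recursion is solved exactly by harmonic numbers
```

This matches Lemma 2: among $n$ random-order tuples, the expected frontier has only $H_n = O(\log n)$ members, so frontiers stay small even for large strategy sets.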
We now analyze the complexity of FT-LDP and FT-Elimination for frontier tracking when the computation graph is a linear graph. In this case, the edge cost frontiers involved in both algorithms have a cardinality of 1. For more complicated graphs, these cardinalities depend on the elimination operations, which in turn depend on the specific structure of the computation graph. Therefore, we also give the one-step complexity of FT-LDP and FT-Elimination when the cardinality of the edge cost frontier is not 1.
Lemma 3. For FT-LDP in Algorithm 3, assume that operators $v_i$ and $v_{i+1}$ both have $p$ parallelization configurations, the cumulative frontier of $v_i$ has a cardinality of $f$ for each configuration, and the edge cost frontier has a cardinality of $g$. The complexity of solving the cumulative frontier for $v_{i+1}$ (i.e., for all of its parallelization configurations) is $O(p^2 f g \log(pfg))$.
Proof. According to the assumptions, combining the cumulative frontier of $v_i$ under one configuration with the edge cost frontier yields $fg$ tuples. The candidate set for one configuration of $v_{i+1}$ needs to enumerate all $p$ parallelization configurations of $v_i$ and thus has a cardinality of $pfg$. According to Lemma 1 and Lemma 2, the cost frontier of this set has an expected size of $O(\log(pfg))$ and finding it requires a complexity of $O(pfg \log(pfg))$. As all $p$ parallelization configurations of $v_{i+1}$ need to be enumerated to find the cumulative frontier, the overall complexity is $O(p^2 f g \log(pfg))$.
Theorem 1. For a linear computation graph containing $n$ operators, and assuming that each operator has at most $p$ parallelization configurations, the overall complexity of FT-LDP in Algorithm 3 is $O(n^2 p^2 \log p \log(np))$.
Proof. For a linear graph, the edge cost frontier has a cardinality of 1 for any edge and any pair of parallelization configurations, i.e., $g = 1$. The expected cardinality of the cumulative frontier of operator $v_i$ is bounded by $O(\log p^i) = O(i \log p)$ because there are at most $p^i$ partial parallelization strategies from operator $v_1$ to $v_i$, as each operator has at most $p$ parallelization configurations. According to Lemma 3, the complexity of computing the cumulative frontier for operator $v_{i+1}$ is $O(p^2 \, i \log p \, \log(ip \log p))$. Summing up the complexity from $i = 1$ to $n$, we obtain the overall complexity of Algorithm 3 as $\sum_{i=1}^{n} O(p^2 \, i \log p \, \log(ip \log p))$, which can be simplified as $O(n^2 p^2 \log p \log(np))$.
For a linear computation graph, FT-Elimination conducts node eliminations (as in Eq. 4) to simplify it to a graph that contains only two nodes. In the following, we analyze the complexity of node elimination and of FT-Elimination.
Lemma 4. For node elimination in Eq. 4, assume that the three operators involved (i.e., $v_{i-1}$, $v_i$, and $v_{i+1}$) all have at most $p$ parallelization configurations, and that the two edge cost frontiers have cardinalities of $f$ and $g$, respectively. Then node elimination has a complexity of $O(p^3 f g \log(pfg))$.
Proof. According to the assumptions, for a fixed pair of configurations of $v_{i-1}$ and $v_{i+1}$, the candidate set has a cardinality of $pfg$, and finding its cost frontier has a complexity of $O(pfg \log(pfg))$ according to Lemma 1. For node elimination, we need to enumerate the $p^2$ possible combinations of the parallelization configurations of operators $v_{i-1}$ and $v_{i+1}$. Therefore, the overall complexity of node elimination is $O(p^3 f g \log(pfg))$.
Theorem 2. For a linear computation graph containing $n$ operators, and assuming that each operator has at most $p$ feasible parallelization configurations, the overall complexity of using FT-Elimination for frontier tracking is $O(n^2 p^3 \log p \log(np))$.
Proof. For a linear graph, both edge cost frontiers have a cardinality of 1 initially. FT-Elimination eliminates the nodes according to the topological order, and each time it eliminates the second node in the remaining graph. For the $i$-th node elimination, the edge cost frontier accumulated by the previous eliminations has an expected cardinality of $O(\log p^i) = O(i \log p)$, while the cardinality of the other edge cost frontier is 1. According to Lemma 4, the $i$-th node elimination has a complexity of $O(p^3 \, i \log p \, \log(ip \log p))$. Summing up the complexity from $i = 1$ to $n-2$, the result is $\sum_{i=1}^{n-2} O(p^3 \, i \log p \, \log(ip \log p))$, which can be reduced to $O(n^2 p^3 \log p \log(np))$.
4 The TensorOpt System
MeshTensorFlow requires users to find a proper parallelization strategy by themselves and to program the strategy explicitly. FlexFlow and OptCNN are based on Legion, which is not a widely used system and lacks the rich packages of popular DL systems such as TensorFlow and PyTorch. ToFu is not open source and thus its usability remains unclear. Moreover, these systems cannot track the trade-off between different costs, which is important for scenarios such as scheduling in a multi-tenant cluster and price considerations on the cloud. To solve these problems, we develop a system called TensorOpt to make auto-parallelism user-friendly.
4.1 Overall Description and API
TensorOpt is built on top of TensorFlow, with a minimal extension of TensorFlow’s API. TensorFlow scripts can be run as auto-parallel jobs on TensorOpt with only a few changes. Users only need to specify their preferences for the parallelization strategy with some configurable options (to be introduced later), and TensorOpt will invoke the FT algorithm to search for the desired parallelization strategy. TensorOpt also runs the chosen parallelization strategy automatically without user intervention, and the details of parallel execution, e.g., the splitting of tensors among the GPUs and the communication among GPUs, are made transparent to users.
When running DNN training jobs, several factors need to be considered, e.g., efficiency, parallelism, and cost. (Parallelism refers to the number of GPUs to be used, which also determines the amount of available memory and is important to the training throughput, i.e., the average number of training samples processed per second.) A user who runs a job on an exclusive cluster may want to use all the GPUs in the cluster to minimize the execution time. But if the job is run on a shared cluster, the cluster scheduler may want to know the performance (i.e., training throughput) of the job under different parallelisms to determine how much resource to allocate to the job [gandiva]. A user who runs a job on the cloud may want to balance cost and efficiency. Considering these different needs, TensorOpt currently provides the following three options for parallelization strategy search.
Mini-time finds the parallelization strategy that minimizes the per-iteration time while satisfying the memory constraint, under a user-specified parallelism. This option is suitable for running jobs on pre-allocated devices or an exclusive cluster.
Mini-parallelism finds a parallelization strategy that requires the minimum number of devices (to satisfy the memory constraint). It may be used for program correctness checking or for cost minimization, because per-GPU throughput usually decreases with parallelism and thus training with the minimum parallelism is the most cost-effective.
Profiling generates the minimum per-iteration time under a range of parallelism (without actually running the job), which is achieved by running the FT algorithm to minimize per-iteration time under these parallelisms. Note that a job may not be able to run if the parallelism is too small due to insufficient memory. If the parallelism is too large, per-iteration time may increase due to costly communication. This option can be used by the cluster scheduler or the cloud user to determine the proper parallelism for a job. Once the parallelism is determined, users can run TensorOpt in the mini-time mode.
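A scheduler or cloud user consuming the profiling output could, for example, pick the smallest parallelism whose per-iteration time is close to the best achievable. This is a hypothetical sketch; `pick_parallelism`, the tolerance, and the profile numbers are made up for illustration:

```python
def pick_parallelism(profile, tolerance=0.1):
    """Choose the smallest parallelism whose per-iteration time is within
    `tolerance` (relative) of the best achievable time.  `profile` maps
    parallelism -> per-iteration time; a missing entry means the job
    cannot run at that parallelism (e.g., insufficient memory)."""
    best = min(profile.values())
    return min(p for p, t in profile.items() if t <= best * (1 + tolerance))

# Made-up profiling output: 4 GPUs cannot run the job (absent from the
# profile), and going beyond 8 GPUs barely improves per-iteration time.
profile = {8: 1.05, 16: 1.00, 32: 0.99}
p = pick_parallelism(profile)  # -> 8
```

Once the parallelism is chosen this way, the job would then be launched in the mini-time mode.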
We provide an example script of using TensorOpt for DNN training in Listing 1. The TensorOpt script is very similar to a TensorFlow script and there are only a few differences. We explain the key differences as follows.
init. As TensorOpt uses the MPI library, the MPI environment is initialized at the beginning. Hardware and network information are also collected for use in parallelization strategy search during initialization.
find_strategy. Users provide their preferences for parallelization strategy with the aforementioned options and the FT algorithm is invoked to find the suitable parallelization strategy according to user configuration.
build_execution_graph. The execution graph for the low-level TensorFlow execution engine is constructed using the chosen parallelization strategy.
4.2 System Design and Implementation
System workflow. Users define a computation graph using the high-level API, and TensorOpt invokes the FT algorithm to find a proper parallelization strategy according to the user configuration. Then TensorOpt spawns multiple processes (one for each device) and creates a TensorFlow execution graph for each process according to the parallelization strategy. The execution graph describes how the job runs on multiple processes. We implemented wrappers for most of the key modules in TensorFlow, e.g., operator, session, and optimizer. When creating the execution graph, TensorOpt propagates most of the parameters in the high-level API to the low-level API, e.g., the name of a tensor or operator, the initializer of a variable, and the strides or padding parameters for a convolution operator. However, the shapes of the tensors are not propagated to the execution graph as they are determined by the parallelization strategy. Users can use the distributed optimizers in TensorOpt in the same way as in TensorFlow and do not need to consider the details of parallelization.
TensorOpt also inserts communication operators into the execution graph for the necessary communication among the processes. TensorOpt uses collective operations (e.g., allreduce and allgather) for all inter-device communication. Collective operations are more efficient and tractable (i.e., their performance is easy to predict) than peer-to-peer communication. A TensorOpt operator is decomposed into several TensorFlow primitive operators as needed. For example, if the partial results need to be merged for matrix multiplication (e.g., $Y = WX$ with the model parameter $W$ split along the column dimension), an allreduce is conducted after the TensorFlow matrix multiplication on each device.
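To make the decomposition concrete, here is a minimal pure-Python sketch in which two "devices" each compute a partial matrix product and the final summation plays the role of the allreduce; the shapes and values are illustrative:

```python
def matmul(a, b):
    """Dense matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Y = W X with the 2x4 parameter W split along its column (contraction)
# dimension across two "devices"; X is split along rows to match.
W = [[1, 2, 3, 4], [5, 6, 7, 8]]
X = [[1, 0], [0, 1], [1, 1], [2, 2]]

W_parts = [[row[:2] for row in W], [row[2:] for row in W]]  # device 0 / 1
X_parts = [X[:2], X[2:]]

# Each device computes a partial product; summing the partials
# elementwise emulates the allreduce step.
partials = [matmul(w, x) for w, x in zip(W_parts, X_parts)]
Y = [[sum(p[i][j] for p in partials) for j in range(2)] for i in range(2)]

assert Y == matmul(W, X)  # merged result equals the unsplit product
```

Splitting along the contraction dimension is exactly the case where each device holds only a partial sum, which is why an allreduce (rather than an allgather) is required.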
Flexible tensor splitting. MeshTensorFlow names each dimension of a tensor (called a logical dimension) and has two restrictions for splitting a tensor among the GPUs. First, the same device mesh is used for all operators in the computation graph. For example, four devices cannot switch between a one-dimensional mesh (i.e., $4 \times 1$) and a two-dimensional mesh (i.e., $2 \times 2$) for different operators. Second, if a logical tensor dimension is split across a device mesh dimension, then all operators having this tensor dimension also need to be split across that device mesh dimension. For example, in a convolutional neural network, if the batch dimension of the data tensor is split across all devices for the convolution layers (i.e., data parallelism), then the fully connected layers also need to be split in the batch dimension. However, model parallelism is usually more efficient for fully connected layers [OneWierdTrick].
Obviously, the restrictions in MeshTensorFlow reduce the flexibility of parallelization strategies and hence degrade performance, as we will show in Section 5. Therefore, TensorOpt removes the two restrictions and allows different operators to have independent device meshes and tensor splittings. However, this flexibility results in the re-scheduling problem, and we provide such an example in Figure 5: a tensor is split among 4 GPUs in the length dimension when generated as the output of one operator, but the downstream operator requires it to be split in the sample dimension when used as input.
In this case, TensorOpt conducts tensor re-scheduling to adjust the output split of a tensor to the required input split. Collective communication is used for tensor re-scheduling, and TensorOpt finds the optimal communication operations by solving a shortest path problem. TensorOpt builds a graph in which the nodes are different tensor splits, and a directed edge connects two tensor splits if one can be transformed into the other with a single communication operation; the edge weight is the time taken by that communication. Thus, the optimal sequence of communication operations corresponds to the shortest path from the output tensor split to the required input tensor split. TensorOpt fuses the sequence of communication operations into one operator to reduce intermediate memory usage. The FT algorithm also takes the cost of tensor re-scheduling into consideration (as edge cost) when tracking the cost frontier.
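The shortest-path formulation can be sketched with a standard Dijkstra search. This is a hypothetical example: the split names 'L' (length), 'S' (sample), 'R' (replicated) and the edge weights are illustrative, not TensorOpt's actual transformation graph:

```python
import heapq

def cheapest_reschedule(edges, src, dst):
    """Dijkstra over a graph whose nodes are tensor splits and whose
    edge weights are the times of single collective operations.
    Returns (total time, list of splits along the optimal path)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    heap = [(0.0, src, [src])]
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in adj.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Going through the replicated split (two cheap collectives) beats the
# single direct but expensive transformation.
edges = [("L", "R", 4.0), ("R", "S", 4.0), ("L", "S", 10.0)]
cost, path = cheapest_reschedule(edges, "L", "S")
```

The resulting path corresponds to the fused sequence of collectives, and its total weight is what FT charges as the edge cost.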
Data loading. Existing auto-parallel systems (e.g., OptCNN and FlexFlow) require users to manually organize the data samples into the input split required by the parallelization strategy. For better usability, TensorOpt allows users to load training data by data parallelism and enjoy the data loading pipeline optimizations in popular deep learning frameworks such as TensorFlow and PyTorch. In this case, the operator that loads data is constrained to use data parallelism and the input data is adjusted to the desired input split via tensor re-scheduling when necessary. The cost of this re-scheduling is also considered when searching the parallelization strategy.
Tensor reuse. For some tensors, both the output operator that generates them and the input operator that consumes them need them for backward propagation. For a tensor that needs re-scheduling, the two copies before and after re-scheduling are physically different (having different splits), and a straightforward solution is to keep both copies (i.e., one for the output operator and the other for the input operator). To save memory, TensorOpt allows tensor reuse by providing three configurations for these tensors, i.e., keeping the copy before re-scheduling, keeping the copy after re-scheduling, and keeping both copies. If only one copy is kept, the other copy is reconstructed by re-scheduling when needed. Extra dependencies are inserted into the execution graph to ensure that tensor reuse is only activated during backward propagation. FT considers both memory and communication cost when choosing the configuration for a tensor.
Table 1: Model, Parameter (GB), Batch Size, Memory (GB).
5 Experimental Results
In our experimental evaluation, we first explore the trade-offs among different objectives (e.g., execution time, memory consumption, and network communication) for popular DNN models by analyzing their cost frontiers. Then we evaluate the accuracy and efficiency of the FT algorithm. We also test the efficiency of the TensorOpt system for distributed DNN training. In our cluster, each machine is equipped with 8 NVIDIA Tesla V100 GPUs (with 16 GB on-chip memory), a 48-core Intel(R) Xeon(R) Platinum 8160 CPU, and 256 GB main memory. The GPUs on the same machine use NVLink for communication, while GPUs on different machines use RDMA on 100 Gbps EDR Infiniband for communication. Unless otherwise stated, the experiments were conducted using 16 GPUs on two machines. The statistics of the models used in the experiments are listed in Table 1, where memory is the estimated peak memory consumption for training on a single GPU.
5.1 Cost Frontier Analysis
Cost frontier for different models. In Figure 6, we plot the cost frontier between memory consumption and per-iteration time for some popular DNN models. Note that each point on the cost frontier represents a parallelization strategy, and the coordinates of the point represent the memory consumption (on each GPU) and the per-iteration time of that strategy. We are interested in large models as training them is more challenging; the shape of the cost frontier for some small models (e.g., VGG16) is also similar to that of the large models (e.g., WideResNet). We also decompose the per-iteration time of TensorOpt into network time and computation time, and plot them using dotted lines. We did not include FlexFlow [FlexFlow] because both OptCNN [OptCNN] and FlexFlow optimize per-iteration time and they have similar performance for most of the workloads. We simulated ToFu using our cost model by splitting all the tensors among all the devices and disabling tensor replication. For MeshTensorFlow, we solved its cost frontier by adding the tensor split restrictions. Data Parallel, OptCNN, and ToFu provide a single strategy instead of tracking the cost frontier, and thus each of them corresponds to only one point in Figure 6. For RNN, the performance of Data Parallel is poor (taking 109 GB of memory and 39 seconds per iteration) and we do not plot it in the figure for a clearer presentation of the results of the other methods. For WideResNet, the cost frontier of MeshTensorFlow is a single point that coincides with Data Parallel. From the results in Figure 6, we can make the following observations.
First, the computation time remains stable under different parallelization strategies for TensorOpt, but the network communication time decreases when using more memory, which causes the per-iteration time to decrease. Therefore, the dotted green line can also be regarded as the approximate cost frontier between network communication time and memory consumption. For WideResNet, the computation time increases when memory is limited, because the parallelization strategies conduct redundant computation on different GPUs to reduce network communication.
Second, for all three models, the network communication time (and hence the per-iteration time) drops rapidly when we increase available memory to a certain threshold and remains relatively stable when memory exceeds the threshold. We call the point at this threshold the turning point on the cost frontier. We found that when memory is limited, tensors that need re-scheduling will keep only one copy and a re-scheduling is needed to reconstruct another copy during back propagation, which incurs communication overhead. When the amount of memory increases, the parallelization strategies tend to keep both copies for these tensors and thus the network communication time drops. It is difficult to further reduce the network communication time when memory is already sufficient as most re-scheduled tensors have enough space to keep both copies. From an economical point of view, the memory used at the turning point may be a suitable choice for memory provision as using less memory will significantly degrade the per-iteration time but investing more memory has only marginal performance benefits.
Third, by removing the restrictions on tensor split in MeshTensorFlow, TensorOpt significantly outperforms MeshTensorFlow. For both RNN and Transformer, the cost frontier of TensorOpt is always below that of MeshTensorFlow, meaning that TensorOpt has shorter per-iteration time using the same amount of memory. Moreover, MeshTensorFlow cannot work in the small-memory region, meaning that the minimum memory needed by MeshTensorFlow is significantly higher than that required by TensorOpt. For WideResNet, the optimal strategy of MeshTensorFlow is data parallel because the initial layers dominate the overall complexity and favor data parallel, and MeshTensorFlow cannot switch to other configurations for the other layers due to its restrictions.
Finally, for all three models, Data Parallel has poor performance with large memory consumption and long per-iteration time. OptCNN always finds the point with the shortest per-iteration time on TensorOpt’s cost frontier as it is designed to minimize the per-iteration time. In contrast, ToFu always uses a small amount of memory with a long per-iteration time. Compared with OptCNN and ToFu, TensorOpt can work for any point on the frontier, which brings better flexibility to adapt to resource availability and cost-efficiency trade-offs.
Influence of different factors. To better understand the influence of different factors on the cost frontier, we plot the cost frontier for training Transformer using TensorOpt under different model sizes and network settings in Figure 7. In Figure 7(a), we control the model size of Transformer by adjusting its hidden size. The results show that for the same model structure with different sizes, the cost frontiers have a similar shape but the turning point has larger memory consumption for larger models. In Figure 7(b), no RDMA uses Infiniband directly (by disabling RDMA) for cross-machine communication, which makes the bandwidth approximately half that of RDMA, while 4x RDMA assumes the cross-machine bandwidth is four times that of RDMA and corresponds to NVIDIA DGX, which has 4 Infiniband network cards. The results show that the cost frontiers have a similar shape and the memory consumption at the turning point is almost identical across the configurations. This is because in all three cases cross-machine communication is slower than intra-machine communication (e.g., even 4x RDMA is 10 times slower than NVLink), and thus the parallelization strategies will always try to reduce the amount of cross-machine communication. However, the per-iteration time of 4x RDMA is only half that of no RDMA at the turning point, which suggests that cross-machine bandwidth has a big impact on performance. In Figure 7(c), we train the model with 8 GPUs on a single machine but use different methods for intra-machine communication. The bandwidth of PCIe is approximately 1/20 that of NVLink according to our measurement. The results show that using NVLink provides a significant reduction in per-iteration time compared with PCIe at the same memory consumption.
The results reported in Figures 7(a)-(c) show that different model sizes and network settings may result in different parallelization strategies with significantly different costs. As it is non-trivial to find the optimal parallelization strategy for a particular network setting (or other hardware setting) and model size, the ability to track the cost frontier makes the FT algorithm a powerful tool to efficiently characterize the influence of various factors on training performance.
Flexibility in adapting to resource availability. One unique advantage of the FT algorithm over existing parallelization strategy search algorithms is its flexibility in adapting to different resource situations. We illustrate this by plotting the relation between per-iteration time and parallelism for WideResNet and Transformer in Figure 8. Note that in practice we cannot change the on-chip memory of the GPUs, so the amount of memory is actually controlled by the parallelism (i.e., providing more memory by using more GPUs). The results show that when the number of GPUs is small (e.g., 8), Data Parallel and OptCNN cannot run the training job but TensorOpt can. For both models, running with 8 GPUs may be the most cost-effective because the per-iteration time only reduces marginally for TensorOpt when increasing to 16 GPUs (possibly because of expensive cross-machine communication), whereas Data Parallel and OptCNN require at least 16 GPUs to run. ToFu can run under a small parallelism, but its per-iteration time even increases when more GPUs are provided; we found this is because ToFu excessively minimizes memory consumption, which incurs a large amount of costly cross-machine communication when using 16 GPUs. TensorOpt is flexible in using different levels of available resources because it tracks the cost frontier and can select any strategy on the frontier according to resource availability. When the number of GPUs is small, TensorOpt chooses a strategy with low memory consumption, but it can also minimize the per-iteration time when the number of GPUs is sufficient. Moreover, the strategy transition in TensorOpt is seamless and automatic with the cost frontier.
5.2 Accuracy and Efficiency of FT
We use the FT algorithm to track the cost frontier and estimate the costs of the parallelization strategies. Thus, it is important that the algorithm provides an accurate cost estimation and runs efficiently.
We report the cost estimation error of FT for different models in Table 2. The error is defined as $|c_e - c_a| / c_a$, where $c_a$ is the actual cost and $c_e$ is the estimated cost. The reported error is the average over 20 randomly sampled parallelization strategies for each model. The results show that FT has a small estimation error (below 8% in all cases) and consistently underestimates the costs. We found that FT underestimates the network communication time (and hence the overall execution time) because some communication overheads are not taken into consideration, e.g., the progress synchronization among the devices and the coordination messages for collective communication. FT underestimates the memory consumption because some temporary tensors that take up memory are not modeled. To prevent TensorOpt from running out of memory, we can choose a parallelization strategy that has slightly lower memory consumption than the devices' on-chip memory. For example, for GPUs with 16GB memory, a parallelization strategy with 14.5GB (about 90% of capacity) peak memory consumption would be safe. We also found that using the simplified method in OptCNN and FlexFlow for communication time estimation (i.e., dividing the data volume by the network bandwidth) leads to large errors in cost estimation. For example, its estimation error in the network communication time is 74.8% for RNN.
Table 2: Model, Execution Time, Network Time, Memory.
We report the running time of the FT algorithm for different models when tracking the cost frontier under 16 GPUs in Table 3. The results were measured using the CPU of a single machine in our cluster. FT-Elimination uses elimination to simplify the graph to only two nodes, while no multi-thread disables the multi-threading in FT-LDP. The results show that FT-LDP has significantly shorter running time than FT-Elimination, which is consistent with the complexity analysis in Section 3.3. Multi-threading also effectively reduces the running time, especially for models with a large number of operators (e.g., WideResNet). Overall, the running time of FT-LDP is acceptable (tens of minutes for very complex models) considering the long training time of DNN models (e.g., days or even weeks).
Table 3: Running time of FT-Elimination, FT-LDP (no multi-thread), and FT-LDP for different models.
5.3 Efficiency of TensorOpt
We evaluated the efficiency of TensorOpt by comparing with Horovod [horovod] for training different models with 16 GPUs. Horovod is the state-of-the-art execution engine for data parallelism. We did not compare with ToFu because it is not open source. We also did not compare with MeshTensorFlow because the parallelization strategy can only be set manually in MeshTensorFlow, which makes it hard to tune a strategy that runs. As OptCNN and FlexFlow are based on Legion, the comparison may not be fair due to the differences in execution engine. Horovod uses data parallelism for training and (in a way similar to TensorOpt) delegates single-machine execution to TensorFlow. We used two configurations for TensorOpt: mini-time minimizes the per-iteration time under the given parallelism, while data parallel uses the same parallelization strategy as Horovod. The Transformer model used in this experiment (with 4.8 GB of parameters) is smaller than the one in Table 1 as Horovod cannot run the large model.
The results in Table 4 show that TensorOpt in the mini-time mode achieves significantly shorter running time than Horovod for VGG16 and WideResNet, which validates the advantage of auto-parallelism. In the data parallel mode, TensorOpt has slightly longer per-iteration time than Horovod; we found that this is because Horovod only considers data parallelism and so can merge the synchronization for small tensors to fully utilize the bandwidth, whereas in auto-parallelism we cannot merge communication operations as some operations may block the computation. For Transformer, the three configurations have similar performance because the per-iteration time of data parallel is close to the minimum.
Table 4: Per-iteration time of Horovod, TensorOpt (mini-time), and TensorOpt (data parallel) for different models.
6 Conclusions
We presented the FT algorithm for parallelization strategy search and the TensorOpt system for distributed DNN training. The flexibility of FT allows us to train large models with limited memory or to maximize training efficiency when memory is sufficient. Based on FT, TensorOpt makes distributed DNN training more user-friendly by automatically searching for and executing parallelization strategies. Using TensorOpt is as easy as vanilla TensorFlow: users only need to define the computation graph and provide their preference for the parallelization strategy. Our experimental results validate the effectiveness of the FT algorithm for parallelization strategy search and the flexibility of TensorOpt in distributed DNN training under different resource availability.