1. Introduction
Deep learning (DL) models have become increasingly complicated in the artificial intelligence (AI) community in pursuit of better accuracy. Training deep models is extremely time- and resource-consuming, and distributed training with multiple devices is an irreversible trend, especially for large models (Amodei, 2019; Rajbhandari et al., 2019). To harness computing power for better throughput, a critical challenge is how to map diversified workloads to hardware accelerators automatically and efficiently. Existing solutions such as data parallelism, model parallelism and pipeline parallelism make tradeoffs between computation, communication, and development efficiency. Data parallelism (DP) is workload-neutral for models that fit into a single device, but faces memory footprint pressure for large models. Model parallelism (MP) (Shoeybi et al., 2019; Shazeer et al., 2018; Jia et al., 2018a; Geng et al., 2019b; Dryden et al., 2019; Lepikhin et al., 2020) and pipeline parallelism (PP) (Huang et al., 2019; Narayanan et al., 2019) are effective ways to alleviate the memory issue of large models by splitting the model among processes, vertically and horizontally respectively. However, expert experience is required to design a specific strategy that fully utilizes hardware under limited computation resources.
Some previous works (Harlap et al., 2018; Raffel et al., 2019; Jia et al., 2018b) that explore distributed plans by combining dynamic programming and heuristic methods have been proposed as promising approaches for training complex models. However, each of these approaches is designed for a specific category of parallelism strategy. (Harlap et al., 2018) aims to find the best solutions for PP in a synchronous setting, while (Raffel et al., 2019; Jia et al., 2018b) search for operator partitioning parallelism (OPP). This limits the applicable scenarios, mainly because the optimal strategies for diversified workloads differ greatly, and none of these planners covers the solution spaces of DP, OPP and PP at the same time. The heuristic method in (Raffel et al., 2019) lacks generalization on NLP models. Another issue is that these planners are coupled with specific APIs and deep learning frameworks, so they only take effect in limited usage. Moreover, the coarse-grained exploration at the layer or operator level misses potentially better solutions, and integrating such planners into other frameworks is unfeasible in the real world. Recently, machine-learning-oriented approaches to optimizing system performance have been receiving increasing attention in the AI research community.
(Mirhoseini et al., 2017) adopts reinforcement learning to learn proper parallelism strategies, which inspired researchers to use learning approaches to extract features of deep models. However, it only searches for simple model parallelism strategies, without the OPP and PP solution spaces, for a given workload and cluster, at the expense of huge time and resource consumption, which makes it inapplicable in industry. In summary, we identify the following deficiencies of these approaches: (1) Limited applicable scenarios: none covers convolution, language model (LM), and search/recommendation models at the same time. (2) Limited parallelism scenarios: none of these efforts supports model parallelism, data parallelism, and pipeline parallelism on a unified computing layer (e.g., a TF graph). (3) Inevitable code intrusion: the planners only take effect when specific APIs are called, and thus fail to shield users from low-level distributed details.
We propose AutoMAP, a unified framework for exploring distributed execution plans for DNN workloads, which works on HLO IR via a DQN method. AutoMAP operates on HLO IR instead of operators or layers. HLO IR is an intermediate representation produced by XLA (Accelerated Linear Algebra) in the TensorFlow framework; it describes the entire training task with more general and expressive computation instructions than the operations in a TensorFlow GraphDef. Each instruction contains all necessary information for computation, along with extra information such as the name of the operator it belongs to. There are two reasons for choosing HLO IR as the operational level of AutoMAP. First, exploring distributed plans on HLO IR can achieve better performance thanks to its finer granularity compared to operators. Second, XLA can exist independently of TensorFlow and supports other front ends such as Jax (Bradbury et al., 2018) and Trax (authors, 2020), which means no intrusion into user code. Figure 1 gives the high-level design of the TF/XLA compiler. As the figure shows, the XLA compiler compiles a TF graph (an ML network in TF) into executable machine code through a sequence of stages. The TF graph is first transformed into HLO IR by a frontend (e.g., the API(39)). Optimizations such as operator fusion and common-subexpression elimination (Muchnick and others, 1997) are performed on HLO before the graph is transformed into a lower-level representation for a target hardware architecture.
Deep Q-Network (DQN) is a reinforcement learning (RL) approach that teaches machines to interact with an environment and receive rewards for performing the right actions until they successfully meet their goals. It is adopted in AutoMAP to learn the features of deep models and provide workload-neutral distributed plans on given computation resources. It should be noted that the solution space is still huge even with DQN; therefore, some heuristic pruning methods are also integrated into our approach. As far as we know, no previous work focuses on exploring strategies that include all three categories of parallelism mentioned above simultaneously with DQN.
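To make the Q-learning mechanics concrete, the following is a minimal, self-contained sketch of the Bellman target and temporal-difference update that DQN approximates with a neural network. The states, actions, rewards and learning rate here are purely illustrative and are not AutoMAP's actual configuration.

```python
# Illustrative Q-learning update (hypothetical toy problem, not AutoMAP's code).
# Q-values for a tiny 2-state, 2-action task stored in a dict; DQN replaces
# this table with a neural network but uses the same Bellman target.
GAMMA = 0.9

Q = {
    ("s0", 0): 0.0, ("s0", 1): 0.0,
    ("s1", 0): 0.0, ("s1", 1): 0.0,
}

def td_target(reward, next_state, done):
    """Bellman target r + gamma * max_a Q(s', a); just r at episode end."""
    if done:
        return reward
    return reward + GAMMA * max(Q[(next_state, a)] for a in (0, 1))

def q_update(state, action, reward, next_state, done, lr=0.5):
    """One temporal-difference step toward the Bellman target."""
    target = td_target(reward, next_state, done)
    Q[(state, action)] += lr * (target - Q[(state, action)])

# Two updates: a positive (partition-style) reward, then a terminal conflict.
q_update("s0", 0, 0.4, "s1", done=False)
q_update("s1", 1, -1.0, "s1", done=True)
```

The heuristic pruning mentioned above shrinks the number of such updates needed by removing actions whose outcomes are already determined.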
As shown in Figure 2, AutoMAP performs distributed plan exploration at the HLO IR layer. Compared with previous approaches, this has the following advantages: (1) Freedom from user-code intrusion: the user only needs to provide a single-device model, and the distributed details generated by our AutoMAP framework are completely shielded. (2) Rich and unified parallelism and application scenarios: DP/MP/PP are unified for CNN/LM/recommendation models. (3) Diverse programming abstractions over HLO IR: popular AI frameworks such as TensorFlow, PyTorch (Paszke et al., 2017), and Flax (Bradbury et al., 2018)/Trax (authors, 2020) can all map to this layer. In this work, we leverage the DQN algorithm (Mnih et al., 2013) to automatically explore the search space of operator partitioning parallelism, auto data parallelism and pipeline parallelism over HLO, with device and network interconnect topology specified. We focus on solving the two main challenges of distributing diverse and complex models to distributed heterogeneous hardware platforms: leveraging the DQN algorithm to build a search space that includes optimized strategies over HLO, and leveraging task-specific pruning methods for more efficient exploration of the search space.
To summarize, our contributions are:

We propose a unified framework named AutoMAP for three typical parallelism strategies (i.e., operator partitioning, auto data parallelism and pipeline parallelism) and two typical model types (i.e., CNN and language models);

We leverage DQN with taskspecific pruning strategies to help efficiently explore the search space including optimized strategies;

We greatly simplify the burden on users in the selection and implementation of distributed execution plans. With our framework, users only need to provide a single-card graph, and our framework automatically explores distributed execution plans that are compatible with the hardware computing power and interconnection topology;

We show that our framework can find the optimal solution in a limited time compared to enumeration.
2. Problem Formulation and Preliminaries
Data and model parallelism have been widely used by existing deep learning frameworks to distribute models across devices. Data parallelism is parallelization across multiple devices in parallel computing environments, which allows the devices to operate on the data in parallel. For large models that cannot fit on a single device, model parallelism turns out to be a good choice. Model parallelism (MP) (Bahdanau et al., 2014) partitions a DNN into disjoint subsets and trains each subset on a dedicated device, which reduces the communication cost for synchronizing network parameters but exposes limited parallelism as well as extra communication between model partitions.
Pipeline parallelism (PP) (Harlap et al., 2018; Huang et al., 2019; Fan et al., 2020) goes beyond DP and MP by mixing inter-batch and intra-batch parallelism. In a pipeline scheme, one or more consecutive layers are grouped into stages and processed on separate GPU(s), and both the forward pass and backward pass of all the layers are scheduled within a stage. The planner in PP (Narayanan et al., 2019; Fan et al., 2020) is responsible for cutting model layers into stages, and this approach improves device utilization by pipelining multiple micro-batches. Figure 3 shows the schematics of these three parallel strategies.
Deep RL has proven successful since Deep Q-Learning (DQN) (Mnih et al., 2013) introduced the idea of using neural networks as a Q-function approximator. Rainbow DQN (Hessel et al., 2018) combines several improvements in deep RL and has been shown to be promising for further improving deep RL agents in benchmark environments. Although not straightforward, we leverage the Rainbow agent to assist the search over the massive space of distributed strategies.
The Rainbow agent. Following the methodology of (Hessel et al., 2018), we extend the DQN algorithm with prioritized experience replay, double DQN, and the dueling network architecture (Wang et al., 2016). Furthermore, in contrast to (Hessel et al., 2018), we apply the following changes to successfully train the Rainbow agent: (1) we discard the noisy linear layers (Fortunato et al., 2017), relying on greedy exploration instead; since the agent is already required to learn environmental noise from the user simulator, a possible explanation is that a second noise distribution might be too difficult to learn. (2) We adjust the number of DNN layers for different tasks, since deeper networks have stronger learning capacity. Figure 4 shows the workflow of our method.
Problem formulation for DQN algorithm on HLO IR.
Formally, we define our learning task as follows. In reinforcement learning, the sequential decision-making problem is modeled with the Markov Decision Process formulation defined by the tuple (S, A, P, R, γ). For any Q-learning task, we need to define five aspects: state space, actions, rewards, policy and termination. We illustrate our framework on three optimization problems over directed HLO graphs. Let G = (V, E) denote a directed HLO graph, where V is the set of nodes and E the set of edges. In our setting, each HLO instruction corresponds to one node of G, and the dataflow between a producer instruction and a consumer instruction corresponds to an edge of G. Specially, we refer to the nodes with no inputs as input nodes, the nodes with no outputs as output nodes, and the others as intermediate nodes. Given a device topology D, these optimization problems are:

Auto Data Parallelism (ADP): Given a graph G, find a subset of dimensions of the input nodes such that the communication overhead of the propagation graph, from the selected slicing dimensions of the inputs to the outputs, is minimized.

Operator Partitioning Parallelism (OPP): Given a graph G, find a slicing strategy over all dimensions of all trainable variables of G such that the average device utilization is maximized.

Pipeline Parallelism (PP): Given a graph G and the number of stages S expected to be split, find a subset of nodes as cutting points such that the pipeline length, with cross-stage communication overlap considered, is minimized.
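The node classification used in the formulation above can be sketched as a small graph routine. This is an illustration of the definitions only; the function and node names are our own and not part of AutoMAP.

```python
# Hypothetical sketch: classify nodes of a directed HLO-like graph G = (V, E)
# into input nodes (no in-edges), output nodes (no out-edges), and
# intermediate nodes, matching the problem formulation above.
def classify_nodes(nodes, edges):
    """edges is a list of (producer, consumer) pairs."""
    has_in = {c for (_, c) in edges}
    has_out = {p for (p, _) in edges}
    inputs = [n for n in nodes if n not in has_in]
    outputs = [n for n in nodes if n not in has_out]
    intermediates = [n for n in nodes if n in has_in and n in has_out]
    return inputs, outputs, intermediates

# Tiny example graph: param/data feed a matmul, then add, then loss.
nodes = ["param", "data", "matmul", "add", "loss"]
edges = [("param", "matmul"), ("data", "matmul"),
         ("matmul", "add"), ("add", "loss")]
ins, outs, mids = classify_nodes(nodes, edges)
```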
3. AutoMAP Approach
3.1. Exploration Workflow
In order to decouple distributed plans from the APIs of any specific deep learning framework, the exploration module should be built on an intermediate representation layer designed for describing the computation flow of a deep learning task. Specifically, we build our search algorithm over HLO, borrowed from TensorFlow XLA. Figure 5 shows the workflow of our approach. Given deep models written in any framework (e.g., TensorFlow, PyTorch, MXNet (Chen et al., 2015)), XLA compiles and translates the original flexible computation graph into HLO IR. The Plans Explorer then searches three different categories of plans, including data parallelism, operator partitioning parallelism and pipeline parallelism, over HLO based on the given computation resources.
For pipeline parallelism, we only perform cuts on the forward computation subgraph in HLO, which can be detected via the meta information of each instruction. Both an online inference and an online training approach are provided to explore pipeline parallelism; users need to specify the number of stages in advance for both. Finally, the workflow produces the best plan among all available candidates.
We explore these three categories of plans separately. To cope with the huge solution space and provide fully workload-neutral plans, we use a DQN approach combined with heuristic pruning instead of ordinary heuristic algorithms. For a specified workload, the corresponding solution is found during the training stage or inferred from models that have been trained offline. To fit the reinforcement learning training flow, the state, action and reward must be carefully designed according to their objectives. We briefly introduce our approach in the following subsections; design and implementation details are discussed in Section 4.
3.2. Operator Partitioning Parallelism
3.2.1. DQN flow Setup
As the size of deep learning models keeps increasing, on-device memory is a scarce resource for training tasks. Fortunately, the memory issue can be alleviated through model parallelism. In practice, an effective way to parallelize deep models is to partition operators, which not only alleviates memory pressure but also parallelizes computation. With operator partitioning, the saved memory can be used to inject a larger batch size and improve GPU cluster utilization.
Each instruction produces a unique variable in HLO; therefore, partitioning operators is equivalent to partitioning the variables of each instruction. Derivation rules for each instruction are carefully designed to infer the partitioning decisions of unknown variables or parameters from known ones. Naturally, some partitioning plans are invalid because they violate derivation rules. This can only be detected during a procedure called propagation, which applies the derivation rules to each instruction given the known partitioning decisions of variables or parameters. Propagation terminates in one of three situations: (1) there is not enough information to derive the remaining variables; (2) a conflict is encountered due to a violation of the derivation rules; (3) all variables have been inferred without any conflict.
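The propagation procedure and its three termination conditions can be sketched as a fixed-point loop. The rule encoding below is an assumption made for illustration; AutoMAP's real derivation rules are per-instruction and far richer.

```python
# Minimal sketch (assumed semantics) of propagation: starting from known
# partitioning decisions, repeatedly apply derivation rules until no further
# progress ("incomplete"), a rule violation ("conflict"), or full coverage
# ("complete").
def propagate(decisions, rules):
    """
    decisions: dict variable -> decision (e.g. a dim index or 'replicate')
    rules: list of (src_var, dst_var, fn), fn maps src decision to dst decision
    Returns (status, decisions) with status in
    {'incomplete', 'conflict', 'complete'}.
    """
    all_vars = {v for (s, d, _) in rules for v in (s, d)} | set(decisions)
    changed = True
    while changed:
        changed = False
        for src, dst, fn in rules:
            if src in decisions:
                derived = fn(decisions[src])
                if dst in decisions:
                    if decisions[dst] != derived:   # derivation rule violated
                        return "conflict", decisions
                else:
                    decisions[dst] = derived
                    changed = True
    status = "complete" if set(decisions) == all_vars else "incomplete"
    return status, decisions

# Example: partitioning 'w' on dim 0 forces its consumer output 'y' to dim 0.
rules = [("w", "y", lambda d: d)]
status, out = propagate({"w": 0}, rules)
conflict_status, _ = propagate({"w": 0, "y": 1}, [("w", "y", lambda d: d)])
```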
We stipulate that only trainable variables, i.e., those corresponding to model parameters, may be partitioned in our approach. We also set the heuristic objective of operator partitioning parallelism to partition trainable variables as much as possible.
In AutoMAP, since trainable variables may have different dimension sizes, we decide for each dimension of each trainable variable whether it is replicated or partitioned across all devices. The dimension statuses of all trainable variables together constitute one strategy, whose feasibility must be verified by running propagation on the HLO.
State and Action. We define the state as a one-dimensional vector that concatenates the partition status of every dimension of all trainable variables, with three possible values at each position. The action is a binary flag: True partitions the current dimension across all devices, and False replicates it. Reward. According to the objective above, we encourage partitioning by giving it a higher reward than replication, and punish conflict cases with a negative reward.
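For concreteness, the state encoding and binary action can be sketched as follows. The status values and helper names are illustrative assumptions; the actual encoding used in AutoMAP is described in Section 4.

```python
# Hypothetical encoding of the OPP search state: one slot per dimension of
# each trainable variable, -1 = undecided, 0 = replicate, 1 = partition.
# The agent fills slots left to right with a binary action.
UNDECIDED, REPLICATE, PARTITION = -1, 0, 1

def init_state(var_ranks):
    """var_ranks: ranks of trainable variables, e.g. [2, 1] for W and bias."""
    return [UNDECIDED] * sum(var_ranks)

def apply_action(state, pos, partition_flag):
    """Fill position `pos` with the chosen status and advance the cursor."""
    new_state = list(state)
    new_state[pos] = PARTITION if partition_flag else REPLICATE
    return new_state, pos + 1

state = init_state([2, 1])                   # e.g. a matrix and a bias vector
state, pos = apply_action(state, 0, True)    # partition weight dim 0
state, pos = apply_action(state, pos, False) # replicate weight dim 1
```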
3.2.2. Linkage Group
The search space of operator partitioning is so huge that even DQN requires substantial time with the above setup, so we introduce a heuristic pruning technique called linkage groups. A linkage group exists for each trainable variable and records the deterministic partitioning decisions for other trainable variables implied by its own decision. Figure 6 illustrates the concept of the linkage group. When the partition status of one dimension has been decided, we check whether that dimension and its partition decision appear in a linkage group. All deterministic decisions implied by the current decision are then inferred via the linkage group, so the search process is greatly pruned and unnecessary exploration is avoided.
Due to the termination conditions of propagation mentioned above, a linkage group contains only part of the partitioning decisions of other trainable variables: the propagation driven by a single trainable variable and its decision often stops early when not enough information is available. Nevertheless, larger linkage groups yield a stronger pruning effect.
3.3. Auto Data Parallelism
Implementing data parallelism over HLO is not straightforward because the variables representing the input batch size cannot be easily identified. We observe that the batch-size dimension follows the data flow throughout the entire model; as a result, most variables are affected when a partition happens on the batch-size dimension. With the help of the propagation procedure, the variables representing training data and labels, along with their batch-size dimensions, can be easily detected.
More formally, the objective is to find the partition strategy over all input tensors that results in the largest number of tensors being partitioned: the more tensors that become partitioned under the propagation rules when slicing the input tensors, the closer we are to the objective. In AutoMAP, the action and reward are almost the same as in the operator partitioning task; only the state differs. Specifically, we define the state as a one-dimensional vector that concatenates the partition status of every dimension of all input tensors.
3.4. Pipeline Parallelism Exploration by Online Training
There are two key issues in pipeline partitioning: cutting the model into multiple stages, and placing the stages onto a given GPU cluster. In industry, GPU clusters are typically hierarchical, with relatively higher communication bandwidth within each node than across nodes(28). In AutoMAP, we highlight that the exploration should be performed only on HLO. The main idea is that the distributed plan should allocate computation resources according to the computation ratio among stages, and a stage allocated more than one device should be replicated. Figure 7 shows the common mapping from HLO to devices. Stage 0 is assigned two devices with an NVLink connection so that gradient reduction achieves good performance with NCCL(27); the activations between stage 0 and stage 1 are transmitted via Ethernet.
State and Action. Pipeline length is an effective way to estimate pipeline performance and is influenced by the activation size across stages, the computation time, and the gradients all-reduce time in each stage. In AutoMAP, we precompute these features at every possible pivot and encode them into one vector before applying the final cuts at the current step. The action outputs the pivot at each step. Once a cut has been applied on HLO, the model is further split into two stages, and we require that the next cutting point not fall inside a previous stage.
Reward. For a pipeline model, we calculate the pipeline length L to estimate performance, and use −L as the reward, since higher performance corresponds to a shorter L.
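As an illustration of why −L is a sensible reward, a simplified pipeline-length estimate can be written down directly. This formula is our assumption of a GPipe-style schedule for exposition, not AutoMAP's exact cost model.

```python
# Simplified pipeline-length estimate (an assumption for illustration): with
# m micro-batches and per-stage times t_i, a GPipe-style schedule takes about
# sum(t_i) to fill/drain the pipeline plus (m - 1) steps bounded by the
# slowest stage. The reward is the negated length.
def pipeline_length(stage_times, num_microbatches):
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)

def reward(stage_times, num_microbatches):
    return -pipeline_length(stage_times, num_microbatches)

balanced = pipeline_length([1.0, 1.0], num_microbatches=4)
skewed = pipeline_length([0.5, 1.5], num_microbatches=4)
```

Under this estimate, a balanced cut beats a skewed one with the same total work, which is exactly the behavior the −L reward encourages.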
3.5. Pipeline Parallelism by Online Inference
3.5.1. Motivation
For pipeline parallelism planning, we also present an alternative approach that infers an optimal hybrid parallelism strategy in a faster and more generalizable way. It allows us to train on a large generated dataset and run inference on a real-world model.
In order to find the optimal partitioning solution that yields maximal performance under our pipeline parallelism strategy, we need to 1) partition the model into different stages, 2) decide the replication factor for each stage, and lastly 3) map the stages to underlying hardware devices. In the following section, we will formulate our pipeline partitioning problem into a pure mathematical problem whose data can be randomly generated.
3.5.2. Problem Formulation
The original problem states: given an HLO module M and the number of stages S, find the optimal pipeline partition that minimizes the end-to-end time of one batch with pipeline parallelism.
With our profiler, we obtain the per-instruction performance data C, the execution time of each instruction in milliseconds on a given profiling device. For communication, we use our DFG analyzer on M to calculate the parameter sizes W, later used for all-reduce time estimation, and the activation communication A incurred at each instruction if we partition the model at that specific instruction.
The problem is then equivalent to: given the three arrays of profiling data C, A and W of an HLO model M, each of length equal to the number of instructions in the original model, find a partition that minimizes the end-to-end time, which we can calculate with our value function.
Since the number of instructions varies between models, and their profiling data might not even be close, we apply a round of data normalization, described in the following section, to ensure the training data has a consistent size and a reasonably close scale. The problem then becomes an array partitioning problem irrelevant to the input model, and the three arrays C, A and W can be generated at scale.
Our first approach, presented above, uses DQN to search through the solution space for the profiling data generated by each given model M. The approach here instead trains our DQN on generated data for this abstract array partitioning problem, so that it can be applied to real models at inference time.
3.5.3. DQN Workflow
State and Action. We use the three performance metrics mentioned above (C, A and W) and process them, along with device topology metrics, to form the final state representation. The data processing is detailed in Section 4.
Reward. Since we want to minimize the time T of completing one global batch, we use −T as our reward.
Training and inference. First, we use the data generation method detailed above to generate the training dataset, which is then used to create a large number of environments ready to be interacted with. During training, since each environment represents one distribution of performance data, we restrict the number of interactions with each environment to a very small number; in practice, each environment is explored and exploited 50 times.
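The capped training schedule can be sketched as a simple loop. The environment identifiers and agent callback are stand-ins introduced for illustration only.

```python
# Sketch of the training schedule that caps interactions per generated
# environment (50 in our setting). `agent_step` stands in for one
# explore/exploit interaction of the Rainbow agent with an environment.
def train(envs, agent_step, interactions_per_env=50):
    """Visit each environment exactly `interactions_per_env` times."""
    counts = {}
    for env_id in envs:
        for _ in range(interactions_per_env):
            agent_step(env_id)
            counts[env_id] = counts.get(env_id, 0) + 1
    return counts

# Toy run with a tiny cap so the schedule is easy to inspect.
visits = []
counts = train(["env_a", "env_b"], agent_step=visits.append,
               interactions_per_env=3)
```

Restricting visits per environment keeps the agent from overfitting any single generated performance distribution.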
For testing, we use a freshly generated environment that is not in the training set, let the network infer the best partitioning solution, and assess its performance with our value function.
For real-world model inference, we perform the same data preprocessing described in Section 4 and output the best network inference result.
4. Implementation
4.1. Overview
All distributed execution plans are unified into the same DQN workflow. In AutoMAP, we select RAINBOW (Hessel et al., 2018) as the DQN framework, built on PyTorch, to search in parallel for all three categories of strategies. We leverage a cost model to estimate the performance of different plans so that the workflow can produce the best one among all candidates.
The key issues of the DQN workflow for different scenarios are the environment, state, action and reward. We introduce our implementations for operator partitioning parallelism, auto data parallelism and pipeline parallelism, respectively.
4.2. Operator Partitioning Parallelism
State and Action. In our current implementation, the state contains a decision vector and a current position. Figure 8 shows the representation of the decision vector. All dimensions of trainable variables are concatenated into a one-dimensional vector, in which 1, 0, and −1 stand for partitioned, replicated, and undecided status, respectively. The current position is an integer indicating the index in the decision vector to be decided at the next step.
Initially, the decision vector is filled with −1, meaning that no dimension has been decided. Then each dimension is decided step by step within one episode, until a propagation conflict is encountered or all dimension statuses have been decided safely. Figure 9 shows one complete episode.
The action is implemented as a binary value whose positive and negative outcomes represent partitioning and replicating, respectively; the decision takes effect at the current position. When all dimensions of a variable are marked with 0, the variable is replicated across all devices.
Reward. We assign rewards of +0.4 and +0.1 to partitioning and replication, respectively. A −1 reward is given as punishment when propagation over the entire HLO encounters a conflict, which also terminates the current episode.
Linkage Group. Linkage groups are extracted at the beginning of the DQN training task. The extraction procedure is displayed in Figure 10.
Linkage groups are formed by propagating each variable and each of its possible decisions through the entire HLO. Specifically, we pick a single variable with one decision and send this pair to the propagation module to infer other variables' decisions. Since propagation from a single variable and decision cannot make deterministic decisions for every tensor, we extract only the deterministic ones. After all linkage groups have been extracted, the decision order of every dimension in the DQN task is sorted by linkage-group size, from large to small.
With linkage groups, when one decision triggers more than one dimension to be decided, the reward is calculated from the actual numbers of partitioned and replicated dimensions caused by the current step.
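The extraction and ordering steps can be sketched as follows. The rule table below is a hypothetical stand-in for the propagation module; only the structure (per-(variable, decision) groups, sorted large to small) reflects the procedure described above.

```python
# Illustrative linkage-group extraction (assumed mechanics): for every
# (variable, decision) pair, run propagation and record only the decisions
# that become deterministic for other variables.
def extract_linkage_groups(variables, rules, choices=(0, "replicate")):
    """rules: dict (src_var, decision) -> {other_var: forced_decision}.
    Acts as a stand-in for running the propagation module once per pair."""
    groups = {}
    for var in variables:
        for choice in choices:
            groups[(var, choice)] = dict(rules.get((var, choice), {}))
    return groups

# Partitioning w1 on dim 0 deterministically forces w2 and w3 onto dim 0.
rules = {("w1", 0): {"w2": 0, "w3": 0}}
groups = extract_linkage_groups(["w1", "w2", "w3"], rules)

# Decide dimensions with larger linkage groups first to maximize pruning.
order = sorted(groups, key=lambda k: len(groups[k]), reverse=True)
```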
4.3. Auto Data Parallelism
State and Action. The philosophy of designing the state and action is the same as in operator partitioning parallelism. Since trainable variables, hyperparameters, and training data with labels are all in the input list, we need to filter out trainable variables and hyperparameters as much as possible. A useful heuristic is that constant tensors are definitely hyperparameters and trainable variables are marked outside HLO, so it is not difficult to find all candidate tensors.
We concatenate all candidate tensors into a one-dimensional vector, together with the current position index. The action and reward are the same as those designed for the operator partitioning parallelism plan, so the Q network is guided to partition as many tensors as possible. This is consistent with the observation that the more tensors we partition, the more intermediate tensors are affected. The only difference is that there are no linkage groups here.
Reward. We use exactly the same reward as in the operator partitioning parallelism problem to guide the Q network to partition variables as much as possible.
4.4. Pipeline Parallelism Exploration by Online Training
To reasonably simplify the placement problem, which maps HLO cuts to device cuts, we treat the hierarchical computation topology as a linear model starting from the first device of the first node. The search space is still huge, however, and contains many solutions that need not be explored. From a practical perspective, better solutions usually arise when cutting at pivots that map exactly to network boundaries or nearby. Moreover, our implementation requires each stage to contain at least one variable, with the objective of balancing variable loading across devices. We apply these two heuristic pruning methods to filter candidate pivots before training.
First, we take the device cuts that fall exactly on network boundaries as the center solution. Second, a threshold is specified as a radius representing the admissible range around each device cut in the center solution. Third, device cuts are filtered according to the center solution and radius. Finally, all pivots that map from the remaining device cuts are kept as candidate pivots.
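Our reading of this pruning rule can be sketched directly; the numbers below (16 devices, one node boundary, radius 2) are invented for illustration.

```python
# Sketch of device-cut pruning: keep only cut positions within `radius` of
# the network-boundary cuts (the "center solution"); the survivors are then
# mapped back to candidate HLO pivots.
def filter_device_cuts(all_cuts, boundary_cuts, radius):
    kept = []
    for cut in all_cuts:
        if any(abs(cut - b) <= radius for b in boundary_cuts):
            kept.append(cut)
    return kept

# 16 devices, a node boundary after device 8, and a radius of 2 devices:
candidates = filter_device_cuts(range(1, 16), boundary_cuts=[8], radius=2)
```

Out of 15 possible cut positions, only the handful near the node boundary survive, which is what makes the subsequent DQN search tractable.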
State and Action. We precompute three features per stage to encode the state representation. First, we precompute the gradients all-reduce time of the entire pipeline for a cut at each possible pivot in HLO at the current step. This feature is very useful when there is a non-negligible bandwidth gap between intra-node and cross-node links: gradient reduction becomes time-consuming when some stages span nodes due to cutting at an inappropriate position. Figure 11 shows an example of cutting a deep model into four stages in one episode and the corresponding gradient-reduction time of each stage at each step.
Second, we precompute the maximum activation transmission time among all stages for a cut at each possible pivot at the current step. Third, to guide each state toward balanced time costs, the computation balance ratios between the minimum and maximum stages at each pivot are precomputed.
Masking unnecessary pivots is required when making cutting decisions on HLO. There are two kinds of pivots to mask: those filtered out in the pruning stage, and those lying in a previous stage. To mask them on the output of the Q network, we set their Q values to −∞, representing the lowest expectation for those actions.
Reward. The pipeline length L of a deep model can be calculated by our cost model given a pipeline parallelism plan. As mentioned in Section 3.4, the pipeline reward is designed as −L. Moreover, the memory constraint must also be taken into consideration, because some cutting strategies may run out of device memory; we assign a large negative reward to punish this case.
4.5. Pipeline Parallelism Exploration by Online Inference
We first describe data processing procedure, then introduce the DQN workflow.
4.5.1. Data Processing

Data Coarsening & Normalization. Given a real-world model M, we normalize the data into the same scale and size as the data generated in the next section. This is done in three steps: 1) building prefix sums, 2) coarsening the arrays, and 3) normalizing into [0, 1].

Step 1: Prefix Sum. From profiling and DFG analysis on M, we obtain the profiling data C, A, and W. We first build prefix-sum arrays for the computational data C and parameter sizes W: C'[i] = C[1] + ... + C[i], and W'[i] = W[1] + ... + W[i].
The updated arrays C' and W' accessed at index i now represent the computational time / all-reduce size of the first i instructions. The array A is left untouched, because it does not make sense to sum up the cross-stage communication before a specific instruction: A[i] still represents the estimated cross-stage communication if the model were cut at instruction i.

Step 2: Coarsening. In order to adapt to models of different sizes, we scale the profiling data to a fixed length, which we empirically set to 128. For each of the three arrays, we evenly sample 128 points to form the new arrays.
We can do this for C' and W' because they are already in prefix-sum form, and for A because cross-stage communication is specific to each instruction. After coarsening, we lose the ability to partition at instructions not among those 128 points, but the problem becomes independent of the input model size.

Step 3: Normalization Since we want to generalize across different models and to generate a large number of randomized samples, some form of normalization is needed to keep all the data at a similar scale. In practice, we scale the three arrays simultaneously into [0, 1].
After this step, regardless of the original model, the resulting arrays each have length 128 and all elements lie within [0, 1]. These three arrays essentially describe the distribution of computational time, activation sizes, and parameters along the time dimension of the model.
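The three steps can be sketched in a few lines of NumPy. The array names and the joint-maximum normalization are assumptions for illustration; only the prefix sums, the 128-point coarsening, and the [0, 1] scaling come from the text:

```python
import numpy as np

NUM_POINTS = 128  # fixed coarsened length used in the paper

def preprocess(comp_time, allreduce_size, cross_stage_comm, n=NUM_POINTS):
    """Prefix-sum, coarsen to n points, and normalize profiling arrays."""
    comp_time = np.asarray(comp_time, dtype=float)
    allreduce_size = np.asarray(allreduce_size, dtype=float)
    cross_stage_comm = np.asarray(cross_stage_comm, dtype=float)

    # Step 1: prefix sums for computation time and parameter size;
    # cross-stage communication stays per-instruction.
    comp_prefix = np.cumsum(comp_time)
    ar_prefix = np.cumsum(allreduce_size)

    # Step 2: coarsening - take n evenly spaced points from each array.
    idx = np.linspace(0, len(comp_prefix) - 1, n).round().astype(int)
    comp_c, ar_c, cross_c = comp_prefix[idx], ar_prefix[idx], cross_stage_comm[idx]

    # Step 3: scale the three arrays jointly into [0, 1] so different
    # models land on a similar scale (joint maximum is one plausible choice).
    scale = max(comp_c.max(), ar_c.max(), cross_c.max())
    return comp_c / scale, ar_c / scale, cross_c / scale
```

Any model with at least 128 instructions maps to the same fixed-size representation, which is what lets a single network generalize across workloads.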


Data Generation To complement our existing model database, we generate random data of different distributions satisfying the requirements presented above. During data generation, we use a random number generator with an optional distribution parameter (e.g., uniform, normal, binomial) to produce three arrays of floats in [0, 1], and then apply the same transformation described above: building the prefix sums, coarsening, and normalizing the arrays.
In the actual training process, we generate hundreds of thousands of these array groups to construct the training and test sets.
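A minimal sketch of such a generator, assuming uniform and normal distributions and illustrative parameter values; the transformation is inlined so the snippet is self-contained:

```python
import numpy as np

def generate_sample(rng, n_raw=1024, dist="uniform", n=128):
    """Draw three random profiling arrays and apply the same
    prefix-sum / coarsen / normalize transform used for real models."""
    if dist == "normal":
        draw = lambda: np.clip(rng.normal(0.5, 0.2, n_raw), 0.0, 1.0)
    else:  # uniform by default
        draw = lambda: rng.random(n_raw)
    comp, ar, cross = draw(), draw(), draw()
    comp, ar = np.cumsum(comp), np.cumsum(ar)          # prefix sums
    idx = np.linspace(0, n_raw - 1, n).round().astype(int)
    comp, ar, cross = comp[idx], ar[idx], cross[idx]   # coarsen to n points
    scale = max(comp.max(), ar.max(), cross.max())
    return comp / scale, ar / scale, cross / scale     # normalize to [0, 1]

rng = np.random.default_rng(0)
dataset = [generate_sample(rng) for _ in range(4)]
```

Scaling this loop to hundreds of thousands of samples is cheap because each sample is only three length-128 float arrays.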
4.5.2. DQN Workflow
State. For our state representation, we have the following data fed into the network:

Computational Times

Activation Sizes

AllReduce Sizes

Device Topology (a square matrix describing the interconnect speed between any two devices)

Intermediate Partition
All of them are resized to one-dimensional tensors, scaled to [0, 1], and concatenated into one single array to form the state.
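A sketch of this state construction; the argument names and the per-component scaling rule are assumptions for illustration:

```python
import numpy as np

def build_state(comp, act, allreduce, topology, partition):
    """Flatten and concatenate the five inputs into one state vector.

    comp, act, allreduce : length-128 profiling arrays, already in [0, 1]
    topology             : D x D matrix of interconnect speeds
    partition            : current intermediate partition decisions
    """
    parts = [np.asarray(p, dtype=float).ravel()
             for p in (comp, act, allreduce, topology, partition)]
    # Scale each component into [0, 1] before concatenation.
    parts = [p / p.max() if p.max() > 0 else p for p in parts]
    return np.concatenate(parts)

# e.g. 4 devices: 3 x 128 profiling arrays + 4x4 topology + 4 partition slots
state = build_state(np.ones(128), np.ones(128), np.ones(128),
                    np.full((4, 4), 25.0), np.zeros(4))
```

The fixed 128-point coarsening from the previous section is what keeps this state vector the same length for every input model.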
Action. In this approach, we consider both HLO partitioning and device assignment as actions, and they share the same action space.
Reward. The reward is a decreasing function of the end-to-end training time for one batch, so that maximizing the reward means minimizing the end-to-end training time. This is the same reward as used in Section 4.4.
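A hedged sketch of such a reward. The paper's exact formula and out-of-memory penalty are not reproduced; negative batch time is only one common choice satisfying the stated property:

```python
def reward(batch_time, oom=False, oom_penalty=-1.0):
    """Reward for a candidate plan.

    batch_time: measured end-to-end training time for one batch.
    oom       : True if the plan exceeded device memory.

    Negating the batch time makes reward maximization equivalent to
    training-time minimization; the OOM penalty constant is illustrative.
    """
    return oom_penalty if oom else -batch_time
```

Under this shape, a faster plan always receives a strictly larger reward, and infeasible (OOM) plans are pushed below every feasible one by choosing the penalty appropriately.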
5. Experiments
5.1. Experimental Setup
Benchmarks
We evaluate workloads for each distributed execution plan. Table 1 summarizes the five representative DNN models that we use as benchmarks in this section.
HLO of workload
We feed the HLO JSON file and the trainable-variable list of each workload as inputs into the AutoMAP framework. An HLO text file is also provided for debugging in our experiments.
Simulated Hardware Configurations
Table 2 summarizes the three hardware environments in our experiments. In our observation, 4 servers with 8 cards each are sufficient for our training tasks; therefore, we report execution plans with at most 4 servers.
Config | Servers | GPUs per server | Intra-server | Inter-server
A      | 2       | 8x V100         | NVLink       | 25 Gbps
B      | 3       | 8x V100         | NVLink       | 25 Gbps
C      | 4       | 8x V100         | NVLink       | 25 Gbps
Hyperparameters
We fix the training batch size to 64 and use the Adam optimizer (Kingma and Ba, 2014) with different initial learning rates for different exploration tasks. For pipeline tasks, the initial learning rate is set to 0.001; for operator partitioning and auto data parallelism tasks, we use a smaller learning rate of 0.0005.
As for the DQN-specific hyperparameters, we fix the discount factor to 0.6 for all training tasks. We decay the exploration coefficient from 1.0 to 0.1 for all tasks, but the decay speed differs by task type: it reaches the minimum after 2000, 500, and 10000 iterations for operator partitioning parallelism, auto data parallelism, and pipeline parallelism, respectively.
Some general tricks for improving DQN convergence are also integrated into our training tasks. Specifically, we select the prioritized replay buffer (Schaul et al., 2015) and double DQN (Van Hasselt et al., 2016) from Rainbow and fix alpha and beta to 0.2 and 0.6, respectively. The target network is updated every 100 steps and the replay buffer size is fixed to 2000 in all training tasks.
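The exploration schedule described above can be sketched as follows; the linear decay shape is an assumption, while the 1.0 → 0.1 range and the per-task decay horizons come from the text:

```python
def epsilon(step, decay_steps, eps_start=1.0, eps_min=0.1):
    """Linear exploration decay from eps_start to eps_min, reaching the
    minimum after decay_steps iterations (2000 / 500 / 10000 for operator
    partitioning, auto data parallelism, and pipeline tasks)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_min - eps_start)

# The pipeline task keeps exploring the longest:
eps_pp = epsilon(5000, decay_steps=10000)  # halfway through its schedule
```

Giving the pipeline task the slowest decay matches its larger action space: it needs more random exploration before exploiting the learned Q values.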
5.2. Evaluation Results and Analysis
5.2.1. Operator Partitioning Parallelism
There are already heuristic partitioning strategies for transformer models (Shoeybi et al., 2019; Shazeer et al., 2018): partition each attention block, the following MLP layer, and all embedding variables, while replicating the other trainable variables, which matches the objective of AutoMAP. For VGG19, the effective approach is to partition the last MLP block when given a hierarchical hardware configuration like Config B or Config C (Krizhevsky, 2014). Table 3 and Table 4 show our partitioning strategies for the trainable variables of the T5 family and VGG19, where a number denotes the index of the partitioned dimension and variables without a partitioned dimension are replicated. These partition strategies are consistent with our expectations. Since we already know the ground truth for these workloads, the quality of the strategies can be measured in our experiments. We count the variables that should be partitioned as the target for each workload and observe the time cost to approach it. It should be noted that some workloads need a fine-tuning stage to reach solutions of better quality.
Block or Layer      | Variable Partition Strategy
Self-attention      | {q=1, k=1, v=1, o=0}
MLP                 |
Embedding           | {embedding_weights=0}
Layer normalization | {scale=1, bias=1}
Block or Layer | Variable Partition Strategy
Conv layers    | 1 for all conv layers
FC Layer       |
Softmax Layer  | predictions/kernel=1, predictions/bias=0
Model  | PC target | First-stage PC | First-stage time | Fine-tune PC | Fine-tune time
VGG19  | 38        | 5              | 30s              |              |
T5base | 111       | 111            | 0.5h             |              |
T53B   | 432       | 397            | 0.74h            | 432          | 0.2h
T511B  | 432       | 386            | 1h               | 432          | 0.45h
We show the convergence of operator partitioning parallelism exploration on T5base in Figure 16. T5base has 314 dimensions to decide in total, and 111 of them need to be partitioned according to the ground truth. With the help of linkage groups, DQN quickly learns to avoid conflicting decisions. It reaches peak propagation progress and becomes more stable with higher scores as training proceeds.
Table 5 shows the search performance on all benchmark models. We focus on the time cost to partition all variables that are required to be partitioned. We divide the search process into two stages: the first searches from scratch and may converge to a local minimum, while the second fine-tunes from that result. Some workloads such as VGG19 and T5base may not need a fine-tuning stage, mainly because their state spaces are relatively small, so the partition strategy can be found quickly in the first stage. However, workloads such as T53B and T511B, with more trainable variables, require a fine-tuning stage. Specifically, when the partition strategy stabilizes in the first stage, the program stops the current training phase, backtraces some variables marked as replicated according to the linkage groups, and starts a fine-tuning stage. We found that even for a complicated case like T511B, the expected strategy can be found within two hours.
VGG19. As shown in Table 4, the solution found by the OPP algorithm replicates VGG19's convolution layers while partitioning the fully connected layers. This makes sense: the last two FC layers hold 86% of the total parameters while their computation accounts for only 5% of the time. For such FC layers we prefer partitioning over replication to reduce gradient communication overhead in synchronous training. The desired distribution strategy emerges within 30s (Table 5), while our DQN score keeps oscillating slowly and does not converge quickly. One plausible explanation is that our reward function encourages splitting more variables, which is not the right bias for VGG19 as explained above. This implies that a more general reward function is needed for models with different computation and parameter distributions.
T5base. The final solution splits 111 variables, and the partitioning result matches Table 3. T5base takes 0.5 hours to find the expected solution without a fine-tuning stage.
T53B and T511B. The 3B and 11B models have the same layer and variable counts but differ in variable sizes and propagation time. The expected number of partitioned variables is 432 for both, and fine-tuning stages are required, taking 0.94 and 1.45 hours in total for 3B and 11B, respectively.
We infer that DQN search performs far better than enumeration. For example, T5base has 188 trainable variables with at most two dimensions each, leading to a 376-entry binary decision vector and 2^376 candidate solutions in total; T53B and T511B have even larger solution spaces. Finding the expected solution by enumeration within a limited time is impossible, while the DQN method reaches it within 2 hours.
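Quick arithmetic makes the enumeration gap concrete:

```python
# T5-base: 188 trainable variables, at most two partitionable dimensions
# each, gives a 376-entry binary decision vector.
n_bits = 188 * 2
space = 2 ** n_bits
print(f"T5-base enumeration space: 2^{n_bits} = {space:.2e} candidates")
```

Even evaluating a billion candidates per second, exhausting a space of this size would take far longer than the age of the universe, whereas the guided DQN search only ever visits a vanishing fraction of it.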
5.2.2. Auto Data Parallelism
We first filter out all trainable variables and constant tensors from the input list in the HLO IR to find the candidate tensors that could be training data.
T53B and T511B are not amenable to data parallelism due to memory constraints: T53B needs at least two devices to load-balance its variables, and T511B consumes even more. We therefore report results for T5base in this part. Table 6 shows the results of auto data parallelism; all tensor names in the table can be found in the HLO text file.
VGG19. There are only 4 candidate tensors to be partitioned for VGG19, as shown in Table 6. Our ADP algorithm converges steadily within 70s to partitioning the first dimension of two tensors (namely arg0.1 and arg0.2). Manual verification shows that these two tensors are exactly the two inputs of the model, the labels and features tensors, and their first dimensions are exactly the batch-size dimension in the traditional sense.
T5base. In our observation, this procedure finishes within half an hour. The search space is much smaller than in the operator partitioning problem: there are 10 candidates with at most 4 dimensions each. The DQN found the exact ground truth within 0.27 hours, while enumeration would behave worse, affected not only by the relatively large solution space but also by the propagation time cost.
As shown in Table 6, 7 tensors need to be partitioned in total, and all of them partition the first dimension, which is consistent with our intuition: in machine learning training, inputs carry sequences and other batched data, and the batch dimension always comes first for each tensor.
Model  | Candidate count | Partition results | Time cost
T5base | 10              |                   | 0.27h
VGG19  | 4               |                   | 70s
5.2.3. Pipeline Parallelism Exploration by Online Training
We fix the micro-batch size to 16 in all experiments, then search for strategies on Configs A, B, and C. The number of stages to cut depends on the number of servers in each hardware configuration. We report the strategies produced by both online training and online inference.
In the online training experiments, we set the center solution to perform the device cuts on network boundaries and set the radius to 3. Table 7 shows the online training experiments for pipeline parallelism search. To make the results human-readable, we report not only the pivots cut on the HLO but also the corresponding nearby layers. Since each instruction in an HLO produces a new tensor named with a prefix, we display that tensor name to indicate our HLO pivot. The device cuts are displayed as an array containing the device cutting indices. To cut on network boundaries in a hierarchical topology such as that of Table 2, each index should be a multiple of 8, because there are 8 cards per server.
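The multiple-of-8 constraint on device cuts can be checked mechanically; the helper below is a hypothetical illustration, not the paper's code:

```python
def is_network_boundary_cut(device_cuts, gpus_per_server=8):
    """Check that every device cut falls on a server (network) boundary,
    i.e. is a multiple of the per-server GPU count."""
    return all(c % gpus_per_server == 0 for c in device_cuts)

# A 3-stage plan on Config B (3 servers x 8 V100s): cut at devices 8 and 16.
assert is_network_boundary_cut([8, 16])
assert not is_network_boundary_cut([8, 12])  # 12 falls inside a server
```

Restricting cuts to server boundaries keeps the slow 25 Gbps inter-server links on stage boundaries, where only activations cross, and the fast NVLink inside each stage.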
We stress that the time cost of the DQN method is far better than enumeration, especially as the stage number increases. This is mainly because each HLO contains at least thousands of instructions; even after filtering out unlikely pivots to obtain a more concise candidate set, the search space remains large and costly to enumerate.
We take the convergence of pipeline parallelism exploration by online training on T5base as an example of the DQN training procedure. In Figure 17, the total score is smoothed with a moving average to show its trend. The figure shows that the loss drops very quickly at the beginning and that the total score rises overall, although the jitter is large.
Model  | Config | Pivots on HLO | Corresponding Layer Nearby | Device cuts | Time cost
Bert48 |        |               |                            |             |
T5base |        |               |                            |             |
T53B   |        |               |                            |             |
T511B  |        |               |                            |             |
Bert48 and T53B. The two models behave very similarly in the results. All strategies on Configs A, B, and C confirm that the cuts should fall on network boundaries, consistent with our expectation. Moreover, the pivots mapped to their corresponding layers yield nearly uniform stages, so the computation per stage is balanced. Finding them all takes no more than 5 minutes.
T5base. The strategies on Configs A and B are similar to those of Bert48 and T53B, and convergence takes no more than 4 minutes. The strategy on Config C differs: the last cut happens at the 22nd device, an NVLink boundary. This is due to the constraint that each stage must contain at least one trainable variable; T5base is too small to cut into 4 stages, so the last cut cannot happen beyond the 22nd index.
T511B. This model is so large that cutting it into fewer than 4 stages causes OOM. Therefore, the DQN cannot find even one feasible strategy on Configs A and B. For Config C, the result matches our expectation of cutting on the network boundaries of the device topology. The time cost is of the same order of magnitude as for the other models.
5.2.4. Pipeline Parallelism Exploration by Online Inference
Here we also present the results of our online inference approach. We trained our network on episodes of environments constructed from random numbers drawn from uniform and normal distributions. The model is able to output the best hybrid parallelism solution for NLP-family models such as BERT and Transformer11B. For the CNN family, we need to fine-tune the model with the corresponding distribution for additional episodes before it can correctly infer the best pipeline partitioning. The detailed parallelism plan is presented in Table 8.
Model  | Config | Partition Boundary | Corresponding Layer Nearby | Device cuts
BERT48 |        |                    |                            |
T5base |        |                    |                            |
T511B  |        |                    |                            |
6. Related Works
Large DNN models are increasingly computationally intensive and consume considerable device memory. It is common practice to parallelize training across multiple GPUs (Pal et al., 2019; Jia et al., 2018a). Data parallelism, operator partitioning parallelism, and pipeline parallelism are common approaches for distributed training of DNN models.
Auto Data Parallelism. Some high-level frameworks aim to reduce the user's burden by automatically parallelizing deep models with data parallelism (Cheng et al., 2017).
Operator Partitioning Parallelism. For NLP models with attention blocks, some heuristic operator partitioning approaches(Shoeybi et al., 2019; Shazeer et al., 2018) have already been proposed in recent years. For some convolutional networks like VGG19 and AlexNet, it is a common practice to partition the last linear layers(Krizhevsky, 2014; Jia et al., 2018a).
Some prior works and studies(Jia et al., 2018a, b) focus on finding optimal distribution strategies over DNN layers.
Pipeline Parallelism. Several approaches (Harlap et al., 2018; Zhan and Zhang, 2019; Huang et al., 2019; Geng et al., 2019a; Yang et al., 2019) have been proposed to train DNNs by pipelining the model. GPipe (Huang et al., 2019) explores a synchronous pipeline approach to train large models, while PipeDream (Harlap et al., 2018) explores a hybrid of data and pipeline parallelism for asynchronous training. RL approaches have also been proposed to find optimal placement strategies for a given DNN (Goldie and Mirhoseini, 2020).
Rainbow DQN. Reinforcement learning (RL) is a general framework in which agents learn to perform actions in an environment so as to maximize a reward. DQN (Mnih et al., 2015) combines Q-learning with deep neural networks to make RL work in complex, high-dimensional environments such as video games or robotics. Double DQN (Van Hasselt et al., 2016), Dueling DQN (Wang et al., 2016), Noisy DQN (Fortunato et al., 2017), and DQN with Prioritized Experience Replay (Schaul et al., 2015) are four important extensions, each addressing a different aspect of the agent. Rainbow DQN (Hessel et al., 2018) combines these improvements and is a state-of-the-art off-policy deep reinforcement learning algorithm.
7. Conclusion
7.1. Summary
We introduce AutoMAP, a framework that explores distribution strategies based on model architectures. It works on HLO IR and automatically discovers fast parallelization strategies with an optimized DQN algorithm. Data parallelism, operator partitioning parallelism, and pipeline parallelism are all included in the exploration space. We leverage DQN with task-specific pruning strategies to explore the search space efficiently. AutoMAP greatly reduces the user's burden in selecting and implementing distribution strategies. Our experiments show that AutoMAP can find the optimal solution within two hours while achieving better throughput on several NLP and convolutional models.
7.2. Future Work
The combination of HLO IR and the DQN algorithm shows convincing convergence results and performance. Several interesting directions remain. First, replacing discrete DQN states with continuous ones for the operator partitioning task could improve interpretation and convergence. Second, AutoMAP currently produces only a single parallelization strategy (i.e., DP, PP, or operator partitioning), which may result in suboptimal runtime performance in large-scale distributed training; in the future we will support exploring hybrids of these three strategies automatically. AutoMAP is open-source and will be made available to the public.
References
 AIandcompute. Note: https://openai.com/blog/aiandcompute/ Cited by: §1.
 Trax — deep learning with clear code and speed. Note: https://github.com/google/trax Cited by: §1, §1.
 Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
 JAX: composable transformations of Python+NumPy programs External Links: Link Cited by: §1, §1.
 Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §3.1.
 Tensorflow estimators: managing simplicity vs. flexibility in highlevel machine learning frameworks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1763–1771. Cited by: §6.
 BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Table 1.
 Channel and filter parallelism for large-scale cnn training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–20. Cited by: §1.
 DAPPLE: a pipelined data parallel approach for training large models. arXiv preprint arXiv:2007.01045. Cited by: §2.
 Noisy networks for exploration. arXiv preprint arXiv:1706.10295. Cited by: §2, §6.
 Elasticpipe: an efficient and dynamic modelparallel solution to dnn training. In Proceedings of the 10th Workshop on Scientific Cloud Computing, pp. 5–9. Cited by: §6.
 Horizontal or vertical? a hybrid approach to large-scale distributed machine learning. In Proceedings of the 10th Workshop on Scientific Cloud Computing, pp. 1–4. Cited by: §1.
 Placement optimization with deep reinforcement learning. In Proceedings of the 2020 International Symposium on Physical Design, pp. 3–7. Cited by: §6.
 Pipedream: fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377. Cited by: §1, §2, §6.
 Rainbow: combining improvements in deep reinforcement learning. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §2, §2, §4.1, §6.
 Gpipe: efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pp. 103–112. Cited by: §1, §2, §6.

 Exploring the hidden dimension in accelerating convolutional neural networks. Cited by: §1, §6, §6, §6.
 Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358. Cited by: §1, §6.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
 One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997. Cited by: §5.2.1, §6.
 GShard: scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668. Cited by: §1.
 Device placement optimization with reinforcement learning. arXiv preprint arXiv:1706.04972. Cited by: §1.
 Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §2.
 Humanlevel control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §6.
 Advanced compiler design implementation. Morgan kaufmann. Cited by: §1.
 PipeDream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1–15. Cited by: §1, §2.
 [27] (2019) NCCL. Note: https://developer.nvidia.com/nccl Cited by: §3.4.
 [28] (2019) NVDIA dgx1. Note: https://www.nvidia.com/enus/datacenter/dgx1/ Cited by: §3.4.
 Optimizing multi-gpu parallelization strategies for deep learning training. IEEE Micro 39 (5), pp. 91–101. Cited by: §6.
 Automatic differentiation in pytorch. Cited by: §1.

 Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1, Table 1.
 ZeRO: memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054. Cited by: §1.
 Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §5.1, §6.
 Meshtensorflow: deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414–10423. Cited by: §1, §5.2.1, §6.
 Megatron-LM: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1, §5.2.1, §6.
 Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 1.
 Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §5.1, §6.
 Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pp. 1995–2003. Cited by: §2, §6.
 [39] (2019) XLA: optimizing compiler for machine learning——operation semantics. Note: https://www.tensorflow.org/xla/operation_semantics Cited by: §1.
 PipeMare: asynchronous pipeline parallel dnn training. arXiv preprint arXiv:1910.05124. Cited by: §6.

 Pipetorch: pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking. In 2019 Seventh International Conference on Advanced Cloud and Big Data (CBD), pp. 55–60. Cited by: §6.