In recent years, industry has seen a boom of machine learning applications across a diversified range of scenarios, including images, speech/audio, NLP, CTR prediction, and search and recommender systems built on commodity graphs at the scale of billions or even trillions. Such workloads are generally regular in computation, and benefit greatly from modern high performance accelerators like GPUs or TPUs. In addition, many such models are built on popular AI frameworks like Tensorflow (Abadi et al., 2015), Caffe (Jia et al., 2014), Torch (2015), CNTK (Yu et al., 2014), or MxNet (Chen et al., 2015). The challenge is how to optimize such workloads to achieve as much performance as possible on modern hardware.
There are roughly two categories of computations in AI workloads. One is served by optimized vendor libraries, in particular MatMuls and 2D/3D convolutions and their variants. The other category includes foundational tensor operators, elementwise computations, memory layout transformations, and other workload specific ops. To understand the relative importance of MatMul/Convolution computations across models, we collected data from 53,470 models on PAI (Platform for Artificial Intelligence) at Alibaba. Depending on application domains and hardware platforms, the proportion of MatMul/Conv computation ranges from 40% to 70%. Therefore, computations other than MatMul/Conv deserve serious investigation to achieve decent performance. This work focuses on the computation efficiency of this category on GPU platforms.
A well known challenge of Tensorflow computations is op granularity. Figure 1 shows the accumulated percentile distribution of memory footprints of the six most frequent computing ops across those 53,470 models collected from PAI. Note that the reduce (orange) line denotes the collective statistics of four ops, namely mean, sum, min, and max, while the other lines represent individual ops. The x-axis shows the memory IO footprint size (in number of floats) measured on a logarithmic scale (base 2), the bigger the better. As can be seen, while the memory footprints of MatMul/Conv2D are generally larger than elementwise/reduce ones, most op instances have small memory footprints. Therefore, optimizations are vital to fully leverage computing resources and memory bandwidth.
One way to solve this problem is to replace those fine grained ops with a pre-implemented coarse grained version (PAI, 2018). However, this approach is not scalable and is incapable of adapting to diversified and fast evolving workload characteristics. The more principled solution is compiler driven op fusion, tuning and code generation. One possibility is to separate the schedule specification from the implementation. By offloading the optimization plan specification to the user, the runtime focuses on the details of optimization realization. Prior explorations have shown promising results in some application domains, such as Halide on image processing (Kelley et al., 2018), and Loopy (Klöckner, 2014) and TVM (Chen et al., 2018) on array and machine learning computations. To make this layered approach work in practice, we believe performance insight into the target workload domain is essential. With such insights, and given a properly defined workload model and target platform, it is possible to go a step further, by allowing the user to specify the computation only, rather than implementation schedules, as is the case in Tensorflow.
In Tensorflow, the fast evolving XLA framework provides a sound foundation to explore this problem further. XLA’s approach of partitioning a Tensorflow graph into compilable clusters, and transforming them into concise and compact HloModules, opens up broad possibilities to fuse, transform, and optimize computation kernels for GPUs.
XLA's current op fusion algorithms, GpuInstructionFusion and MultiOutputFusion, rely on a set of static ShouldFuse rules to produce supposedly profitable larger kernels. While these rules discern fusion opportunities in many cases, they are often compromised by exceptions, such as expensive elementwise ops, column reductions, batched matmuls, or memory layout transposes. In addition, at the code generation phase, XLA requires all ops in the fused computation to fit into a single parallel loop emitter, and leverages the elemental_ir_emitter to compose computations of previous ops into the root loop body. In this approach, all ops in the fused computation must in essence share the same implementation schedule. We call this approach thread composition, as shown in Figure 2(a).
This work proposes FusionStitching, a deep fusion and code generation system. One key feature of our system is block composition at the codegen phase, as shown in Figure 2(b). To support this, we propose another ir_emitter, IrEmitterStitched, to stitch multiple computations together. In theory, we allow each computation to have its own parallel loop emitter, and use on-chip shared memory (scratchpad) as the intermediary between producing and consuming computations. In practice, however, due to the scarcity of shared memory space, we leverage thread composition together with block composition.
This opens up further optimization tradeoffs and fusion opportunities. At the op fusion phase, thanks to the codegen capability enabled by block composition, we can relax the constraints imposed by XLA, and thereby enable much larger kernel granularity. Besides, to cope with the much bigger implementation space of larger kernels, a systematic approach is desirable to specify, optimize and tune kernel optimizations. Specifically, we make the following contributions:
We propose a novel deep fusion algorithm based on a layered node structure along the span (critical path) of the graph. Using critical path reduction as the driving heuristic, we consider not only producer/consumer fusion opportunities, but also fine grained ops that occur in the same layer, in order to enlarge kernel granularity and reduce GPU launch overheads.
We propose a comprehensive mechanism to specify implementation schedule space, resolve schedule constraints, tune the search space and generate the final implementation plan, represented as a set of launch/schedule parameters.
We propose IrEmitterStitched, another ir_emitter to support block level computation composition. The core part of this ir_emitter is a shared memory planning algorithm which orchestrates the code generation of individual ops within the fused computation.
This paper is organized as follows. Section 2 discusses the motivation and gives an overview of FusionStitching. Section 3 presents our deep fusion algorithm. Section 4 discusses schedule specification, tuning and the determination of optimized implementation plans. Section 5 presents details of code generation and the shared memory management algorithm. Section 6 shows experimental results. Section 7 discusses related work and Section 8 concludes.
2 Motivation and System Overview
2.1 The Motivating Example
We consider the fusion and code generation problem of general sub-graphs that consist of four types of ops: (1) Elementwise; (2) Shape modulation ops, such as Reshape, Bitcast, Transpose, etc; (3) Reduction; (4) BatchMatMul. We include BatchMatMul because in some of our critical production models, such ops usually involve workload specific shapes, and cuBLAS kernels do not deliver satisfactory performance.
Figure 3 shows a motivating example. We arrange ops in a layered structure (denoted by circled numbers), with the top layer being input ops and the bottom layer being the output (the BatchMatMul (Dot.1) in this case). In complex graphs, such a layered structure proves to be very useful for fusion decision making. Black arrows (from left to right) show data dependences. Next to each op, annotating text shows important schedule/code generation attributes. The (SplitDim, Sword) pairs are schedule parameters used for tuning the implementation space (Section 3). One use of this pair of parameters is to decide Blocks, the number of thread blocks (CTAs) used for computing this op. The Shared attributes associated with reduce, exponential, and divide relate to on-chip shared memory management. Together with the Size attribute, ALLOC or SHARE mean that we need to allocate space for the current op, or reuse a buffer allocated for a previous op, respectively.
Our system provides the capability to fuse and generate optimized code for the entire graph. Whether it is beneficial to fuse the dot depends on the workload. In some of our in-house workloads, the batched dot shape is too marginal to benefit from calling cuBLAS, and this pattern happens to be the core part of an inner loop body. In this case, fusing everything proves to be very useful. In general, we leave the decision of whether to fuse BatchMatMul to the user.
2.2 FusionStitching: The System Overview
The system overview is shown in Figure 4. On the high level, the system takes a HloModule as the input, passes three stages of processing (op fusion, schedule planning and code generation), and finally generates the LLVM IR.
In the computation fusion stage, we first perform a Work/Span (critical path) analysis (Guy, 1996), and assign a layer number (as shown in Figure 3) to each op according to its depth in the span. Then, starting from the root (such as Dot.1 in Figure 3), we fuse ops iteratively across different span layers, as long as the fusion decision passes the schedule consistency check. The fusion process iterates until no fusion opportunity remains. The transformed HloModule is then passed on to schedule planning.
The implementation space for a large fused computation can be huge. Schedule planning searches a domain driven, well defined schedule space for optimized implementations. It takes a fusion plan as input, and generates optimized schedule parameters, shared memory usage plans and launch dimensions for the following code generation phase. It also provides performance heuristics regarding the current fusion plan as feedback to the ScheduleConsistencyChecker. The scheduling process involves four important submodules: schedule generation, the performance library, tuning, and shared memory planning, which are presented in detail in Sections 4 and 5.
Based on the schedule parameters, the code generation pass finally allocates shared memory, sets the kernel launch dimensions, and emits LLVM IR code for each op according to its tuned parameters. Note that the use of shared memory as an intermediary among ops is important to achieve decent performance. One example is shown in Figure 3. Without shared memory, expensive ops like exponential and divide can only be composed through thread composition. In that case, the computation of these ops is nested into the inner loop of the root dot op, causing notable performance degradation due to duplicated computation. With shared memory, threads within a thread block can cooperate differently for different ops, thus opening up more optimization opportunities (Section 5).
3 Deep Fusion
3.1 The Work/Span Analysis
Work/Span analysis is a handy tool to analyze the parallel work and the critical path of computation graphs (Guy, 1996). In our implementation, we assign a number, the span, to each instruction of the HloModule. First, the root instruction has zero span. For any other instruction, its span equals the maximum span of its users plus one. Work/Span analysis is also useful to profile entire Tensorflow graphs. However, standard Work/Span analysis works only when the graph is free of dependency loops, and it is not uncommon for practical Tensorflow graphs to include large, possibly nested while loops. In this case, we perform a preprocessing step to partition all nodes into multiple subgraphs, each belonging to a separate frame context. We then perform Work/Span analysis for each frame context independently.
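To make the traversal concrete, the span assignment can be sketched as follows. The graph representation (a `users` map from each instruction to the instructions that consume its output) and all names are illustrative, not the actual HloModule API:

```python
def assign_spans(users):
    """Assign a span to every instruction: an instruction with no users
    (the root) gets span 0; any other instruction gets one plus the
    maximum span of its users. `users` maps each instruction to the
    list of instructions consuming its output (assumed acyclic)."""
    spans = {}

    def span(inst):
        if inst not in spans:
            consumers = users.get(inst, [])
            # Root-level instructions have no consumers.
            spans[inst] = 0 if not consumers else 1 + max(span(u) for u in consumers)
        return spans[inst]

    for inst in users:
        span(inst)
    return spans
```

The maximum value in the returned map is then the critical path length, and instructions sharing a span form one layer.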
After running this analysis, the maximum span assigned is the length of the critical path. Instructions with the same span are on the same layer (as shown in Figure 3), and there are no data dependences among them. Next we present our fusion algorithm, which leverages this information to enlarge the kernel granularity by effectively reducing the span of the computation.
3.2 The Fusion Algorithm
Through Work/Span analysis, we partition all instructions within a module into layers (as shown in Figure 3), where instructions in each layer have the same span. Today most AI models rely on library calls to perform MatMul/Conv. Since we do not fuse across library calls, we are interested in the computation subgraphs that lie between any two consecutive library call layers (LC-layers). The basic intuition of our fusion algorithm is to fuse as many instructions as possible, subject to various fusion constraints, within the subgraph between two LC-layers.
Starting from a given LC-layer, up to the next LC-layer (roof), for each layer (denoted as the root layer in discussions below), repeat the following two steps. First, we perform an intra layer ElementwiseFusion at the root layer, resulting in a set of fused computations. Second, for each fusion instruction (fusion_root) in the root layer, we use Algorithm 1 to perform sub-graph fusion up to the next roof.
ElementwiseFusion targets intra layer nodes without producer/consumer relationships. The primary target is small weight accumulation layers, which occur frequently in training graphs. For a large number of such fine grained kernels, fusing them together can reduce substantial launch overheads. The exact number of fused computations to generate depends on two factors. One is schedule compatibility. In practice, elementwise instructions within a layer naturally fall into a few groups according to output shapes. We discuss schedule planning further in the next section. The other factor is the fused memory footprint. We use a tunable threshold parameter to control the fusion granularity, in order to avoid extra large elementwise computations with too many outputs.
One implementation of subgraph fusion, starting from fusion_root, up to the next roof, is shown in Algorithm 1. The map hlo_span is the result of the Work/Span analysis, recording the span of every instruction. The algorithm traverses instructions layerwise, starting from the next layer, up to the roof. During this traversal, instructions are either fused (put in the fused set) or given up (put in the giveup set).
The procedure SchdConsistent decides whether or not to fuse an instruction hlo with the fusion_root. First, it checks whether hlo has a user in the giveup set. If so, fusion stops, in order to avoid potential cyclic dependences. Second, it checks whether hlo has a user in the fused set. If not, fusion stops, because we consider producer/consumer fusion only here, and leave the other case to ElementwiseFusion, as discussed above. Finally, it checks whether it is possible to resolve an optimized schedule for the fused computation, and stops fusion if not. We discuss schedule planning and optimization in more detail in Section 4.
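The three checks can be sketched as follows; the set-based encoding and the `schedule_resolvable` callback are illustrative stand-ins, not the actual implementation:

```python
def schd_consistent(hlo, users, fused, giveup, schedule_resolvable):
    """Decide whether instruction `hlo` may be fused with the current
    fusion root. `users` maps each instruction to its consumers;
    `fused`/`giveup` are the sets built up during the layerwise
    traversal; `schedule_resolvable` abstracts the schedule check."""
    hlo_users = users.get(hlo, [])
    # (1) A user was already given up: fusing `hlo` could close a cycle.
    if any(u in giveup for u in hlo_users):
        return False
    # (2) No user is in the fused set: not a producer/consumer pair;
    #     such cases are left to the intra-layer ElementwiseFusion.
    if not any(u in fused for u in hlo_users):
        return False
    # (3) The enlarged computation must still admit a valid schedule.
    return schedule_resolvable(hlo)
```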
4 Schedule Planning
4.1 Schedule Specification
In a fused computation, each instruction has an output shape, which defines the total work space. The implementation space for each instruction can be huge. It is prohibitively expensive to exhaust the entire space in order to get the most optimized kernel. Before we discuss our tradeoff on this issue, let’s reiterate two major objectives of op fusion: (1) reduce the memory footprint of the fused computation; (2) reduce the number of kernel launches due to the fine grained nature of many Tensorflow ops. In practice, most fused computations in our workloads are memory intensive, elementwise computations. Thus our design rationale is to facilitate the composition of numerous instructions within a kernel in order to fully take advantage of hardware resources, rather than pursuing extreme performance of individual ops.
For each instruction, we define three parameters on the output shape (the work space) to fully specify an implementation schedule: SplitDim, Sword, and SchedType. The idea is to split the work space into multiple data chunks, where each thread block (CTA) works on one chunk. Here, SplitDim denotes the dimension along which we split the work space, and Sword denotes how we partition dimension SplitDim. SchedType can be either Row or Column.
As an example, Figure 5 shows a Row schedule (left) and a Column schedule (right) for Reduce, each reducing an input tensor to a lower-rank output (the gray cells in the figure mark the reduced dimensions in each case). Blocks denotes the number of GPU thread blocks, i.e., the number of data chunks after partitioning the entire work space. In a Row schedule, the dims to the left of (more significant than) SplitDim contribute to Blocks. The C code excerpt shows how Blocks is calculated from SplitDim, Sword, and the shape dims.
Let r be the number of dims of a tensor shape; SplitDim must be an integer in the range [0, r). Let d be the size of dimension SplitDim; Sword must be a divisor of d. SchedType can be either Row or Column. Given a hlo instruction, the Cartesian product of the legal value sets of SplitDim, Sword, and SchedType defines the entire schedule space.
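Under these definitions, enumerating the schedule space of one instruction is straightforward. The following sketch, with illustrative names, generates all legal (SplitDim, Sword, SchedType) triples for a given output shape:

```python
from itertools import product

def schedule_space(shape):
    """Enumerate (split_dim, sword, sched_type) triples for an output
    shape: split_dim indexes a dimension, sword is a divisor of that
    dimension's size, and sched_type is Row or Column. A sketch of the
    specification, not the production enumerator."""
    schedules = []
    for split_dim, size in enumerate(shape):
        divisors = [w for w in range(1, size + 1) if size % w == 0]
        for sword, sched_type in product(divisors, ("Row", "Column")):
            schedules.append((split_dim, sword, sched_type))
    return schedules
```

For a shape like (4, 6), dimension 0 admits 3 Sword values (1, 2, 4) and dimension 1 admits 4 (1, 2, 3, 6), so the space has (3 + 4) x 2 = 14 candidate schedules, consistent with the observation that per-op spaces stay small.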
The size of the schedule space of a single op depends on the number of dims of its output shape, but is usually small in practice. This is important for compilation speed. Our schedule specification has a relatively small search space; it is not designed to exhaust all implementation possibilities. Yet, together with tuning, it enables the most important kernel optimizations we want for GPUs, while at the same time allowing simple and concise code generation, thanks to the computational regularity of Tensorflow ops.
4.2 Schedule Constraints and Propagation
The output shape of the fused computation is the same as that of its root (output) instruction. Given a valid schedule of the root instruction, we must decide whether it is satisfiable by all other instructions of the computation. We use an algorithm conceptually similar to Work/Span analysis to resolve schedule constraints for all other instructions. Note that for each instruction, the valid schedule is defined on its output shape. If the schedule constraints are satisfiable, we back-propagate the schedule to the instruction's input shape(s), which correspond to the output shape(s) of its operand(s), and iterate. Otherwise, the schedule initiated from the root instruction is not satisfiable for the computation.
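Conceptually, constraint resolution is a worklist traversal from the root toward the operands; the following sketch illustrates it, with the per-op propagation rules of Table 1 abstracted behind a callback (all names are illustrative):

```python
def propagate_schedule(root, operands, propagate_rule, root_schedule):
    """Back-propagate a candidate schedule from the root through the
    fused computation. `operands` maps each instruction to its operand
    instructions; `propagate_rule(inst, sched)` returns the operand-side
    schedule, or None if the op cannot satisfy `sched`. Returns the map
    of resolved per-instruction schedules, or None if unsatisfiable."""
    resolved = {root: root_schedule}
    worklist = [root]
    while worklist:
        inst = worklist.pop()
        for op in operands.get(inst, []):
            sched = propagate_rule(inst, resolved[inst])
            if sched is None:
                return None          # this op rejects the schedule
            if op in resolved and resolved[op] != sched:
                return None          # two users demand different schedules
            if op not in resolved:
                resolved[op] = sched
                worklist.append(op)
    return resolved
```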
Figure 5 shows an example of schedule constraints. For Reduce, we require all reduction dims to reside in the same thread block, in order to balance codegen simplicity against kernel efficiency. Consequently, for each SchedType only certain SplitDim values are meaningful. Other constraints include the divisibility requirement on Sword.
Table 1: Schedule constraint propagation rules (excerpt). Reshape and Broadcast transform SplitDim and Sword before propagating the schedule.
Table 1 summarizes the schedule constraint propagation rules for different ops. For Elementwise, the schedule, whether Row or Column, is propagated back to the operand(s) unchanged. For Reduce, a schedule is propagated only when SplitDim is compatible with the reduce dims; when necessary, SplitDim and Sword are adjusted according to the reduce dims. The case for Transpose is similar. For BatchDot, a schedule is propagated only when SplitDim is a batch dim; otherwise the schedule is not satisfiable.
Reshape and Broadcast modulate shapes. Therefore, we first transform the output SplitDim and Sword into the input SplitDim and Sword, according to the schedule specification shown in Figure 5. We then propagate the schedule, whether Row or Column.
4.3 Schedule Tuning
There is always at least one valid schedule for any fused computation: the trivial one in which the entire work space forms a single chunk and Blocks is 1, so that one thread block executes all instructions. In practice, however, this almost always leads to under-utilization of GPU resources. Together with the performance library (discussed below), schedule tuning iterates over all candidate schedules of the root to look for the most efficient one. We use this optimized schedule to direct code generation.
If the fused computation has a single root, we iterate over all its candidate schedules. For each schedule, we test whether it is satisfiable. If so, we look up the performance library w.r.t. the schedule, and sum up the kernel execution times of all ops of the computation. The schedule with the best performance is chosen for code generation.
If there are multiple roots, we use a two-stage approach to speed up exploration of the search space. In the first stage, we iterate over all roots. For each root, we compute two sets: the set of valid Blocks values (shown in Figure 5), and the set of valid schedules corresponding to those Blocks values. Once the sets for all roots are available, we intersect them to resolve the candidate Blocks values that are satisfiable by all roots. This reduces the performance tuning space that needs to be explored next.
The second stage starts from the resultant Blocks set agreed upon by all roots. We iterate over all schedules corresponding to this set. For each schedule, we accumulate the kernel execution time of each root, and sum them up to obtain the overall performance metric of the computation. The schedule with the best performance is our chosen target.
In the implementation we perform two additional optimizations. First, when evaluating the performance of individual instructions, we sometimes ignore computationally trivial ops, such as Reshape, Broadcast, small Transpose ops, etc. Such ops can be inlined via thread composition (similar to ElementalIrEmitter in XLA) with negligible performance loss. Yet if we keep them, their strict modulation of shapes sometimes rejects highly optimized schedules. Bypassing them can make optimized schedules satisfiable.
The second optimization is further pruning of the search space if there are multiple roots. During the second stage of schedule evaluation, we always keep the best performing schedule achieved so far. If, during the evaluation process of some schedule, the execution time accumulated has already exceeded that of the total latency of the best schedule, we simply skip the process and continue to explore the next schedule.
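The tuning loop with this early-exit pruning can be sketched as follows; `op_cost` stands in for the performance library lookup, and all names are illustrative:

```python
def tune(candidate_schedules, satisfiable, op_cost, ops):
    """Pick the schedule minimizing the summed per-op kernel time.
    A candidate is abandoned as soon as its partial sum already
    exceeds the best total found so far (the pruning described in
    the text). `op_cost(op, sched)` models a performance-library
    lookup; this is a sketch, not the production tuner."""
    best, best_time = None, float("inf")
    for sched in candidate_schedules:
        if not satisfiable(sched):
            continue
        total = 0.0
        for op in ops:
            total += op_cost(op, sched)
            if total >= best_time:   # early exit: this candidate cannot win
                break
        else:
            best, best_time = sched, total
    return best, best_time
```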
4.4 The Performance Library
The performance library is a key-value store containing kernel performance data of various types of instructions under different implementation schedules. Common features included in a key are opcode, shape, split_dim, sword, sched_type and thread block size. The thread block size must be a multiple of the GPU warp size (32). There are also op specific features. For instance, Reduce (or Transpose) has an additional feature, reduce_warps (or trans_warps), meaning how many GPU warps in the thread block are used to perform the reduction (or transpose) loop.
We keep the performance library in permanent storage for repeated use. At system initialization, the library is loaded into memory. During the tuning process, the library module takes schedule keys as lookup requests. If the key exists in the library, the result is returned immediately. Otherwise, the module constructs a CUDA C kernel from the key, then compiles and executes it on the GPU. We use the nvprof tool to collect the kernel execution time, and insert the new key-value pair into the library for future use.
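The lookup-or-measure behavior amounts to a memoizing key-value store; the following sketch illustrates the mechanism, with a benchmark callback standing in for the kernel construction, compilation and nvprof measurement (all names are illustrative):

```python
class PerfLibrary:
    """Key-value store of measured kernel times. Keys are tuples such
    as (opcode, shape, split_dim, sword, sched_type, block_size).
    On a miss, `benchmark(key)` is expected to build, run, and time
    the corresponding kernel; the result is cached for future use."""

    def __init__(self, benchmark):
        self.table = {}
        self.benchmark = benchmark

    def lookup(self, key):
        if key not in self.table:
            # Miss: measure once, then serve all future lookups from cache.
            self.table[key] = self.benchmark(key)
        return self.table[key]
```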
When a key misses the library, kernel generation and performance collection can be a costly operation during JIT compilation. This is true in the initial warmup phase; later on, we observe a high degree of data reuse in our workloads. In addition, as discussed before, most kernels take only several to tens of microseconds to execute. Nevertheless, it should be possible to build a learning model that predicts a performance metric from the features in the key and returns the predicted value to the tuning process immediately, thus shortening the critical path by offloading kernel generation, compilation and execution asynchronously. We leave this as future work.
The goal of fusion is to pack the computations of all instructions into a single kernel. In schedule tuning, we use the accumulated performance of individual ops to measure the performance of the kernel of the entire fused computation. This approach is not meant to predict the exact execution time of the fused kernel, but it works well for reaching an optimized set of parameters to effectively direct code generation.
Based on the concise specification of the schedule space, effective schedule space exploration, and the performance library driven tuning mechanism, FusionStitching can efficiently evaluate a huge number of fusion possibilities, thus opening up many more opportunities for the subsequent code generation phase, as illustrated in Section 6.
5 Code Generation
5.1 Shared Memory Planning
The on-chip shared memory is essential to facilitate thread block composition of numerous compute expensive ops. This is important to achieve relatively large yet optimized kernels. To perform shared memory planning, we first identify candidate ops which may need shared memory, then prioritize shared memory usage for the most critical ops when space is insufficient, and finally facilitate space sharing among ops along the data flow.
5.1.1 Size Requirements Analysis
Size requirements analysis identifies all ops that may use shared memory. In the example computation shown in Figure 3, ops in green boxes have shared memory requirements. There are several cases to note.
One is direct allocation. For Reduce or BatchDot, if it is not the root instruction, we must allocate shared memory for its intermediate results, allowing consumer ops to use separate parallel loop emitters to generate code.
Other cases are related to expensive elementwise ops, such as Exp, Divide, Log, etc. In general, if such an instruction has multiple users, we may want to allocate shared memory to buffer its results in order to achieve as much computation reuse as possible. Note that this holds for inexpensive ops as well; it is a performance consideration. However, if the size requirements reach a limit, we give up shared memory usage in a proper order in these cases, recomputing those elementwise ops to ensure correctness.
For an expensive elementwise op, sometimes even if it has only one user, we must use shared memory in order to achieve acceptable performance. One example is shown in Figure 3. The Divide.1 is followed by Bitcast.1, which is then followed by a BatchMatMul (Dot.1). Due to high degree of data reuse in Dot.1, shared memory is important for performance here. To address this issue, we analyze data flow in this case in order to identify all such expensive ops.
5.1.2 Size Shrinking
Size shrinking is a technique applied when the size requirement of the fused computation exceeds the shared memory limit. One main reason this happens is when Blocks is small, so that each thread block needs to process a large chunk of data. The basic idea is to trade shared memory space for recomputation. To reduce size requirements, we start with inexpensive elementwise ops with multiple users, then expensive elementwise ops with multiple users, and finally expensive ops with transitive uses by BatchMatMul. Even following this order, there may still be multiple candidates to choose from. In this case, we prioritize the one closest to the root instruction in the span of the graph.
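The priority order can be sketched as a simple greedy pass. Here each candidate buffer is encoded as a (kind, span, size) tuple, with kind ranking the three categories above (0: inexpensive multi-user elementwise, 1: expensive multi-user elementwise, 2: expensive ops feeding BatchMatMul) and span breaking ties toward the root; the encoding is illustrative, not the actual data structures:

```python
def shrink(candidates, limit, total_size):
    """Give up shared-memory buffers, in priority order, until the total
    requirement fits under `limit`. Returns the dropped candidates
    (which will be recomputed instead) and the remaining total size.
    A sketch of the best-effort policy described in the text."""
    # Lower kind is dropped first; within a kind, smaller span
    # (closer to the root) is dropped first.
    order = sorted(candidates, key=lambda c: (c[0], c[1]))
    dropped = []
    for kind, span, size in order:
        if total_size <= limit:
            break
        total_size -= size
        dropped.append((kind, span, size))
    return dropped, total_size
```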
Size shrinking is a best effort approach to reduce shared memory usage. If, after shrinking and space sharing analysis (discussed below), there is still not enough space, a feedback signal is sent back to the ScheduleConsistencyChecker in the fusion module to trigger other fusion decisions. In practice, this happens only on large fused computations for which schedule planning fails to produce an optimized schedule. Thus this feedback provides an effective mechanism to control fusion granularity.
5.1.3 Space Sharing
Space sharing is an effective technique to reuse shared memory space. As shown in the example in Figure 3, Reduce.2 reuses the shared space allocated for Reduce.1, and Divide.1 reuses the shared space allocated for Exponential.1.
To facilitate sharing, we first build a dominance tree (Cooper et al.) starting from the root instruction. Then we perform another round of data flow analysis using the dominance tree to realize space sharing. In the case shown in Figure 3, the shared space allocated for Reduce.1 is free after Exponential.1, and can be reused by Reduce.2 because Reduce.2 dominates Reduce.1. Similarly, Divide.1 dominates, and reuses the buffer allocated for, Exponential.1.
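For intuition, the dominance relation on the use graph can be checked naively as follows (the real system builds a dominator tree rather than re-walking paths; the representation and names are illustrative):

```python
def dominates(b, a, users, root):
    """Naive dominance test on the use DAG: `b` dominates `a` iff every
    path from `a` to `root` (following user edges) passes through `b`.
    Exponential in the worst case; a sketch for intuition only."""
    def some_path_avoids(node):
        if node == b:
            return False          # this path goes through b
        if node == root:
            return True           # reached root without meeting b
        return any(some_path_avoids(u) for u in users[node])
    return not some_path_avoids(a)
```

Under this check, a later op may reuse a buffer only if it dominates the buffer's original owner, which guarantees the owner's value is no longer live on any remaining path.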
5.2 Code Generation
The schedule and shared memory planning set up the foundation for codegen. We build our work on the hlo visitor framework available in XLA. The GpuElementalIrEmitter in XLA implements thread composition of computations. Algorithm 2 sketches the basic idea of our block composition procedure, IrEmitterStitched.
IrEmitterStitched takes several inputs: the target instruction to emit code for; the outputs of schedule planning and shared memory planning, respectively; a flag telling whether the target is the root instruction; and a value map similar to the one in XLA, differing in its handling of shared memory. If the target is not the output instruction, is neither BatchMatMul nor Reduce, and does not use shared memory, we fall back to the ElementalIrEmitter in XLA; otherwise StitchedEmitter is called to emit code based on an optimized schedule. When this is done, we store computation results to shared memory if required, by calling EmitWriteSharedArray. If the target is an output of the computation, code is emitted to write results to global memory via EmitWriteOutputArray; otherwise, we insert an entry into the value map for the target, in order to support its further composition with other instructions. In the implementation, we encapsulate the codegen logic related to computation results, including shared/global memory handling, into an OutputManager object, as shown in Figure 4.
6 Experimental Evaluation
6.1 Experimental Setup
We implemented FusionStitching on Tensorflow 1.7. Experimental results are collected on a Pascal GPU with 3584 cores and 64KB of shared memory per SM. Table 2 summarizes our benchmarks, ranging from small and medium public models to large in-house applications in our production environments. LR, W2V, RNN and BiRNN are public models (aymericdamien), all with default configurations. Speech is an in-house speech application, training on voice samples collected from millions of consumer side portable audio systems.
Our translation benchmark is an in-house variant of neural machine translation based on the attention mechanism (Vaswani et al., 2017; Xiong et al., 2018). There are two use cases. One is offline translation of the descriptions of billions of commodities from one language to another; in this case, batch processing is available to maximize efficiency. The other use case is realtime, online communication between sellers and buyers; in this case, the batch size is small, and latency is critical. In both cases, every millisecond of performance improvement is of practical significance. There is a strong incentive to optimize as much as possible beyond MatMul/Conv.
Our evaluation baseline is the XLA implementation of fusion and code generation. It is important to note that XLA already does excellent work on common elementwise and producer/consumer patterns. With FusionStitching, we are interested in how much additional improvement is possible for these workloads.
6.2 Fusion Potential Analysis
The optimization targets of FusionStitching are subgraphs of ops other than library calls; in our case, only cuDNN and cuBLAS are relevant. Figure 6 shows the execution breakdown between MatMul/Conv and other ops for all benchmarks. As can be seen, the potentially fusable component (the top portion) takes 20% to 50%. Large, dense MatMul/Conv ops are friendly to GPUs, but computationally costly, so in practice we tend to use less expensive ops where accuracy permits. In addition, some MatMul/Conv ops have particular sizes for which the performance gain from calling vendor libraries is marginal. Deep fusion and efficient code generation are critical for performance in these cases.
6.3 The Fusion Ratio
One important goal of fusion is to enlarge kernel granularity and thus reduce the number of GPU kernels launched. We measure the ratio between the number of kernels of FusionStitching and that of the baseline (excluding library call kernels). We use nvprof to collect detailed kernel information. The result is shown in Figure 7.
The fusion results depend on the workload. For most of them, the fusion ratio is less than , meaning FusionStitching can reduce the number of kernels to less than half that of the baseline. W2V has the highest fusion ratio (), because its core computation pattern is already friendly to XLA, leaving limited room for further fusion. FusionStitching performs best on Speech (), which contains complex interaction patterns among reduce, transpose, concat, and elementwise ops; FusionStitching handles them gracefully.
6.4 Performance Speedup
The ultimate goal of fusion and code generation is performance. Figure 8 shows results for all workloads. We report three numbers for each benchmark. The (left) measures the performance improvement of the fusable portion only (in contrast to the MatMul/Conv portion, as shown on the top of Figure 6). We use to denote the execution time ratio of the fusable portion. The end-to-end (E2E) speedup (right) measures the performance speedup of the whole network. The predicted E2E (middle) estimates the E2E speedup using the following formula:
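The formula itself is missing from this excerpt. Based on the surrounding description, with $\alpha$ denoting the execution-time ratio of the fusable portion and $S$ the measured speedup on that portion (both symbol names are our assumption), an Amdahl's-law style estimate consistent with the text would be:

```latex
\mathrm{Speedup}_{\mathrm{E2E}} \approx \frac{1}{(1-\alpha) + \alpha / S}
```

When $\alpha$ is large and $S$ is dominated by reduced launch overheads and memory traffic, this estimate tracks the reciprocal of the fusion ratio, which is the empirical behavior the text reports.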
The ranges from (W2V) to (Speech). The average speedup (geometric mean) is . This speedup roughly corresponds to the reciprocal of the fusion ratio. The reason is that, in most fusion cases in these workloads, ops are generally fine-grained and memory-intensive; fusing them together effectively reduces launch overheads and memory footprints. This motivates the above empirical formula for predicting E2E speedup. As can be seen, the predicted speedups are close to the measured E2E numbers.
Besides (capability measurement), (potential measurement) also has a strong impact on E2E performance. The E2E speedup from FusionStitching varies by workload, ranging from to , with a geometric mean of .
6.5 Shared Memory Analysis
In FusionStitching, on-chip shared memory is essential for composing numerous ops with different parallel loop emitters. Table 3 summarizes shared memory usage. The Average column shows how much shared memory (in bytes) is allocated per kernel on average, and the Max column shows the maximum space (in bytes) allocated. We set an upper limit (currently 20KB) on the shared memory usage of a kernel; once the requested size exceeds this limit, the shrinking process is triggered. The #Shrink column shows how many kernels triggered the shrinking process. Finally, the last column shows, on average, the percentage of the kernel's allocated space that is shared by multiple ops.
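The budgeting logic described above can be sketched as follows; the function and variable names (`plan_shared_memory`, `requests`) are illustrative, not taken from FusionStitching, and the shrinking process is left as a placeholder:

```python
# Hypothetical sketch of shared-memory budgeting for a fused kernel:
# sum per-op requests and trigger the shrinking process when the total
# exceeds the 20KB limit stated in the text. Names are illustrative.

SHMEM_LIMIT = 20 * 1024  # bytes, the paper's current upper limit

def plan_shared_memory(requests, limit=SHMEM_LIMIT):
    """requests: per-op shared-memory sizes in bytes.
    Returns (total_allocated, shrink_triggered)."""
    total = sum(requests)
    shrink_triggered = total > limit
    if shrink_triggered:
        # Placeholder for the shrinking process, e.g. re-planning the
        # schedule with smaller tiles until the request fits the limit.
        total = limit
    return total, shrink_triggered

print(plan_shared_memory([8192, 8192, 2048]))  # fits: (18432, False)
print(plan_shared_memory([16384, 16384]))      # shrinks: (20480, True)
```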
Different workloads exhibit very different shared memory behaviors. LR, W2V, RNN and BiRNN have relatively simple producer/consumer patterns, and neither size shrinking nor sharing occurs. Speech has large shared memory requirements, in part due to its large computation granularity; in addition, shape-modulation ops (such as transpose) sometimes result in large thread block sizes during schedule planning, increasing shared memory requirements. While little () allocated space is shared in Speech, this number is for NMT, indicating a certain degree of reuse of computation results in the graph, as illustrated in Figure 3. The pattern in that figure is one of the computationally intensive subgraphs of NMT.
7 Related Work
GPU kernel fusion, inspired by classical loop optimizations Ding & Kennedy (2004); Kennedy & Allen (2002), is known to boost performance in other application domains. In the database domain, KernelWeaver Wu et al. (2012) proposed transformations that fuse the execution of multiple operators into a single kernel; it supports both thread and thread-block (CTA) composition of operators, yet offers little support for tuning implementation schedules. In the HPC domain, Wahib & Maruyama (2014) formulated GPU kernel fusion as a combinatorial search problem and searched the solution space for an optimized fused kernel. Our work targets Tensorflow computation graphs, and proposes dedicated fusion, tuning, and code generation techniques to achieve high performance.
The parametric representation of the implementation schedule is inspired by Halide Kelley et al. (2018) and TVM Chen et al. (2018); tvm. However, instead of relying on users to specify schedule details, we propose a compact and efficient schedule specification and tuning framework for Tensorflow graphs. Experimental results show a decent performance gain on the fusable portion of the graph. The layered span graph used in our fusion algorithm is inspired by the work/span analysis Guy (1996) of parallel computation DAGs and the layered dependence graph representation Ma et al. (2016) of stencil kernels.
In our work, we do not fuse dense DNN layers, and instead leverage vendor libraries for performance. However, there are recent advances on code generation of fast DNN kernels. Anderson & Gregg (2018) proposed a solution for selecting fast kernel implementations in a global context by formulating the selection as a PBQP problem. Boda Moskewicz et al. (2017) is a code generator for CNN layers on mobile platforms. Latte Truong et al. (2016) is a DSL system for DNNs that lets users specify, synthesize, and optimize code for NN layers. SLINGEN Spampinato et al. (2018) is another DSL system, which takes mathematical specifications and generates optimized C functions for linear algebra operators with small input sizes. These efforts are relevant but complementary to our work.
8 Conclusion and Future Work
In this paper we propose FusionStitching, a deep fusion and code generation system based on the XLA compilation framework for Tensorflow computations. Our system features a critical path analysis to drive fusion decisions, a novel domain-specific schedule specification and tuning mechanism for kernels, and a shared memory optimization technique to enable the composition of large kernels. Experimental results show a notable reduction in the number of GPU kernels, and reasonable E2E performance speedups on our benchmarks.
In practical workloads, many DNN layers have only small to medium sizes. With recent advances in DNN kernel generation, especially on powerful hardware with mixed-precision functionality, it would be interesting to fuse DNN layers as well and solve a global optimization problem.
- tvm. TVM: Open deep learning compiler stack. URL https://github.com/dmlc/tvm.
- tor (2015) Torch nn, 2015. URL https://github.com/torch/nn.
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Anderson & Gregg (2018) Anderson, A. and Gregg, D. Optimal dnn primitive selection with partitioned boolean quadratic programming. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, Vienna, Austria, 2018.
- aymericdamien. TensorFlow-Examples. URL https://github.com/aymericdamien.
- Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
- Chen et al. (2018) Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A. Tvm: An automated end-to-end optimizing compiler for deep learning. In Proceedings of Operating Systems Design and Implemention (OSDI), 2018.
- Cooper, K. D., Harvey, T. J., and Kennedy, K. A simple, fast dominance algorithm.
- Ding & Kennedy (2004) Ding, C. and Kennedy, K. Improving effective bandwidth through compiler enhancement of global cache reuse. Journal of Parallel and Distributed Computing, 64:108–134, 2004.
- Guy (1996) Guy, B. Programming parallel algorithms. Communications of the ACM, Volume 39 Issue 3:85–97, 1996.
- Jia et al. (2014) Jia, Y. Q., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678, 2014.
- Kelley et al. (2018) Kelley, J. R., Adams, A., Sharlet, D., Barnes, C., Paris, S., Levoy, M., Amarasinghe, S., and Durand, F. Halide: decoupling algorithms from schedules for high-performance image processing. Communications of the ACM, Volume 61 Issue 1, 2018.
- Kennedy & Allen (2002) Kennedy, K. and Allen, J. R. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. ISBN 1-55860-286-0.
- Klöckner (2014) Klöckner, A. Loo.py: transformation-based code generation for gpus and cpus. CoRR, abs/1405.7470, 2014. URL http://arxiv.org/abs/1405.7470.
- Ma et al. (2016) Ma, W. J., Gao, K., and Long, G. P. Highly optimized code generation for stencil codes with computation reuse for gpus. Journal of Computer Science and Technology, Volume 31 Issue 6:1262–1274, 2016.
- Moskewicz et al. (2017) Moskewicz, M. W., Jannesari, A., and Keutzer, K. Boda: A holistic approach for implementing neural network computations. In Proceedings of the Computing Frontiers Conference, CF’17, pp. 53–62, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4487-6. doi: 10.1145/3075564.3077382. URL http://doi.acm.org/10.1145/3075564.3077382.
- PAI (2018) PAI. Bringing tvm into tensorflow for optimizing neural machine translation on gpu. 2018. URL https://tvm.ai/2018/03/23/nmt-transformer-optimize.html.
- Spampinato et al. (2018) Spampinato, D. G., Traver, D. F., Bientinesi, P., and Püschel, M. Program generation for small-scale linear algebra applications. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, Vienna, Austria, 2018.
- Truong et al. (2016) Truong, L., Barik, R., Totoni, E., Liu, H., Markley, C., Fox, A., and Shpeisman, T. Latte: A language, compiler, and runtime for elegant and efficient deep neural networks. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’16, pp. 209–223, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4261-2. doi: 10.1145/2908080.2908105. URL http://doi.acm.org/10.1145/2908080.2908105.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
- Wahib & Maruyama (2014) Wahib, M. and Maruyama, N. Scalable kernel fusion for memory-bound gpu applications. In Proceedings of SC’14, New Orleans, LA, USA, 2014.
- Wu et al. (2012) Wu, H. C., Diamos, G., Cadambi, S., and Yalamanchili, S. Kernel weaver: Automatically fusing database primitives for efficient gpu computation. In Proceedings of 45th Annual IEEE/ACM International Symposium on Microarchitecture, Vancouver, BC, Canada, 2012.
- Xiong et al. (2018) Xiong, D. Y., Li, J. H., Branco, A., Kuang, S. H., and Luo, W. H. Attention focusing for neural machine translation by bridging source and target embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 1767–1776, 2018. URL https://aclanthology.info/papers/P18-1164/p18-1164.
- Yu et al. (2014) Yu, D., Eversole, A., Seltzer, M., Yao, K., Kuchaiev, O., Zhang, Y., Seide, F., Huang, Z. H., Guenter, B., Wang, H. M., Droppo, J., Zweig, G., Rossbach, C., Gao, J., Stolcke, A., Currey, J., Slaney, M., Chen, G. G., Agarwal, A., Basoglu, C., Padmilac, M., Kamenev, A., Ivanov, V., Cypher, S., Parthasarathi, M., Mitra, B., Peng, B. L., and Huang, X. D. An introduction to computational networks and the computational network toolkit. Technical report, 2014.