LoopStack: a Lightweight Tensor Algebra Compiler Stack

05/02/2022
by   Bram Wasti, et al.

We present LoopStack, a domain specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while incorporating optimizations often missing from other machine learning compiler backends. We evaluate our stack on a collection of full neural networks and commonly used network blocks as well as individual operators, and show that LoopStack generates machine code that matches and frequently exceeds the performance of state-of-the-art machine learning frameworks in both cases. We also show that for a large collection of schedules LoopNest's compilation is orders of magnitude faster than LLVM, while resulting in equal or improved run time performance. Additionally, LoopStack has a very small memory footprint - a binary size of 245KB and under 30K lines of effective code make it ideal for use on mobile and embedded devices.


1. Introduction

The availability of massive amounts of computing power has fueled the explosive growth of machine learning techniques for almost a decade and has greatly affected the design of modern hardware. Companies, such as NVIDIA with their tensor cores, Intel/AMD with their specialized AVX, FMA, and VNNI instructions, ARM with the Neon and SVE extensions to their instruction set, and Apple with their M1 CPU and its matrix coprocessor, have tailored the computational capabilities of their respective hardware to better serve the needs of deep learning workloads.

However, leveraging the raw computational power of such hardware for the purpose of fast execution of machine learning models remains a challenge. Several approaches have been proposed over time, but they all suffer from significant shortcomings.

The first approach is to rely on libraries of optimized primitive tensor operations, such as cuDNN (chetlur2014cudnn), OneDNN (onednn), or XNNPACK (xnnpack). Unfortunately this approach has three main downsides. First, to cover an ever growing set of use cases, these libraries tend to be large in code size, which hinders their utilization on many devices, such as smartphones or IoT devices. Second, their development and maintenance require large amounts of manpower. Third, operators can only exchange data through global memory, which can be a significant bottleneck, especially in the case of low arithmetic intensity operators such as the activation functions.

To avoid these problems, another approach, pioneered by projects such as Halide (ragan2013halide) and TVM (tvm), represents tensor computations, such as the ones underlying deep learning, using a declarative domain specific language. This high-level representation is then scheduled and lowered into LLVM intermediate representation, and then compiled by the LLVM compiler into instructions that can be executed directly on hardware. However, this approach suffers from very large compilation times, making certain techniques, such as fast auto-tuning, impractical.

To overcome these limitations, we introduce LoopStack, a lightweight toolchain designed specifically for deep learning workloads. To express computations, LoopStack introduces a domain specific representation based on Einstein's notation. This representation can capture many common dense neural network operations in a concise form. LoopStack also provides a frontend capable of converting neural networks, expressed as dataflow graphs, into our representation. Finally, we present a code generator that converts this representation into highly optimized code for x86 and ARM CPUs extremely quickly.

Unlike other approaches that rely on off-the-shelf components such as LLVM, LoopStack was designed from the ground up to target machine learning workloads. This enables our system to make the following contributions:

  • Millisecond compile times, a fraction of those of traditional compilers.

  • Performance comparable to or exceeding that of hand–optimized libraries such as MKL–DNN or XNNPACK.

  • Small code footprint resulting in a lightweight binary.

2. Tensor Contractions

Many tensor operations, such as the inner, outer, and cross products, as well as matrix multiplication, trace, and transpose, can be expressed as special cases of contraction. In fact, with a few extensions, contractions can encode all the operations that the most popular deep neural networks rely on. We demonstrate how to express computations over tensors as generalized contractions. We also discuss the challenges inherent in optimizing (i.e. scheduling) these contractions.

2.1. Generalized Tensor Contractions

Before introducing generalized, $n$-dimensional tensor contractions, we first present a simple example in two dimensions – the matrix multiplication operation. Figure 1 demonstrates two different, though equivalent, ways to describe the matrix multiplication operation. The imperative pseudocode in Figure 1(a) shows the typical set of 3 nested for-loops iterating over dimensions $i$ and $j$ of the output matrix $C$ and the reduction dimension $k$. Note that the reduction over $k$ is explicit in this notation.

In the tensor index (Einstein) notation (ricci1900methodes; einstein1923grundlage; kjolstad2017tensor) (Figure 1(b)), the $i$ and $j$ index variables appear on both the left- and right-hand sides of the formula, indicating that in order to compute the corresponding entry in $C$ we index $A$ and $B$ with the same values of $i$ and $j$. The index variable $k$ represents a reduction variable, as it appears solely on the right-hand side of the formula. In particular, the for-loop over varying values of $k$ for a specific assignment to $i$ and $j$, and the corresponding aggregation (i.e. contraction), is left implicit. Similarly, the for-loops over varying values of $i$ and $j$, needed to populate $C$, are left implicit.

We use generic $\oplus$ and $\otimes$ notation to emphasize that different operators can be used to instantiate a generalized matrix multiplication operation.

for i:
    for j:
        C[i, j] = alpha * C[i, j];
        for k:
            C[i, j] += beta * A[i, k] * B[k, j]
(a) Pseudocode for basic imperative matrix-multiplication.
(b) Equivalent declarative Einstein notation for matrix multiplication: $C[i, j] = \alpha \cdot C[i, j] \;\oplus\; \beta \cdot A[i, k] \otimes B[k, j]$, with $\oplus = +$ and $\otimes = \times$.
Figure 1. Matrix multiplication

We can generalize the matrix-multiplication example to $n$ dimensions, where our inputs are now $n$-dimensional tensors, to arrive at general tensor contractions. Following the notation presented in (gareev2018high), let $A$, $B$, and $C$ now be $d$-dimensional tensors (for potentially different $d$). Let the sets of dimensions of $A$, $B$, and $C$ be denoted by $D_A$, $D_B$, and $D_C$, respectively. Let $D_A \cap D_C$ be the dimensions of $A$ present in $C$, while $D_A \setminus D_C$ are reduction dimensions (dimensions contracted by a user-defined operation, such as the shared dimension of the input matrices in a matrix multiplication) in $A$. Similarly, $D_B \cap D_C$ are the dimensions of $B$ present in $C$, while $D_B \setminus D_C$ are reduction dimensions in $B$. We now write

$C[D_C] \mathrel{\oplus}= A[D_A] \otimes B[D_B],$

where the reduction dimensions $D_A \setminus D_C$ and $D_B \setminus D_C$ are implicitly aggregated with $\oplus$.

We further extend this notation to allow indexing into a tensor to be an affine transformation of the indices.

Finally, we extend our notation to allow for an element-wise operation that initializes the output tensor (pre-operation) and an element-wise operation that can transform a value before final storage into the output tensor (post-operation). We can now write

$C[f_C(D_C)] = \mathrm{post}\big(\, \mathrm{pre}(C[f_C(D_C)]) \;\oplus\; A[f_A(D_A)] \otimes B[f_B(D_B)] \,\big),$

where $f_A$, $f_B$, and $f_C$ are affine transformations and pre/post are the element-wise pre- and post-operations. These concepts are useful for expressing biases and activation functions in machine learning workloads.
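As a concrete illustration (our own, using the notation above), a fully connected layer with a bias and a ReLU activation can be expressed as a single contraction with a pre- and post-operation:

$\text{pre:}\; C[b, i] \leftarrow \text{bias}[i], \qquad C[b, i] \mathrel{+}= W[i, j] \otimes I[b, j], \qquad \text{post:}\; C[b, i] \leftarrow \max(0, C[b, i]).$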

With these extensions, we can not only express GEMM, GEMV, and GEVM operations, but many more operations as well, such as:

  • Convolutions – $O[b, c_o, x] \mathrel{+}= W[c_o, c_i, k] \otimes I[b, c_i, x + k]$ (a loop-level sketch follows this list)

  • Pooling – $O[b, c, x] \mathrel{\oplus}= I[b, c, s \cdot x + k]$, e.g. with $\oplus = \max$

  • Reductions – $O[i] \mathrel{\oplus}= I[i, j]$

  • Transpositions – $O[j, i] = I[i, j]$

  • Concatenations – $O[i] = A[i]$; $O[i + |A|] = B[i]$

  • Broadcast – $O[i, j] = I[i]$
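The convolution case above can be read as the following plain-Python reference loops (our own illustration, not LoopStack code); the only new ingredient relative to a plain contraction is the affine input index x + k:

def conv1d_contraction(I, W):
    # I: input of shape [B][C_in][X_in]; W: weights of shape [C_out][C_in][K]
    B, C_in, X_in = len(I), len(I[0]), len(I[0][0])
    C_out, K = len(W), len(W[0][0])
    X_out = X_in - K + 1
    O = [[[0.0] * X_out for _ in range(C_out)] for _ in range(B)]
    for b in range(B):
        for co in range(C_out):
            for x in range(X_out):
                for ci in range(C_in):
                    for k in range(K):
                        # affine indexing: the input position is x + k
                        O[b][co][x] += W[co][ci][k] * I[b][ci][x + k]
    return O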

2.2. Optimizing Generalized Contractions

The way we carry out these generalized contraction computations has a significant impact on performance. For example, the naive 3-nested for-loop implementation of matrix multiplication of Figure 1(a) will result in long running times for non-trivially sized matrices. In contrast, efficient implementations, as proposed by many researchers over the past few decades (goto2008anatomy; park2000efficient; smith2014anatomy; lam1991cache; gunnels2001family; jia2018optimizing; zlateski2019anatomy), might modify the imperative loops to tile (i.e. split into sub-matrices of appropriate sizes) the input matrices to align with architecture-dependent resources such as cache sizes, re-order loop dimensions to re-use data in the innermost loop, exploit architecture features such as vector instructions to operate on multiple elements at a time, and emit extended instructions such as fused multiply-add, which combine multiplication and accumulation at the instruction level. Further, the data might be kept in exotic memory layouts, such as the channel-interleaved format often used on Intel machines (onednn; zlateski2017compile), or the matrix tile interleaved format proposed by Jia et al. (jia2018optimizing; zlateski2019anatomy).
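As a minimal sketch of the tiling idea (our own, with placeholder tile sizes standing in for cache-derived ones), the loops of Figure 1(a) can be restructured to operate on one output tile at a time:

import numpy as np

def tiled_matmul(A, B, TM=64, TN=64, TK=64):
    # TM, TN, TK are placeholder tile sizes; in practice they are chosen
    # to match cache sizes. NumPy slicing handles ragged edges.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, TM):
        for n0 in range(0, N, TN):
            for k0 in range(0, K, TK):
                # one (TM x TN) tile of C stays hot across the k0 loop
                C[m0:m0 + TM, n0:n0 + TN] += (
                    A[m0:m0 + TM, k0:k0 + TK] @ B[k0:k0 + TK, n0:n0 + TN]
                )
    return C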

Typical scheduling operations include: re-ordering, splitting, fusing, parallelizing, vectorizing, and unrolling loops (ragan2013halide; grosser2011polly). The scheduling options for a single task such as matrix multiplication highlight the combinatorial challenge underlying scheduling. Different computations and platforms require varying schedules to deliver performance gains. This results in a challenging optimization problem. Historically, varying techniques have been put forward to tackle this optimization, ranging from expert-based manual optimization and analysis-driven heuristic optimization (grosser2011polly; leben2019polyhedral), to recent advances in data-driven automated schedule exploration (tvm; adams2019learning).

In the remainder of this paper, we introduce both the frontend (LoopTool) and backend (LoopNest) of LoopStack as well as a TUI for manual tuning and a script for automatic tuning built on top of LoopStack.

3. LoopStack Design

LoopStack is composed of a frontend (LoopTool) and a backend (LoopNest). This design separates the representation of the program being described from the logic to emit target-specific code.

The frontend (LoopTool) is designed to enable expression of both the user's mathematical intent (a chain of extended tensor contractions) and the preferred execution of the program – the order in which basic mathematical operations should be executed (i.e. the schedule) to perform the desired tensor contraction. This is done by defining a structured intermediate representation composed exclusively of a dataflow graph (DFG) together with a loop order annotation language that denotes how each node is executed. The omission of control flow in the representation of the program is consistent with the large majority of neural networks, which can be well represented as purely functional operations on N-dimensional tensors. The design of this IR – both the DFG and the loop order annotations – is explained thoroughly in Section 4.

The backend, LoopNest, is designed to efficiently generate optimized machine code that strictly adheres to the user's input. LoopNest exposes an API that allows the user to define a specific loop nest order in which the computation should be performed; LoopNest will not make any attempt to change the provided nest order or use any logic to recover user intent. In short – LoopNest will generate machine code that performs the exact computation that the user requested, and in the same order. However, LoopNest will attempt to apply all known low level optimizations that are relevant to the target hardware, such as loop unrolling (jiang2018efficient; zheng2018optimizing; zhao2018hardware; wiki:Loop_unrolling; kennedy2001optimizing; jeffers2013intel; spampinato2014basic), vectorization (zheng2018optimizing; kennedy2001optimizing; cornea2015intel; jeffers2013intel; spampinato2014basic), reordering of independent instructions (jiang2018efficient; zhao2018hardware; wiki:Instruction_pipelining; kennedy2001optimizing; cornea2015intel; jeffers2013intel; spampinato2014basic), software pipelining (zhao2018hardware; wiki:Instruction_pipelining; kennedy2001optimizing; cornea2015intel; jeffers2013intel; spampinato2014basic), and minimization of auxiliary operations. More about LoopNest code generation and optimizations is discussed in Section 5.
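As a schematic illustration of the information such a schedule conveys (our own sketch, not LoopNest's actual API), a matrix-multiplication schedule can be thought of as an ordered list of (dimension, step) pairs plus an innermost operation; LoopNest emits code that follows exactly this order:

# Hypothetical schedule description (not the real API): loops are listed
# outermost to innermost; a repeated dimension means the loop was split.
matmul_schedule = [
    ("m", 256), ("n", 256), ("k", 256),  # cache-level tiles
    ("m", 4), ("n", 32),                 # register block of the output C
    ("k", 1),                            # innermost reduction, vectorized
]
innermost_op = "C[m, n] += A[m, k] * B[k, n]"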

3.1. Design principles and goals

3.1.1. Simplicity

The goal of simplicity extends to both the description of the intended mathematical operations and the language for defining the execution order (the schedule) of the resultant loops.

We use an Einstein-like notation (einstein1923grundlage) embedded in Python, described in Section 4, as it allows the user to easily describe the intended mathematical operations even for complex machine learning models. For instance, a multi-layer perceptron (MLP), which is often used for NLP, recommendation models, and so on, can be defined as shown in the listing below.

HL1[b, i] += W1[i, j] * Input[b, j]
Act_1[b, i] = lt.relu(HL1[b, i])
HL2[b, k] += W2[k, i] * HL1[b, i]
# ...
Out[b] = lt.sigmoid(WN[o, p] * HLN[b, p])

Listing: MLP defined with our Einstein notation-like syntax.

We provide a minimal IR and limit the API such that almost all optimization operations can be decomposed into a series of operations on individual nodes in the DFG. Wrappers are provided to manipulate nodes in bulk (e.g. to change the loop order of both the multiplication and the addition in a matrix multiplication at once), yielding an interface similar to Halide's pipeline scheduling states (adams2019learning). However, unlike in Halide, it is impossible to represent illegal schedules in our IR. We discuss the optimizations in detail in Section 6.

There are only a handful of global optimization parameters, such as the unroll limit and the parameters describing the target hardware, most of which are encoded inside LoopNest.

3.1.2. Performance Predictability & Fast Feedback

Traditional compilers perform expensive analysis in order to understand user intent and optimize execution while performing equivalent computation. This approach has two major drawbacks. First, the user does not get accurate feedback about the quality of their intended schedule of execution; and second, the compilation time is significantly increased due to the expensive analysis. LoopStack takes a different approach. Once the user annotates the IR, no layer in LoopStack attempts to understand the user's intent or reorder the schedule provided by the user, except in very few, hardware-specific cases. Currently, these cases only include reordering mathematical operations inside the innermost unrolled loop. Such optimizations have a relatively minor effect on the overall performance – they increase the utilization of the underlying hardware by a few percentage points, but also allow the user to squeeze out the last bits of the available computational power. In addition, such optimizations depend highly on the limitations of the target hardware, and are not easy to expose to the user.

This approach allows LoopStack to perform extremely fast compilation (on the order of milliseconds), and allows the user to understand the quality of their intended schedules and rapidly explore scheduling ideas. In addition, this approach can greatly benefit autotuning algorithms (adams2019learning; zheng2020ansor) by leveraging both the predictability and the efficiency of benchmarking schedules.

4. Frontend (LoopTool)

In order to allow users to effortlessly describe the desired computation of neural network workloads using LoopStack, we built a domain specific language (DSL) embedded in Python and a compiler front-end that we termed LoopTool. LoopTool exposes a declarative API for user definition of computation and operates on an IR composed of an annotated dataflow graph (DFG) consisting of N-dimensional tensor operations. The DFG describes the underlying computation, memory layouts and execution. LoopTool then lowers the DFG into a series of loops, which are then compiled with LoopNest.

4.1. Declarative API

import loop_tool as lt
M, N, K = lt.Var("m"), lt.Var("n"), lt.Var("k")
A = lt.Tensor([M, K])
B = lt.Tensor([K, N])
C = lt.Tensor()
C[M, N] += A[M, K] * B[K, N]
Figure 2. LoopTool’s Python embedded declarative DSL

Figure 2 shows an example of matrix multiplication in LoopTool's declarative Python DSL. The expression lt.Var("m") defines an indexing variable with the corresponding name. Next, the expression lt.Tensor([M, K]) defines a 2-dimensional tensor. Note that LoopTool uses symbolic (i.e. named) dimensions (namedtensordims), which simplify indexing semantics and encourage a simple interaction model for manipulating traversal and memory layouts of higher dimensional tensors. The expression C[M, N] += A[M, K] * B[K, N] defines computation using the aforementioned tensor (Einstein) notation; here the k dimension is a reduction dimension. LoopTool's computation language supports element-wise computations (with broadcast semantics), associative reductions across arbitrary dimensions, and a restricted set of indexing semantics (see Section 4.2.3).

4.2. Intermediate Representation (IR)

LoopTool's DFG is based on an intermediate representation with annotations on each node. Each node is associated with its output – a virtual buffer that is materialized according to the user provided schedule. Figure 3 (left) demonstrates a matrix multiplication without annotations. A node's output size is not materialized until the IR is lowered to loops. The materialization logic always attempts to minimize the total memory used. For example, two nodes operating on a virtual buffer of size $N$ in a shared loop over $N$ do not need to allocate the full $N$ elements of memory for their intermediate. Instead, the intermediate will be of size 1 and reused $N$ times. This can be seen in the listings of Section 4.3.

LoopTool’s DFG has three fundamental types of nodes.

[Figure 3 dataflow graphs: left – read(0) %a[m, k] and read(1) %b[k, n] feed multiply %tmp[m, k, n], which feeds add %c[m, n], which feeds write(2); right – read(0) %a[x, y, z] feeds f %b[x, y, z], which feeds write(1).]
Figure 3. Matrix multiplication in LoopTool (left). A point-wise application of the function f across all three dimensions of %a (right).

4.2.1. Read/write nodes

Read and write nodes denote reads or writes from and into user provided memory, as well as the associated layouts and sizes. These nodes contain an ordered list of symbolic dimensions. The ordered list of dimensions represents a row-major (lexicographical) order of either input or output memory. Because LoopStack is embedded in either Python or C++ applications, these nodes are used to interface with other operations that are not handled by LoopStack. An example would be the specification of NCHW (pytorch) or NHWC (abadi2016tensorflow) layouts for the inputs to convolution operations, both of which are well handled and trivially manipulated in the LoopTool IR. In the IR, read nodes take no inputs whereas write nodes take a single input and have no outputs. These are fundamentally source and sink operations in the graph.

4.2.2. Arithmetic nodes

Arithmetic nodes operate on virtual buffers and output a single virtual buffer. The ordered list of dimensions associated with these nodes denotes how output memory should be laid out (given the corresponding scope of their execution).

Arithmetic nodes, unlike read and write nodes, can take multiple inputs, to which the node's operation is applied in the typical fashion. For example, addition of two inputs works as expected to yield a single output.

Extending this concept to higher dimensions forces us to consider the case when two inputs do not have the same dimensions. Due to the symbolic nature of the dimensions in LoopTool, we can distinguish two dimensions of the same size as being mathematically distinct. To handle application of arithmetic in higher dimensions we employ implicit broadcasting akin to NumPy (oliphant2006guide) semantics. Dimensions not present in the output are implicitly reduced over according to the arithmetic of the associated node.
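A NumPy analogy (our own) of these semantics: a dimension shared by name is aligned, a dimension missing from one input is broadcast, and a dimension absent from the output is implicitly reduced by the node's operation:

import numpy as np

A = np.random.rand(8, 4)        # dimensions (m, k)
B = np.random.rand(4)           # dimension  (k,)
C = np.einsum("mk,k->m", A, B)  # k is absent from the output, so it is summed over
assert C.shape == (8,)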

4.2.3. View nodes

The final type of node admits the representation of symbolic indexing constraints over tensor dimensions, allowing powerful view semantics. Any affine combination of iteration over the output dimensions can be used to index into an input dimension. This enables us to represent windowed operations as well as concatenations. By using index constraints rather than index equations, we preserve the meaning of the underlying computation and can freely split and permute variables and layouts. Further, by keeping indexing math symbolic, we can still apply the scope based memory minimization logic.
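For example (our own illustration of such constraints), a strided window over an input dimension and a concatenation along a dimension can be written as

$x_{\text{in}} = s \cdot x_{\text{out}} + k, \;\; 0 \le k < K \qquad \text{and} \qquad x_{\text{in}} = x_{\text{out}} - |A|, \;\; 0 \le x_{\text{in}} < |B|,$

where the first constraint indexes the input of a windowed operation with stride $s$ and window size $K$, and the second indexes the second operand $B$ of a concatenation with first operand $A$.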

4.3. Lowering to loops

LoopTool lowers its IR to an internal loop tree structure before invoking LoopNest. This is done by traversing the DFG topologically and eagerly emitting loops that are required for each node. A reference to the current innermost loop is maintained throughout this entire process, with each subsequent loop nesting inside the reference. If any node requires a loop that is an ancestor to the current reference, there is no need to emit the loop again (it would be incorrect to do so) and it is skipped. This naturally induces loop coalescing (typically referred to as loop fusion).

Consider the nodes shown in Figure 3 (right), and assume both %a and %b are annotated with loop order [x, y, z]. When visiting %a's node, we emit the loop tree shown in the first listing below.

Later, while visiting %b's node, we note that the reference (which is at location %a) contains a loop for x, y, and z, in order. We thus reuse all loops and implicitly "fuse" node %b. This is shown in the second listing below. The nodes can be executed in the same innermost loop, and the resultant allocation size of %a is 1 in this case.

iter x:
  iter y:
    iter z:
      %a[x, y, z] = read(x, y, z)

Listing: Lowering %a emits three new loops.

iter x:
  iter y:
    iter z:
      %a[x, y, z] = read(x, y, z)
      %b[x, y, z] = f(%a[x, y, z])

Listing: Lowering %b reuses all loops.

However, if the nodes had different loop annotations, such as %b annotated with [x, z, y], we would not be able to share every loop. The memory allocated for %a in that case would be of size $|y| \cdot |z|$. This is shown in the listing below.

iter x:
  iter y:
    iter z:
      %a[x, y, z] = read(x, y, z)
  iter z:
    iter y:
      %b[x, y, z] = f(%a[x, y, z])

Listing: Only loop x is shared across the two nodes.

In the case of reductions, we cannot always share loops. Consider a reduction node %R over variable z with loop order [x, y, z], depended on by %a with the same loop order; see the listing below. The loop for z is required to run twice for correctness. This type of behavior, while necessary for reductions, may be preferable in other contexts as well.

iter x:
  iter y:
    iter z:
      %R[x, y] = reduction()
    iter z:
      %a[x, y, z] = %R[x, y] + ...

Listing: Sharing loop z between %R and %a would be mathematically incorrect.

Manually preventing loop sharing can induce larger intermediate memory allocations. This is often beneficial when the computation benefits from packing memory into a cache-friendly layout beforehand. To express this, LoopTool has a second form of annotation for nodes, called staging, that prevents reuse of specific loops. In the earlier listing where %a and %b share an entire loop nest, if we were to stage the loop for %a over z, the resultant lowering would increase the allocation size of %a to $|z|$ (see the listing below).

iter x:
  iter y:
    iter z:
      %a[x, y, z] = read(x, y, z)
    iter z:
      %b[x, y, z] = f(%a[x, y, z])

Listing: z is staged, so %a is materialized with an allocation of size $|z|$.

5. Backend (LoopNest)

LoopNest is a domain specific compiler for a series of nested loops with a user-specified innermost operation. LoopNest is designed to have extremely short compilation times and to use well known and studied high-performance computing (HPC) optimizations to generate high performance code.

5.1. LoopNest’s HPC Philosophy

While traditional compilers typically perform multiple optimization passes, some of which might be repeated, LoopNest is designed to do only a limited number of optimizations, but do them very well. These include HPC optimizations that are commonly used in expert-designed, custom assembly or code generator primitives for specific problems, such as matrix multiplications (van2015blis; barrachina2008evaluation; wang2014intel; heinecke2016libxsmm; heinecke2015libxsmm; khudia2018open) and other primitives used in machine learning (chetlur2014cudnn; zlateski2018deeper; zlateski2019anatomy; jia2018optimizing; onednn; elsen2020fast; heinecke2016high). Additional HPC optimizations might include instruction reordering, r-sum, and other optimizations suggested by the optimization manuals for the target hardware (intel2014intel; arm2016arm).

LoopNest's generated code directly reflects the computation requested by the user: in particular, operations are performed in the user-specified order. Generating code directly from the user-defined execution order simplifies code generation logic, which results in a more intuitive mapping between the quality of the provided execution order and the observed performance of the code. In addition, the provided (high level) information about the loop order and sizes is used for LoopNest's simple register allocation approach, described below.

While LoopNest has default choices based on the target hardware manuals, it optionally delegates the choice of which loops to unroll. The user may override the hardware-tuned default maximum number of unrolled instructions, and LoopNest will determine the outermost loop such that the number of unrolled instructions does not exceed the user-specified value.
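A minimal sketch (our own, not LoopNest's implementation) of that decision: walk from the innermost loop outward, multiplying in each loop's trip count, and stop at the outermost loop whose full unrolling still respects the instruction budget:

def outermost_unrolled_loop(trip_counts, innermost_instrs, max_unrolled):
    # trip_counts is ordered outermost to innermost; returns the index of the
    # outermost loop that can be fully unrolled (len(trip_counts) if none).
    unrolled = innermost_instrs
    depth = len(trip_counts)
    for i in reversed(range(len(trip_counts))):
        if unrolled * trip_counts[i] > max_unrolled:
            break
        unrolled *= trip_counts[i]
        depth = i
    return depth

# e.g. outermost_unrolled_loop([64, 8, 4, 16], 1, 128) returns 2: the two
# innermost loops are unrolled (4 * 16 = 64 <= 128, while 8 * 64 > 128).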

Vectorization

LoopNest assumes that the user's intent is to vectorize the innermost loop in the user provided schedule. This design simplifies the logic while not introducing any limitations for the user – the elements of a vector operation are executed at the same time, and thus naturally belong to the innermost loop. LoopNest will, thus, attempt to vectorize the innermost loop. However, in certain scenarios, when the data accessed inside the innermost loop is not contiguous and the target hardware does not support efficient gather operations, LoopNest will fall back to scalar operations.

No–Spilling of the Resulting Tensor

The concept of tiling (or "blocking") for the multiple levels of a cache hierarchy and the register file is a well known optimization technique (goto2008anatomy; gunnels2001family; lam1991cache; smith2014anatomy), referred to as Cache Blocking Techniques in Intel's optimization manual (intel2014intel). LoopNest exposes this common HPC optimization, where a subset of the output tensor is kept in the register file (onednn; heinecke2015libxsmm; jia2018optimizing; zlateski2018deeper; zlateski2019anatomy; heinecke2016libxsmm; heinecke2016high). LoopNest never produces code that spills the content of the register file to the stack, an approach that is commonly used in traditional compilers, such as LLVM or GCC. Spilling the content of the register file to the stack puts pressure on the hardware's load/store units, preventing full hardware utilization.

LoopNest does not decide on blocking or tiling sizes, nor on the size used for the data kept in registers (register blocking). LoopNest instead identifies the outermost user provided loop for which all compute can be performed with a subset of the output tensor kept in registers. Thus, it is up to the user to provide a well chosen loop order and sizes – ones where the values kept in the register file can be reused many times. This approach gives the user greater control, with more predictable performance, as compared to the case when spilling is allowed.
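A rough worked example (our own, assuming an AVX512 target with 32 vector registers of 16 fp32 lanes each): a 4 x 3 block of output vectors keeps 12 accumulator registers live, leaving the remaining registers for broadcast values of one operand and loaded vectors of the other:

VEC_REGS, LANES = 32, 16                 # assumed AVX512 parameters
block_rows, block_col_vecs = 4, 3        # candidate register block of the output
acc_regs = block_rows * block_col_vecs   # 12 registers hold the output block
spare_regs = VEC_REGS - acc_regs         # 20 registers left for the A and B operands
c_elems_in_regs = acc_regs * LANES       # 192 output elements never spilled
print(acc_regs, spare_regs, c_elems_in_regs)  # 12 20 192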

5.2. Single Operand LoopNest

LoopNest also provides functionality for generating efficient code for a simplified tensor contraction, where there are no reduction dimensions and only one input is provided. This functionality is typically used for reshaping a tensor (such as with NumPy's reshape function), but can also be used for extracting a subset of a tensor into a smaller tensor, or for broadcasting elements along a tensor dimension. This functionality is required by LoopStack in order to support user schedules that reorganize memory for faster access (goto2008anatomy; gunnels2001family; lam1991cache; smith2014anatomy).

5.3. Generalizing to computations with multiple loop nests

Some computations or their schedules can result in a sequence of multiple nested loops that may share a set of outer loops, effectively forming a tree of loops. To generalize our approach to these workloads, we developed a loop tree interface. The loop tree interface provides a simple API to build up a tree, where inner nodes correspond to for-loops, and leaves correspond to an innermost computation over tensors or a transposition of tensors. LoopNest then compiles all independent nests and executes the tree. The final result is a function that can be called with the appropriate input, intermediate, and output tensors to realize the computation defined by the tree.
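A schematic sketch (our own, not the actual interface) of such a tree, reusing the reduction example from Section 4.3: the two inner nests share the outer x and y loops, and the leaves are the innermost computations:

loop_tree = ("for", "x", [
    ("for", "y", [
        ("for", "z", [("compute", "%R[x, y] = reduction()")]),
        ("for", "z", [("compute", "%a[x, y, z] = %R[x, y] + ...")]),
    ]),
])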

6. IR Manipulation and Optimization

6.1. Interaction Model

A core challenge in high-performance tensor operations is identifying a good execution order – the schedule. As noted previously, there can be many valid schedules for a given computation: loops can be split and reordered, dimensions can be split and swapped, intermediate tensors can be packed, and so on. As a result, any scheduling tool is faced with the challenge of an exponentially large search space making simple enumeration infeasible. Recent work (zheng2020ansor; adams2019learning) has explored using a combination of machine learning and structured search strategies to automatically explore the search space.

While automation presents a promising direction, expert intuition (and its use in a guided search) remains an important factor in developing high performance kernels today. As a result, a key goal of LoopStack is to facilitate exploration of schedules. Effective exploration depends on real-time feedback, which an expert can use to validate their intuitions and incrementally explore the space of possible schedules. To deliver real-time, instantaneous feedback (nielsen1994usability), LoopTool exploits LoopNest's fast code generation to compile and benchmark schedules with low latency (on the order of 10-100ms for programs not bounded by the execution of the underlying generated code).

6.2. Manual Tuning

To promote human driven exploration with LoopTool, we developed a terminal based user interface (TUI). Fig. LABEL:fig:manual-demo-bench-0 presents a matrix multiplication example using LoopTool's visual TUI. The schedule for the computation is reflected in the TUI, which provides controls to interactively manipulate the schedule, including splitting and reordering loops and memory. LoopTool can automatically benchmark the schedule presented and display statistics useful for optimization. As shown in Fig. LABEL:fig:manual-demo-bench-0, LoopTool reports the total required number of operations (FLOPs), the arithmetic intensity (the ratio of FLOPs performed to memory transferred (williams2008roofline; ilic2014cache)), and the achieved performance in billions of floating point operations per second (GFLOPS). This information allows experts to iterate on kernel schedules quickly, and non-experts to ramp up on schedule and hardware performance characteristics.
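As a small worked example (our own) of the statistics the TUI reports, consider a 512 x 512 x 512 single-precision matrix multiplication in which each tensor is read or written exactly once:

M = N = K = 512
flops = 2 * M * N * K                      # one multiply and one add per (m, n, k)
bytes_moved = 4 * (M * K + K * N + M * N)  # three fp32 tensors touched once
intensity = flops / bytes_moved            # ~85 FLOPs per byte
print(flops / 1e9, intensity)              # ~0.27 GFLOP of work, intensity ~85.3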

The speed with which a developer can experiment with different schedules allows exploring solutions optimally tuned for the computation and hardware at hand. For example, LoopTool can quickly show the impact of different layouts for inputs or output, or the use of intermediate memory (shared across sequentially composed kernels) sized for the system’s cache.

An example of manual optimizations on an AVX512 capable CPU with a peak performance of around 190 GFLOPS can be seen in Figures LABEL:fig:manual-demo-bench-0, LABEL:fig:manual-demo-bench-1, and LABEL:fig:manual-demo-bench-2. The initial, trivial schedule, and the poor performance it achieves (less than 1% utilization), are shown in Fig. LABEL:fig:manual-demo-bench-0. Fig. LABEL:fig:manual-demo-bench-1 shows the process of splitting and reordering loops, which resulted in around 40% utilization. Finally, Fig. LABEL:fig:manual-demo-bench-2 shows the process of introducing memory packing, resulting in a final schedule that achieves 88.5% utilization on average (over many invocations of the compiled schedule).

6.3. Scripted Tuning

As we have discussed above, an efficient tensor contraction schedule will use cache/register blocking (also referred to as tiling), a technique that improves performance by increasing data locality and that has been studied for decades (stone1970logic; wolfe1989more; brown1975numerical; wolf1991data; irigoin1988supernode; bouwmeester2012tiled; kurt2020efficient; hong2019adaptive; wiki:lnopt; metcalf1980fortran). An expert familiar with such techniques, as well as with the design and limitations of modern hardware, can easily script a reasonable sweep (on the order of hundreds to thousands of schedules) of parameterizations of these common techniques.

To demonstrate this, we wrote a simple three-step automatic tuning script. Clocking in at under 1000 lines of code, this script sweeps through possible tile sizes that are expected to perform well on modern hardware (as presented in (goto2008anatomy; smith2014anatomy)) and runs on the order of seconds to minutes on a single core for most individual neural network operations or blocks. The script works as follows.

First, we define a set of tile sizes (based on modern hardware's cache sizes) that will be used in conjunction with a set of register blocking sizes (based on current hardware's register file size). This approach mimics the last two levels of what both Goto (goto2008anatomy) and Smith (smith2014anatomy) refer to in "Anatomy of High-Performance Matrix Multiplication". Given the size of the register file of a given hardware target, we sweep over different register blocking sizes that fit in the register file. For each such register blocking size, we also sweep over blocking sizes (as well as the order of compute (stone1970logic; wolfe1989more; brown1975numerical; wolf1991data; irigoin1988supernode; bouwmeester2012tiled; kurt2020efficient; hong2019adaptive; wiki:lnopt; metcalf1980fortran)). We then benchmark every combination of tile size and register block, saving the best configuration for the next step.

In the second step we partially sweep the set of all permutations of the loop order. At this point, with both tiling and register blocking, the loop nest tends to be on the order of 10 loops deep, which makes a full sweep infeasible (given our time constraint of minutes or less). We approximate tuning over every permutation by instead permuting windows of the loop nest. Starting with the innermost 5 loops, we try all permutations while keeping the outermost loops fixed. We then continue this process upward, saving the best nesting order each time.
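A compact sketch (our own, with a hypothetical benchmark callback that returns a schedule's measured runtime) of this windowed sweep:

import itertools

def windowed_permutation_search(order, benchmark, window=5):
    # order lists loops outermost to innermost; benchmark(order) -> runtime.
    best = list(order)
    for start in range(max(len(best) - window, 0), -1, -1):  # innermost window first
        head, mid, tail = best[:start], best[start:start + window], best[start + window:]
        candidates = [head + list(p) + tail for p in itertools.permutations(mid)]
        best = min(candidates, key=benchmark)  # keep the best order found so far
    return best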

Finally, we attempt to inject memory packing to increase cache use. This involves a mix of permuting virtual memory layouts to align with later reads and writes and staging loops to induce intermediate memory usage. This helps on processor architectures that benefit from faster memory access and on compute-bound problems with large inputs.

It is important to note that the above optimizations vastly reduce the search space of schedules by relying on linearizing assumptions. We feel the time saved during the search is a good compromise that enables realistic use of tuning in research or production settings.

7. Evaluation

We performed a series of experiments in order to answer the following questions:

  • How does the compile time of LoopNest compare to that of traditional compilers, such as LLVM?

  • How does the generated code perform compared to code obtained with traditional compilers?

  • How does code generated by LoopTool compare against state of the art ML primitives, as well as end–to–end models?

  • How easy is it to extend LoopTool to support new instruction sets?

7.1. Setup

Experiments are performed on a total of 6 different CPU cores and 4 different vector extensions: an Intel Xeon Platinum 8124M with the avx512 vector extension; an AMD Epyc 7R32 with the avx2 vector extension; Cortex A57, Cortex A73, NVIDIA Denver 2, and Apple M1 CPUs, all with ARM's neon vector extension; and finally, the Apple M1 with the neon_fp16 vector extension. We perform all experiments single-threaded, using a single core on the target hardware, in order to analyze the contributions of our work. Multi-threaded workloads would also depend on the underlying threading implementation, which could skew our desired measurements.

7.2. Operator Benchmarking vs Traditional Compilers

A key contribution of LoopNest, the backend of LoopStack, is the reduced compile times required for generating high performance code. To validate this contribution, we compare LoopStack to LLVM, which is a popular backend choice for existing tensor operation compilers such as Halide (ragan2013halide; steiner2021value) and TVM (tvm; autotvm).

Table 1 presents the 12 operators we have evaluated, consisting of matrix multiplications (fully connected layers), convolutions, and depth-wise convolutions over varying input sizes. To provide a fair comparison of LLVM and LoopStack, we evaluate both systems on the same set of schedules and perform pair-wise comparisons. For LLVM code generation, we use Halide to emit schedules identical to the ones used with LoopStack. We ensure that LLVM's runtime assertions, as well as bound checks, are turned off. As LLVM is a general, catch-all compiler, performing an apples-to-apples comparison is not an easy task; for that reason, we instruct LLVM to use the same optimizations as the ones used in practice – such as those used in Halide (gareev2018high; steiner2021value; adams2019learning) and TVM (tvm; autotvm). We focus our evaluation on two key dimensions: compile time and execution time.

We generate a set of schedules with the rule-based heuristic scheduler referred to in Section 6.3, which performs a sweep of execution schedule optimizations such as tiling. We attempt to collect schedules seen throughout this tuning process, but only collect schedules that result in a single loop nest (i.e. there are no shared outer loops across different nested inner loops). After we obtain a full set of schedules for a benchmark, we evaluate both LoopStack and LLVM on them.

Operator Sizes Benchmark Name
GEMM 64 x 64 x 64 MM-64
GEMM 128 x 128 x 128 MM-128
GEMM 256 x 256 x 256 MM-256
GEMM 512 x 512 x 512 MM-512
Convolution (channels:64 to 128), size:56x56, filter:3x3, stride:1 CONV-1
Convolution (channels:128 to 256), size:28x28, filter:3x3, stride:1 CONV-2
Convolution (channels:256 to 512), size:14x14, filter:3x3, stride:1 CONV-3
Convolution (channels:512 to 512), size:7x7, filter:3x3, stride:1 CONV-4
Depth-wise Convolution (channels:16), size:112x112, filter:3x3, stride:2 DWCONV-1
Depth-wise Convolution (channels:72), size:56x56, filter:3x3, stride:2 DWCONV-2
Depth-wise Convolution (channels:88), size:28x28, filter:3x3, stride:1 DWCONV-3
Depth-wise Convolution (channels:240), size:14x14, filter:5x5, stride:1 DWCONV-4
Table 1. Summary of operator benchmarks for LoopNest/LLVM comparison

Table 2 presents a summary of compile times across benchmarks. The "LLVM" and "LoopNest (LN)" columns present the average number of milliseconds to generate machine code for schedules for the corresponding benchmark. The "Ratio" column presents the average ratio of LLVM compile time relative to LoopNest compile time. Our experiments show that LoopNest's code generation is faster than LLVM's compilation for all schedules, all workloads, and all target hardware. In most cases LoopNest can generate code orders of magnitude faster. These significantly shorter compilation times enable the use of LoopNest in automated schedule searches.

  x86 based CPUs   Aarch64 (ARM) based CPUs (NEON)
  AMD (AVX2) Intel (AVX512)   Cortex A57 NVIDIA Denver2 Cortex A73 Apple M1
  LLVM LN Ratio LLVM LN Ratio  LLVM LN Ratio LLVM LN Ratio LLVM LN Ratio LLVM LN Ratio
CONV-1  820.78 2.5153 326.31 9881.9 6.2062 1592.3  2948.4 9.2109 320.1 3972.4 30.28 131.19 4914.6 22.077 222.61 555.24 24.763 22.422
CONV-2  1590.1 11.717 135.7 9531.3 5.8035 1642.3  2481.1 8.9475 277.29 4798.5 21.768 220.44 4310.1 31.053 138.8 531.12 15.261 34.802
CONV-3  762.52 5.9325 128.53 11061 11.411 969.29  3113.2 17.749 175.4 4184.3 24.184 173.02 3919.1 25.51 153.63 444.29 20.435 21.742
CONV-4  885.74 41.163 21.518 10706 14.264 750.56  4084.3 66.102 61.788 5169.9 85.97 60.136 6408.8 99.952 64.119 827.22 35.605 23.233
DWCONV-1  838.46 0.29416 2850.3 10551 0.28027 37647  2775.9 3.8011 730.28 4108.2 5.0464 814.09 4844.8 2.8047 1727.4 571.46 1.209 472.67
DWCONV-2  1033.8 0.47162 2192 11812 0.67874 17402  2795.3 2.5241 1107.4 3814.5 5.2036 733.06 4160.9 1.8825 2210.3 470.53 0.5241 897.79
DWCONV-3  969.88 0.30031 3229.6 13759 1.4601 9423.3  2726.6 2.158 1263.5 3666.1 4.2326 866.15 3320 3.8216 868.74 506.63 0.66237 764.87
DWCONV-4  1000.7 0.33429 2993.4 10704 0.81343 13159  3474 5.5112 630.34 4577.5 9.4771 483.01 4197.4 2.8767 1459.1 449.89 1.6 281.18
MM-64  697.37 0.38452 1813.6 9099.5 0.85309 10667  3432.4 7.3588 466.43 4779.2 13.103 364.75 4417.6 9.1708 481.7 397.59 1.3267 299.68
MM-128  925.47 1.578 586.47 9044.8 0.79484 11379  4991.9 11.153 447.57 5754.2 16.025 359.07 6681.1 13.956 478.72 387.82 1.1188 346.63
MM-256  1118.5 2.6925 415.41 18119 3.0202 5999.4  5003.1 18.511 270.28 5770.9 26.485 217.9 6800.6 27.446 247.78 390.29 5.1647 75.57
MM-512  1262.3 4.3405 290.83 12336 4.4847 2750.8  4814.2 7.4852 643.17 5698.9 12.25 465.22 6471.7 14.545 444.94 1084.4 9.5121 114
Table 2. Compile times, in milliseconds, for LLVM and LoopNest. LoopNest performs the compilation orders of magnitude faster

In Table 3, we show the average run-times of the top 5 fastest schedules for each compiler. While LLVM does outperform LoopNest on some very inefficient schedules (not presented), we decided not to include slower schedules in our analysis, as such schedules would not be used in practice. Our results suggest that LoopNest generates code whose performance is at least comparable to, and often exceeds, that of the code generated by LLVM – while taking just a fraction of the time. Additionally, as a dependency, LoopStack carries only a 245KB binary, as compared to LLVM, whose binary is orders of magnitude larger; this makes LoopNest an obvious choice for mobile and embedded systems.

  x86 based CPUs   Aarch64 (ARM) based CPUs (NEON)
  AMD (AVX2) Intel (AVX512)   Cortex A57 NVIDIA Denver2 Cortex A73 Apple M1
  LLVM LN Ratio LLVM LN Ratio  LLVM LN Ratio LLVM LN Ratio LLVM LN Ratio LLVM LN Ratio
CONV-1  86.767 87.638 1.01 72.943 162.48 2.2274  11.021 11.126 1.0095 10.968 14.782 1.3477 9.8466 9.3994 0.95458 98.121 97.51 0.99377
CONV-2  45.765 71.482 1.5619 13.813 156.78 11.35  11.276 11.179 0.99134 13.407 14.511 1.0824 9.8105 9.2468 0.94254 95.743 92.027 0.96119
CONV-3  9.4946 72.433 7.6289 96.448 184.4 1.9119  8.1811 10.183 1.2447 6.4069 13.27 2.0712 6.4671 7.9984 1.2368 53.681 83.252 1.5509
CONV-4  3.307 90.722 27.434 123.73 181.72 1.4687  5.3246 9.5687 1.7971 4.3611 12.532 2.8736 2.9514 7.686 2.6042 46.806 77.952 1.6654
DWCONV-1  48.695 62.541 1.2844 48 57.592 1.1998  6.3521 6.2343 0.98146 10.243 11.858 1.1577 3.9988 4.5019 1.1258 40.133 39.01 0.97201
DWCONV-2  39.203 53.465 1.3638 28.302 37.671 1.331  5.1196 5.7031 1.114 7.1075 7.9688 1.1212 2.5178 2.7729 1.1013 42.865 43.438 1.0134
DWCONV-3  62.873 84.848 1.3495 57.544 88.094 1.5309  6.9338 7.7209 1.1135 10.65 12.551 1.1785 5.7644 6.4571 1.1202 70.796 76.261 1.0772
DWCONV-4  77.071 84.21 1.0926 125.04 159.38 1.2747  9.6875 10.174 1.0502 12.665 14.717 1.1621 7.6075 8.0169 1.0538 81.588 80.901 0.99157
MM-64  85.808 102.2 1.191 144.67 187.65 1.2971  12.751 13.441 1.0541 14.039 15.754 1.1221 9.523 10.166 1.0675 91.18 199.81 2.1913
MM-128  92.692 102.5 1.1058 168.84 185.19 1.0969  12.417 12.496 1.0064 14.253 15.291 1.0728 9.2462 9.5062 1.0281 91.763 98.4 1.0723
MM-256  92.862 100.21 1.0791 170.18 182.46 1.0722  11.428 11.401 0.99761 14.241 14.645 1.0284 8.6863 8.6103 0.99125 95.597 99.902 1.045
MM-512  90.189 98.199 1.0888 160.42 159.59 0.9948  6.9971 8.3267 1.19 14.395 13.894 0.9652 6.3364 6.8905 1.0874 89.618 97.65 1.0896
Table 3. Measured execution performance, in GFLOPS, for code generated by LLVM and LoopNest. LoopNest achieves comparable (within measurement error) or superior performance, while taking a fraction of the time to generate the code.
Figure 7. Comparison of MKL-DNN, Scripted, and Manual tuning using LoopTool on an AMD EPYC 7R32 CPU core using the AVX2 vector extension.
Figure 8. Comparison of MKL-DNN, Scripted, and Manual tuning using LoopTool on an Intel(R) Xeon(R) Platinum 8124M CPU core using the AVX512 vector extension.
Figure 9. Comparison, in GFLOPS, of XNNPACK, Scripted, and Manual tuning using LoopTool on an ARM Cortex A57 CPU core using the neon vector extension.
Figure 10. Comparison, in GFLOPS, of XNNPACK, Scripted, and Manual tuning using LoopTool on an ARM Cortex A73 CPU core using the neon vector extension.
Figure 11. Comparison, in GFLOPS, of XNNPACK, Scripted, and Manual tuning using LoopTool on an NVIDIA ARM Denver 2 CPU core using the neon vector extension.
Figure 12. Comparison, in GFLOPS, of XNNPACK, Scripted, and Manual tuning using LoopTool on an Apple M1 CPU core using the neon vector extension.

7.3. Neural Network Workload: Benchmarking vs Traditional Libraries/Frameworks

Figures 7 through 12 present a selection of operations common in modern neural networks. In these figures, benchmarks prefixed with "mm" are single matrix multiplications of varying size. The "mlp" workloads simulate 3-layer multi-layer perceptrons of varying batch size and hidden dimensions; these include ReLU activations after each fully connected layer and no final reduction layer. The "mobilenetV3" workloads are depth-wise separable cells taken from the MobileNetV3 model architecture (Howard_2019_ICCV), run with a fixed spatial dimension of 12x12 and varying channel sizes. All benchmarks are run in C++ against the most recent version of MKL-DNN (mkl), using PyTorch (pytorch) for convenience, as well as the most recent version of XNNPACK (xnnpack) through TFLite (abadi2016tensorflow).

We compare against these libraries the performance of the best schedules generated by LoopStack using the scripted tuning described in Section 6.3. Additionally, we compare the performance of manually optimized schedules produced with the manual tuning TUI described in Section 6.2. The manual tuning was done by the authors, starting from the best schedules generated with the scripted tuning and spending no more than a dozen minutes per schedule.

Our results show that our simple tuning script, implemented in less than 500 lines of Python code, outperforms hand-optimized primitives in most cases; additionally, our manually optimized schedules outperform the hand-tuned primitives in all cases. LoopStack, including the scripted tuner as well as the TUI, is implemented in fewer than 30K lines of code (as measured with the standard cloc utility), whereas MKL-DNN and XNNPACK both have an order of magnitude more lines of code.

7.4. Extending LoopTool to New Hardware

Additionally, we perform a set of benchmarks on Apple's M1 chip using the neon_fp16 extension. Extending LoopStack to support fp16 computation required a set of changes in the LoopNest compiler that are specific to the new target. These changes took less than 10 engineering-days and comprise fewer than 1000 lines of code. In Figure 13 we show the results of scripted and manual tuning on Apple's M1 chip, which is capable of around 210 GFLOPS peak performance. Our results suggest that, with a minimal amount of expert work compared to traditional hand-optimized libraries, we can achieve extremely high utilization of new target hardware.

Figure 13. Results of LoopTool’s guided search, in GFLOPS, and manual optimizations using the TUI on Apple’s M1 chip (capable of around 210 GFLOPS) using neon_fp16 extension.

8. Related Work

Several alternative approaches have been proposed to efficiently execute tensor based computation on hardware. They fall in one of two broad categories: systems for high-performance tensor operations and scheduling of tensor operations.

High Performance Tensor Operations

Intel’s Math Kernel Library (MKL) (mkl) provides high performance kernels for common operations such as matrix-matrix multiplications, matrix-vector multiplications, matrix decompositions, and transformations. OneDNN (previously known as MKL-DNN) (onednn) similarly aims to provide high performance implementations of common neural network operators such as convolution, softmax, normalization, and others, which can be composed to implement performant pipelines. In contrast to these libraries, LoopNest does not maintain separate implementations for an enumeration of different operators and sizes, rather it provides an efficient code generation approach to support operations over arbitrary sizes/shapes. In doing so, LoopNest supports generating code for entire pipelines, rather than composing individual operators, which can be efficient independently but may otherwise result in subpar performance when composed.

Tensor-Contraction Code Generator (TCCG) (tccg) is an approach for performing tensor multiplication. To obtain high performance, TCCG draws on three separate approaches, and integrates them into a single tool that generates C++ code. It frames tensor multiplication as nested loops, as a transposition followed by use of an efficient GEMM implementation, or as loops over GEMMs. In contrast to TCCG, LoopNest is not designed specifically for tensor-tensor multiplication (though it supports it), but rather provides support for additional operations and targets whole-pipeline code generation. While TCCG generates C++ code (which must then be compiled), LoopNest directly generates machine code for x86 and aarch64. Finally, TCCG folds in the search for a schedule, in that it generates varied implementations (using the different techniques underlying the tool), scores them with a performance model, and then compiles the most promising ones.

Polly (grosser2011polly) is a polyhedral optimizer for LLVM code. Polly provides support for high-performance generalized matrix multiplication, which, like LoopNest, allows users to express computation with different multiplication and reduction operators. However, in contrast to LoopNest, Polly is not a standalone code generator and instead is designed to work as an optimization pass in LLVM. As we have shown in our evaluation, compiling code with LLVM is orders of magnitude slower than compiling with LoopNest.

Halide (ragan2013halide) and TVM (tvm) are compilers for general computational pipelines (with operators commonly used in image and tensor processing) and deep learning, respectively. Both allow users to implement high performance machine learning pipelines by using a declarative language to express tensor computations and a separate language for scheduling the execution of those computations. The scheduling languages support important primitives such as splitting, re-ordering, vectorizing, unrolling, or parallelizing loops, among other scheduling operations. After a computation has been scheduled, Halide and TVM use LLVM to generate executable CPU code. As a front-end, LoopTool’s use of two DSLs emulates that of Halide and TVM, and provides support for a subset of the scheduling operations. As an efficient compiler, LoopNest provides the complementary goal of efficient code generation. In particular, a goal of LoopNest is to become a viable CPU back-end alternative for systems such as Halide and TVM, enabling reduced compile times.

Schedule Exploration

Several techniques have been proposed (adams2019learning; value_learning) to automatically schedule Halide pipelines. These approaches represent a schedule as a collection of loop nests, which are recursively optimized from the output stage back to the input. At each step, they consider where to insert the storage and compute associated with that stage and they explore different tilings of the stage. To efficiently perform this exploration, they use search algorithms such as beam search, guided by a performance model. The performance model is a deep learning network trained to predict the runtime of a pipeline given a manually defined set of features describing the computation graph as well as the schedule.

AutoTVM (autotvm) uses simulated annealing to produce candidate schedules, which are obtained by applying multi-level loop tiling, loop reordering, caching sub-results, and annotating loops for unrolling and vectorizing. The simulated annealing procedure relies on a cost model as a scoring function. Their cost model is instantiated to be one of various tree-learning algorithms or a neural network, which embeds the schedule directly and applies a linear layer to predict runtime cost. Similarly to (adams2019learning) and  (value_learning), AutoTVM’s tree regressors rely on extracting lower-level features from the schedule such as memory access information.

Ansor (zheng2020ansor) uses a combination of techniques to automatically schedule TVM programs. Ansor leverages what its authors term "program sketching", where higher level scheduling templates are automatically generated by iteratively applying rule-based expansions, and lower level scheduling decisions are applied as annotations. Once a complete program is sampled (by applying random annotations to a sketch), its schedule is evolved using evolutionary search and a learned cost model to identify promising candidates. These promising candidates can then be benchmarked on hardware to obtain actual performance measurements, allowing Ansor to refine its cost model for following iterations of the search.

In contrast to these systems, LoopTool was designed to provide real-time assistance to experts manually exploring schedules. LoopTool’s feedback (e.g. arithmetic intensity, compile time, GFLOPs achieved) allows developers to quickly validate their intuitions. However, while LoopTool’s design focuses on manual search, we believe LoopTool’s feedback could be used by future automated search tools. In particular, automated search tools could (due to short compile times) evaluate many more candidates through LoopTool than otherwise possible. Underlying the fast responsiveness of LoopTool is LoopNest’s ability to generate high-performance code in orders of magnitude less time than traditional backends. In a similar direction, LoopNest’s simple and fast code generation, which transparently maps the schedule provided to code generated – eschewing traditional approaches using an intermediate representation and multiple compiler passes – could facilitate extracting more meaningful schedule features for learned cost models directly from the compiled code.

9. Conclusion

We presented LoopStack, a lightweight compiler stack designed from the ground up for deep learning workloads. By exclusively targeting programs that can be expressed using a generalized version of Einstein's notation, LoopStack avoids most of the complexity inherent in general purpose compilers. This enables it to compile deep neural networks several orders of magnitude faster than alternative approaches. This also results in a much smaller code base, and therefore a much more lightweight binary (LoopStack is implemented in fewer than 5% of the lines of code of MKL-DNN and XNNPACK, and its binary is 200x smaller than LLVM's), which can be embedded in edge devices such as IoT devices, sensors, and so on. Furthermore, this clarity of purpose made it possible to implement custom optimizations that greatly help the performance of the generated code. Indeed, LoopStack has proven to be extremely competitive against state of the art implementations of deep neural networks on the extensive suite of benchmarks we evaluated it on.

LoopTool also exposes its schedule feedback through a simple API, allowing downstream systems to consume information on a particular schedule and update that information as incremental schedule changes are issued. This model of interaction, paired with its fast compilation and benchmarking, opens up the future possibility of using LoopTool in a reinforcement learning approach, where schedule operations are "actions" that update the problem "state" (loop structure), and the feedback provided by LoopTool functions as a "reward" (e.g. achieved GFLOPs per second).

References