Pure Tensor Program Rewriting via Access Patterns (Representation Pearl)

05/19/2021
by Gus Henry Smith, et al.

Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be pure but high-level, making low-level rewrites to hardware targets inexpressible, or low-level but impure, hampering the use of term rewriting altogether. This paper introduces Glenside, a pure IR whose core abstraction – the access pattern – enables low-level, layout-aware, hardware-centric program rewrites. We demonstrate how term rewriting in Glenside can be used to map program fragments to hardware accelerator invocations and automatically discover classic data layout transformations like im2col. Glenside establishes a new foundation for exploring further term rewriting techniques in optimizing low-level tensor programs.


1. Introduction

Figure 1. Four access patterns, representing different ways a tensor program (or kernel) might access the same 3D tensor. For example, (c) represents accessing a 3D tensor as a vector of 2D matrices.

Machine learning (ML) and other high-performance computing (HPC) applications increasingly rely on specialized hardware accelerators to provide speed and energy efficiency (Jouppi et al., 2017; Krizhevsky et al., 2012; Reuther et al., 2019). This trend has highlighted the need for flexible accelerator support in domain-specific compilers like Halide (Ragan-Kelley et al., 2013a), TVM (Chen et al., 2018a), TensorFlow/MLIR (Abadi et al., 2016; Lattner et al., 2020), and PyTorch (Paszke et al., 2019).

Adding accelerator support to an existing compiler typically uses custom pattern matching to map expensive tensor operations from applications down to accelerator invocations (Yang et al., 2020; Chen and Yu, 2020). Pattern matching often additionally relies on various other transformations to canonicalize intermediate representations (IRs) and massage data layouts into formats matching accelerator requirements (NVIDIA, 2020). Even with these changes, users may need to manually modify their application to help the compiler discover opportunities for dispatching operations to accelerators, such as by changing data types or unrolling loops.

In principle, term rewriting techniques (Baader and Nipkow, 1998) should be able to facilitate many of these transformation and mapping tasks within a compiler. Halide and TVM already rely on extensive rewrite systems for optimizing scalar computations and simplifying loop bounds in order to support further downstream optimizations (Newcomb et al., 2020; Hagedorn et al., 2020).

Unfortunately, existing IRs in compilers for array/tensor programming DSLs tend to present abstraction and granularity mismatches that hamper term rewriting approaches. Term rewriting is most easily applied in pure (side effect–free) IRs that support equational reasoning. At the same time, mapping to accelerators requires considering low-level hardware details like data layout. Existing pure IRs for ML frameworks are used primarily for high-level transformations (e.g., type elaboration and inlining) and do not expose low-level data layout details (Roesch et al., 2019). On the other hand, IRs used for crucial lower-level optimizations like operator fusion must support precise reasoning about memory use, and therefore are typically impure, hampering term rewriting.

To help mitigate such impedance mismatches, we present Glenside (publicly available at https://github.com/gussmith23/glenside), a pure tensor program IR that enables hardware-level term rewriting. Glenside is based on a simple access pattern abstraction that supports expressing and reasoning about data layout transformations via syntactic rewrite rules. When combined with standard arithmetic rewrites for per-tensor-element computations, access patterns enable implementing complex transformations for accelerator support as compositions of simple rewrites.

Tensors are traditionally characterized by their shape, an n-tuple of positive integers indicating the size of each of a tensor's dimensions. Access patterns instead characterize each tensor with two shapes, e.g., ((x), (y, z)), separating the dimensions which are iterated over from the dimensions which are computed on. Figure 1(c) depicts an example where a 3D tensor's first dimension is iterated over and some computation applied to each corresponding 2D matrix.

We demonstrate how Glenside enables implementing representative hardware-level transformations via term rewriting, including mapping computations to systolic arrays (Jouppi et al., 2017) (a common hardware module in ML accelerators) and automatically discovering the im2col data layout transformation (Chellapilla et al., 2006), which enables mapping 2D convolutions to matrix multiplication hardware. In particular, by employing equality saturation (Willsey et al., 2021), these transformations "fall out for free" (i.e., without any carefully crafted rewrite orderings (Whitfield and Soffa, 1997)) from a handful of general rewrites concerning tensor transposition, Cartesian product, dot product, etc., expressed in terms of access patterns.

To summarize, our contributions include:

  • Access patterns, a tensor representation that employs a simple, extended tensor shape type to distinguish iteration and computation dimensions

  • The Glenside IR, a pure compiler IR that facilitates term rewriting to enable support for specialized accelerators

  • A library of generic rewrites over Glenside programs

  • Case studies demonstrating how Glenside enables automatically discovering key transformations for mapping applications to custom accelerators via equality saturation with the egg (Willsey et al., 2021) library.

The rest of the paper is organized as follows: Section 2 provides background and briefly surveys closely related work. Section 3 motivates Glenside via a running example exploring pure matrix multiplication. Section 4 details the design and implementation of Glenside. Section 5 details case studies showing the potential benefits of Glenside’s term rewriting approach to low-level tensor program transformations.

2. Background and Related Work

Glenside is designed to help target tensor hardware accelerators and builds on past work in tensor IRs and term rewriting.

2.1. Machine Learning Accelerators

A variety of accelerators (Jouppi et al., 2017; Chen, Yu-Hsin and Krishna, Tushar and Emer, Joel and Sze, Vivienne, 2016; Moreau et al., 2019; Markidis et al., 2018; Nvidia, 2018) have been developed to provide efficient implementations of tensor operators for ML applications. These devices accelerate tensor operators through hardware parallelism, simultaneously applying related operations across many tensors in the accelerator’s memory (which are often laid out according to custom rules that facilitate hardware optimization). Tensor program compilers must translate expensive application code fragments down to accelerator invocations that adhere to these layout rules, which often involves both (a) higher-level transformations like tensor reshaping to match accelerator size bounds and loop unrolling to expose optimization opportunities, and (b) lower-level transformations like operator fusion and im2col to match accelerator calling conventions and even implement different operations using the same accelerator, e.g., on systolic arrays (Chellapilla et al., 2006; Jia, 2014).

2.2. Tensor IRs and Compilers

Tensor compilers for ML and HPC applications strive to balance clear, high-level operator semantics and support for the low-level optimizations necessary to target specialized accelerators. Halide (Ragan-Kelley et al., 2013b) achieves this balance by separating operator specifications (what is computed) from schedules (how, when, and where each output element is generated). This style of separation has proven highly effective across both application domains and hardware targets; numerous compilers including TVM (Chen et al., 2018b), FireIron (Hagedorn et al., 2020), LIFT (Steuwer et al., 2017), and Accelerate (Chakravarty et al., 2011) follow variations of this strategy.

The specification/schedule separation approach allows the same high-level program (specification) to be flexibly optimized for and mapped to different hardware targets by applying different schedules. From this perspective, schedules represent different rewriting strategies to explore various loop ordering and memory layouts; in LIFT and Accelerate these take the form of functional combinators closely related to Glenside’s approach. As in classic term rewriting, experts must often carefully craft schedules for each target to achieve the best performance and mitigate phase ordering challenges (Whitfield and Soffa, 1997), though recent projects have produced promising results in automatic scheduling (Chen et al., 2018c; Zheng et al., 2020; Anderson et al., 2020).

Other tensor IRs like TACO (Kjolstad et al., 2017), Keops (Charlier et al., 2021), and COMET (Tian et al., 2021) rely on index notation (closely related to "Einstein notation," where reduction indices are implicit) to concisely express tensor operators and simplify optimization by uniformly representing per-output-element computations. These approaches also rely on rewriting passes to generate kernel implementations specialized to tensor sparsity / density, operator combinations arising in the source application, and details of the target hardware. In Section 3 we discuss some of the tradeoffs of these approaches with respect to other rewriting strategies.

Finally, polyhedral compilers (Simbürger et al., 2013) like Tensor Comprehensions (Vasilache et al., 2018) and Tiramisu (Baghdadi et al., 2019) optimize loop-filled programs by modeling loop nests as polyhedra and applying geometric transformations. The polyhedral approach exploits regular loop structure, but is also restricted to geometrically affine transformations. In contrast, term rewriting is neither guided nor restricted by geometric constraints, making these approaches broadly complementary.

2.3. Term Rewriting and Equality Saturation

Term rewriting is a classic program optimization technique (Baader and Nipkow, 1998) that relies on iteratively applying rewrite rules of the form lhs -> rhs: when part of a program matches the pattern lhs under some substitution, the matched fragment is rewritten into the corresponding instantiation of rhs. This approach is ubiquitous, frequently used in both mainstream and DSL compilers to implement features including preprocessing, strength reductions, and peephole optimizations (Garavel et al., 2018).

Classic term rewriting systems where rewrites are applied destructively suffer phase ordering problems (Whitfield and Soffa, 1997): the order in which rewrites are applied can enhance or severely diminish performance. Recent work has shown how program synthesis can help address this challenge in peephole optimizers like Halide’s scalar expression rewriter (Newcomb et al., 2020).

Advances in alternate rewriting techniques like equality saturation (Tate et al., 2009) also mitigate phase ordering by exploiting the e-graph data structure from SMT solvers to repeatedly apply all rewrites simultaneously, thus obviating rule ordering considerations. In particular, the egg library (Willsey et al., 2021) has been used to develop new synthesis and optimization tools across diverse domains (Panchekha et al., 2015; Nandi et al., 2020; Wang et al., 2020), including DSP compiler vectorization (VanHattum et al., 2021) and tensor computation graphs (Yang et al., ).

Glenside provides the first tensor IR amenable to equality saturation by introducing access patterns to provide pure, higher-order tensor kernel combinators that support rank-polymorphism without the need for binding structures like anonymous functions or index notation.

3. From Pure matMul to IR Design Goals

Applying functional programming techniques and term rewriting to tensor IRs requires careful design. For example, we must ensure that operators be compositional with respect to tensor shapes and that the representation support generic rules within the target rewrite engine. To highlight such constraints and motivate access patterns in Glenside, this section illustrates potential pitfalls with a simple matrix multiplication example.

3.1. Pure Matrix Multiplication

We write f64 for the type of 64-bit floats and [A] for vectors over type A. Using this notation, we can specify operators like dot product and 2D matrix transpose as:

dotProd : [f64] * [f64] -> f64
trans2 : [[f64]] -> [[f64]]

Implementing 2D matrix multiplication on inputs P and Q requires computing an output matrix R where R[i, j] = dotProd(P[i], trans2(Q)[j]). The need to compute dotProd for every pair of a row from P and a column from Q suggests map and Cartesian product operators which we might specify with:

map : (A -> B) * [A] -> [B]
cartProd : [A] * [B] -> [A * B]

Naively, we can almost implement matrix multiplication as:

matMul(P, Q) := map(dotProd, cartProd(P, trans2(Q)))

However, the result type will have been flattened to just [f64], making it impossible to compose with other matrix operators that expect [[f64]] inputs.
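
To make the shape problem concrete, the short Rust sketch below models these operators over nested Vecs (our own stand-in for the notation above, not Glenside code) and shows that the naive composition computes the right values but returns a flat vector:

// Vec-based models of dotProd, trans2, and cartProd from Section 3.1 (a sketch).
fn dot_prod(p: &[f64], q: &[f64]) -> f64 {
    p.iter().zip(q).map(|(a, b)| a * b).sum()
}

fn trans2(m: &[Vec<f64>]) -> Vec<Vec<f64>> {
    (0..m[0].len()).map(|j| m.iter().map(|row| row[j]).collect()).collect()
}

fn cart_prod(xs: &[Vec<f64>], ys: &[Vec<f64>]) -> Vec<(Vec<f64>, Vec<f64>)> {
    let mut out = Vec::new();
    for x in xs {
        for y in ys {
            out.push((x.clone(), y.clone()));
        }
    }
    out
}

fn main() {
    let p = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let q = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    // matMul(P, Q) := map(dotProd, cartProd(P, trans2(Q)))
    let flat: Vec<f64> = cart_prod(&p, &trans2(&q))
        .iter()
        .map(|(row, col)| dot_prod(row, col))
        .collect();
    // All the right numbers, but the (2, 2) output structure has been lost.
    assert_eq!(flat, vec![19.0, 22.0, 43.0, 50.0]);
}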

Our first problem is that the cartProd specification above “forgets” the shape of its arguments. We could change this specification to arrange the output as a matrix:

cartProd2D : [A] * [B] -> [[A * B]]

But this result type prevents directly mapping dotProd. (This simple type does not specify how cartProd2D orders its output relative to its input vectors; we assume the order expected for matrix multiplication.) Now the problem is that map only applies a computation by iterating over the first (outermost) dimension of a tensor. If we specialize map to iterate over the second dimension:

mapAt2 : (A -> B) * [[A]] -> [[B]]

then we can implement a compositional matMul operator that correctly produces results of type [[f64]] as:

matMul(P, Q) := mapAt2(dotProd, cartProd2D(P, trans2(Q)))
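
Repeating the same Vec-based sketch with hypothetical cartProd2D and mapAt2 helpers shows the composition now yielding a properly 2D [[f64]]-shaped result:

fn dot_prod(p: &[f64], q: &[f64]) -> f64 {
    p.iter().zip(q).map(|(a, b)| a * b).sum()
}

fn trans2(m: &[Vec<f64>]) -> Vec<Vec<f64>> {
    (0..m[0].len()).map(|j| m.iter().map(|row| row[j]).collect()).collect()
}

// cartProd2D keeps the outer structure: one row of pairs per row of xs.
fn cart_prod_2d(xs: &[Vec<f64>], ys: &[Vec<f64>]) -> Vec<Vec<(Vec<f64>, Vec<f64>)>> {
    xs.iter()
        .map(|x| ys.iter().map(|y| (x.clone(), y.clone())).collect())
        .collect()
}

// mapAt2 applies the function at the second (inner) dimension.
fn map_at2<F: Fn(&(Vec<f64>, Vec<f64>)) -> f64>(
    f: F,
    m: &[Vec<(Vec<f64>, Vec<f64>)>],
) -> Vec<Vec<f64>> {
    m.iter().map(|row| row.iter().map(&f).collect()).collect()
}

fn main() {
    let p = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let q = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    // matMul(P, Q) := mapAt2(dotProd, cartProd2D(P, trans2(Q)))
    let r = map_at2(|(row, col)| dot_prod(row, col), &cart_prod_2d(&p, &trans2(&q)));
    assert_eq!(r, vec![vec![19.0, 22.0], vec![43.0, 50.0]]); // a proper 2x2 result
}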

3.2. Glenside Design Constraints and Goals

This style of pure, higher-order functional program representation enables term rewriting and equational reasoning via rules like:

dotProd(P, Q)        ~>  dotProd(Q, P)
trans2(trans2(P))    ~>  P
map(f, map(g, P))    ~>  map(f . g, P)
mapAt2(f, trans2(P)) ~>  trans2(mapAt2(f, P))

However, some of these rules depend on the shapes of dimension-specific operators aligning. What happens when we need to support higher-dimensional tensors? Without a mechanism to abstract which dimensions of a tensor are being iterated as opposed to computed over, we would have to generate versions of each rule for every combination of dimensions. Worse, these problems do not only affect rewrite rules; they also lead to code blowup just to specify all the variants of tensor kernels that arise in practice.

One strategy to address these challenges is adding support for anonymous functions (“lambdas”), currying, and closures to the tensor program representation. These features can provide sufficient flexibility to handle shape alignment issues that otherwise may require dimension-specific operators like cartProd2D and mapAt2 above. For example, given curried versions of dotProd and map, we could have used such features to implement a curried matMul as:

matMul' P Q := map (λr. map (dotProd r) (trans2 Q)) P

Alternatively, some IRs rely on index notation for even pithier implementations like:

matMul(P,Q)[i,j] := dotProd(P[i], trans2(Q)[j])

Unfortunately, these approaches all rely on some form of name binding which can significantly complicate term rewriting. Rewriting under binders, whether explicitly in the form of lambdas or implicitly with index notation, requires additionally analyzing the potential contexts (what names are bound to) of every subexpression. While it is still technically possible to apply state-of-the-art rewrite engines like egg (Willsey et al., 2021) via explicit variable substitution rules and free variable analyses, we have found that the additional complexity and rewrite search space blowup substantially diminish the potential advantages of term rewriting in such IR designs.

All the above constraints inform Glenside's key design goal: providing an IR that flexibly supports specifying and composing higher-order tensor operators (as map and mapAt2 in Section 3.1 illustrate, an IR can support higher-order operators without necessarily providing lambdas, currying, or closures) over arbitrary dimensions while still enabling high-performance term rewriting techniques like equality saturation. In the rest of this paper, we show how access patterns enable achieving these goals with a focus on applications to mapping application fragments down to specialized hardware accelerators.

4. Glenside

Glenside's access pattern transformers, their inputs, and the access patterns they produce are summarized below:

  • access: takes a tensor (or access pattern) and a non-negative integer n; produces an access pattern whose first n dimensions are access dimensions and whose remaining dimensions are compute dimensions.
  • transpose: takes an access pattern and a list of dimension indices (a permutation of the pattern's dimensions); produces an access pattern with its dimensions reordered according to the indices.
  • cartProd: takes two access patterns with access dimensions (a1, ..., an) and (b1, ..., bm) and identical compute dimensions (c1, ..., cp); produces an access pattern of shape ((a1, ..., an, b1, ..., bm), (2, c1, ..., cp)), whose compute dimensions hold 2-tuples of the inputs' compute dimensions.
  • windows: takes an access pattern, a window shape, and strides; produces an access pattern whose compute dimensions are windows of the given shape, formed at each location determined by the strides.
  • slice: takes an access pattern, a dimension index, and bounds; produces an access pattern identical to the input except that the given dimension is restricted to the given bounds.
  • squeeze: takes an access pattern and the index of a dimension whose size is 1; produces an access pattern with that dimension removed.
  • flatten: takes an access pattern; produces an access pattern whose access dimensions are flattened into a single dimension and whose compute dimensions are flattened into a single dimension.
  • reshape: takes an access pattern and an access pattern shape literal; produces an access pattern of the given shape, provided the numbers of access and compute elements are preserved.
  • pair: takes two access patterns of the same shape; produces an access pattern pairing the corresponding elements of the two inputs.

Table 1. Glenside's access pattern transformers.

This section details Glenside’s implementation, focusing on its core abstraction, access patterns. We use Section 3’s matrix multiplication as a running example throughout.

4.1. Access Patterns

Access patterns encode common tensor IR patterns where some tensor dimensions are iterated over (accessed) while others are computed on. (This is similar to NumPy's concept of universal functions.) Section 3's matMul example iterates over dimension 0 of input P, while computing on dimension 1, effectively viewing P as a 1D vector of 1D vectors.

Access patterns are specified by their shape — a pair of tuples of positive integers (S_A, S_C). An access pattern of shape (S_A, S_C) is, in turn, a tensor T whose shape is given by the concatenation of the two tuples; we refer to S_A and S_C as the access and compute dimensions of T, respectively.

Access patterns represent the view of such a tensor T as a tensor of shape S_A, each of whose elements has shape S_C. For an access pattern of shape (S_A, S_C) where S_A contains n dimensions, we use the syntax (access T n) to represent it in Glenside. For example, if a 2D matrix T has shape (m, n), then the Glenside expression (access T 1) yields an access pattern of shape ((m), (n)).
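
As a minimal illustration (a sketch of the shape rule only, not Glenside's implementation), (access T n) simply splits T's shape at index n:

// Split a tensor shape into (access dimensions, compute dimensions) at index n.
fn access(shape: &[usize], n: usize) -> (Vec<usize>, Vec<usize>) {
    (shape[..n].to_vec(), shape[n..].to_vec())
}

fn main() {
    // A 2D matrix of shape (m, n) accessed at 1: a vector of rows, ((m), (n)).
    assert_eq!(access(&[3, 4], 1), (vec![3], vec![4]));
    // The 3D tensor of Figure 1(c) accessed at 1: a vector of 2D matrices.
    assert_eq!(access(&[2, 3, 4], 1), (vec![2], vec![3, 4]));
}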

The matrix multiplication example from Section 3 directly accesses the rows of P, but uses trans2 to iterate over the columns of Q. Instead of requiring an explicit transpose operator, Glenside provides access pattern transformers.

4.2. Access Pattern Transformers

Access pattern transformers manipulate one or more access patterns to produce a new access pattern, allowing Glenside to support more complex patterns like slicing, transposing, and interleaving. Table 1 lists Glenside’s transformers.

To produce an access pattern representing the columns of Q for matrix multiplication, we employ the transpose transformer. It takes an access pattern and a list of dimension indices, and rearranges the dimensions of the access pattern in the order specified by the indices. If Q has shape (n, o), (transpose (access Q 1) (list 1 0)) produces an access pattern of shape ((o), (n)).

The cartProd transformer takes access patterns of shapes ((a1, ..., an), (c1, ..., cp)) and ((b1, ..., bm), (c1, ..., cp)), respectively, and produces an access pattern of shape ((a1, ..., an, b1, ..., bm), (2, c1, ..., cp)), where the compute shape (2, c1, ..., cp) represents a 2-tuple of the input access patterns' compute dimensions. The access dimensions of the input access patterns are simply concatenated. In the matrix multiplication example, the Cartesian product of the rows of P with the columns of Q is an access pattern of shape ((m, o), (2, n)), where the second shape represents a 2-tuple of a row from P with a column from Q.

We have now nearly re-implemented the matrix multiplication example in Glenside. The final step is to implement the dot product, for which Glenside uses access pattern operators.

4.3. Access Pattern Operators

Operator     Type                      Description
reduceSum    (c1, ..., cn) -> ()       sum values
reduceMax    (c1, ..., cn) -> ()       max of all values
dotProd      (t, c1, ..., cn) -> ()    eltwise mul; sum
Table 2. Glenside's access pattern operators.

Operators are the only Glenside constructs which perform computation. They are invoked only in compute expressions, which map the operator over the compute dimensions of an access pattern. For an input access pattern of shape (S_A, S_C), and an operator mapping inputs of shape S_C to outputs of shape S_C', the result of (compute <operator> <access pattern>) will have shape (S_A, S_C'); that is, a compute expression cannot change the access dimensions of the input access pattern. Table 2 lists the operators in Glenside.

Recall where we are in converting our matrix multiplication example: we have accessed the rows of P and the columns of Q and taken their Cartesian product, resulting in an access pattern of shape ((m, o), (2, n)), and we now need to compute the dot product of these row-column pairs. In Glenside, the dotProd operator (see Table 2) does just that. To compute the dot product over our row-column pairs, we need only apply compute dotProd to our access pattern, producing an access pattern with final shape ((m, o), ()). The entire Glenside specification of matrix multiplication is shown in Figure 2.
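
The Rust sketch below replays this shape walkthrough end to end, assuming P has shape (m, n) and Q has shape (n, o); the shape rules are our own simplified model of the transformers and operators described above, not Glenside's code:

// An access pattern shape: (access dimensions, compute dimensions).
type Ap = (Vec<usize>, Vec<usize>);

fn access(shape: &[usize], k: usize) -> Ap {
    (shape[..k].to_vec(), shape[k..].to_vec())
}

// Permute all dimensions, keeping the access/compute split point (an assumption
// of this sketch).
fn transpose(a: &Ap, perm: &[usize]) -> Ap {
    let all: Vec<usize> = a.0.iter().chain(a.1.iter()).cloned().collect();
    let permuted: Vec<usize> = perm.iter().map(|&i| all[i]).collect();
    (permuted[..a.0.len()].to_vec(), permuted[a.0.len()..].to_vec())
}

// Concatenate access dimensions; compute dimensions become a 2-tuple.
fn cart_prod(a: &Ap, b: &Ap) -> Ap {
    assert_eq!(a.1, b.1, "compute dimensions must match");
    let mut access_dims = a.0.clone();
    access_dims.extend(&b.0);
    let mut compute_dims = vec![2];
    compute_dims.extend(&a.1);
    (access_dims, compute_dims)
}

// dotProd reduces the compute dimensions to a scalar.
fn compute_dot_prod(a: &Ap) -> Ap {
    (a.0.clone(), Vec::new())
}

fn main() {
    let (m, n, o) = (2, 3, 4);
    let rows_of_p = access(&[m, n], 1);                      // ((m), (n))
    let cols_of_q = transpose(&access(&[n, o], 1), &[1, 0]); // ((o), (n))
    let pairs = cart_prod(&rows_of_p, &cols_of_q);           // ((m, o), (2, n))
    assert_eq!(pairs, (vec![m, o], vec![2, n]));
    assert_eq!(compute_dot_prod(&pairs), (vec![m, o], Vec::<usize>::new())); // ((m, o), ())
}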

(transpose                   ; ((N, O, H', W'), ())
 (squeeze                    ; ((N, H', W', O), ())
  (compute dotProd           ; ((N, 1, H', W', O), ())
   (cartProd                 ; ((N, 1, H', W', O), (2, C, Kh, Kw))
    (windows                 ; ((N, 1, H', W'), (C, Kh, Kw))
     (access activations 1)  ; ((N), (C, H, W))
     (shape C Kh Kw)
     (shape 1 Sh Sw))
    (access weights 1)))     ; ((O), (C, Kh, Kw))
  1)
 (list 0 3 1 2))
2D convolution.

(compute dotProd          ; ((m, o), ())
 (cartProd                ; ((m, o), (2, n))
  (access activations 1)  ; ((m), (n))
  (transpose              ; ((o), (n))
   (access weights 1)     ; ((n), (o))
   (list 1 0))))
Matrix multiplication.

(compute reduceMax       ; ((N, C, H', W'), ())
 (windows                ; ((N, C, H', W'), (Kh, Kw))
  (access activations 2) ; ((N, C), (H, W))
  (shape Kh Kw)
  (shape Sh Sw)))
Max pooling.

Figure 2. Common tensor kernels from machine learning expressed in Glenside. Lines containing access patterns are annotated with their access pattern shape. N is batch size; H/W are spatial dimension sizes; H'/W' are the output spatial sizes; C/O are input/output channel counts; Kh/Kw are filter height/width; Sh/Sw are strides. For matrix multiplication, the activations have shape (m, n) and the weights have shape (n, o).

5. Case Studies

To demonstrate Glenside’s utility, we first show how it enables concise specifications of several critical ML kernels (Section 5.1). We then show how Glenside’s pure, binder-free representation enables mapping kernels to an example accelerator via direct application of generic rewrite rules (Section 5.2). Finally, we highlight how Glenside enables the flexible mapping of larger, more diverse kernels to our accelerator, utilizing the power of equality saturation to automatically discover a variety of program transformations. Specifically, we show how Glenside can automatically map convolutions to matrix multiplications (Section 5.3) and automatically map large matrix multiplications into a sequence of smaller matrix multiplications (Section 5.4).

5.1. Representation of Common ML Kernels

Figure 2 lists the Glenside specifications of three common ML kernels: 2D convolution, matrix multiplication, and max pooling. Below, we discuss the specifications of 2D convolution and max pooling; see Section 4 for a description of matrix multiplication.

2D Convolution

2D convolution (conv2d) is a core kernel in deep learning, defined element-by-element over a tensor of activations A and a tensor of weights W, with strides (Sh, Sw), as:

out[n, o, x, y] = sum over (dx, dy, c) of A[n, c, x*Sh + dx, y*Sw + dy] * W[o, c, dx, dy]

where n indexes the output batch, o indexes output channels, x/y index spatial dimensions, dx/dy index the convolutional window spatial dimensions, and c indexes input channels. 2D convolution slides each of the O filters of shape (C, Kh, Kw) through each possible (C, Kh, Kw)-shaped window of the input images. At each of these locations, an elementwise multiplication and reduction sum is computed.
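
For reference, the loop nest below implements this definition directly (a sketch with our own function name, NCHW activations and OCKhKw weights, and no padding):

// conv2d reference: activations (N, C, H, W), weights (O, C, Kh, Kw), strides (sh, sw).
fn conv2d(
    act: &[f64], n: usize, c: usize, h: usize, w: usize,
    wgt: &[f64], o: usize, kh: usize, kw: usize,
    sh: usize, sw: usize,
) -> Vec<f64> {
    let (oh, ow) = ((h - kh) / sh + 1, (w - kw) / sw + 1);
    let mut out = vec![0.0; n * o * oh * ow];
    for bn in 0..n {
        for oc in 0..o {
            for y in 0..oh {
                for x in 0..ow {
                    let mut acc = 0.0;
                    // Elementwise multiply-and-sum over one (C, Kh, Kw) window.
                    for ic in 0..c {
                        for dy in 0..kh {
                            for dx in 0..kw {
                                let a = act[((bn * c + ic) * h + y * sh + dy) * w + x * sw + dx];
                                let f = wgt[((oc * c + ic) * kh + dy) * kw + dx];
                                acc += a * f;
                            }
                        }
                    }
                    out[((bn * o + oc) * oh + y) * ow + x] = acc;
                }
            }
        }
    }
    out
}

fn main() {
    // One 2-channel 4x4 image, three (2, 3, 3) filters, stride 1, all ones.
    let act = vec![1.0; 2 * 4 * 4];
    let wgt = vec![1.0; 3 * 2 * 3 * 3];
    let out = conv2d(&act, 1, 2, 4, 4, &wgt, 3, 3, 3, 1, 1);
    assert_eq!(out.len(), 3 * 2 * 2); // output shape (N, O, H', W') = (1, 3, 2, 2)
    assert!(out.iter().all(|&v| v == 18.0)); // each output sums 2*3*3 ones
}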

The Glenside specification of conv2d is shown in Figure 2. We access the weights as a vector of O filters and the activations as a vector of N images. We leave the filters as they are, but form windows of shape (C, Kh, Kw) over the activations using the windows access pattern transformer (Table 1). This produces an access pattern of shape ((N, 1, H', W'), (C, Kh, Kw)), i.e., a batch of "images" of new spatial shape (1, H', W'), where every location is a window of the original input. Finally, we take the Cartesian product of the filters and the windows, compute their dot product, and squeeze and transpose the output into the correct layout.

Max Pooling

Max pooling, commonly used in ML to condense intermediate activations, is defined as:

out[n, c, x, y] = max over (dx, dy) of activations[n, c, x*Sh + dx, y*Sw + dy]

Max pooling slides a window of shape (Kh, Kw) over all possible locations within the spatial (i.e., H and W) dimensions. At each window location, it reduces the window to a scalar with the max operator. The Glenside specification merely applies reduceMax over each two-dimensional window.
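
A matching reference loop nest for this definition (same sketch conventions and NCHW layout as the conv2d loops above):

// Max pooling reference with (Kh, Kw) windows and (sh, sw) strides.
fn max_pool2d(
    act: &[f64], n: usize, c: usize, h: usize, w: usize,
    kh: usize, kw: usize, sh: usize, sw: usize,
) -> Vec<f64> {
    let (oh, ow) = ((h - kh) / sh + 1, (w - kw) / sw + 1);
    let mut out = vec![f64::NEG_INFINITY; n * c * oh * ow];
    for bn in 0..n {
        for ch in 0..c {
            for y in 0..oh {
                for x in 0..ow {
                    let mut m = f64::NEG_INFINITY;
                    for dy in 0..kh {
                        for dx in 0..kw {
                            m = m.max(act[((bn * c + ch) * h + y * sh + dy) * w + x * sw + dx]);
                        }
                    }
                    out[((bn * c + ch) * oh + y) * ow + x] = m;
                }
            }
        }
    }
    out
}

fn main() {
    // A single 4x4 image holding 0..15; 2x2 windows with stride 2.
    let act: Vec<f64> = (0..16).map(|v| v as f64).collect();
    assert_eq!(max_pool2d(&act, 1, 1, 4, 4, 2, 2, 2, 2), vec![5.0, 7.0, 13.0, 15.0]);
}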

Discussion

Glenside separates the computation from the data access patterns in these kernels while exposing the simplicity of their computation—and the relative complexity of their data access. In all three kernels, the computation can be described with a single operator; most of the specification entails setting up the data access pattern.

Furthermore, Glenside exposes similar structure between kernels; for example, both conv2d and matrix multiplication feature the expression (compute dotProd (cartProd ...)). At their core, these kernels are performing the same computation, but with different patterns of data access. In Section 5.3, we exploit this similarity in structure when mapping kernels to hardware.

These kernels highlight the expressive power of access patterns. Consider the use of windows in conv2d and max pooling. Both kernels form windows differently: conv2d forms three-dimensional windows over the channels, height, and width dimensions, while max pooling forms two-dimensional windows over the height and width. Rather than passing configuration parameters to windows, Glenside attaches this information to the tensors themselves.

(compute dotProd (cartProd ?a0 ?a1)) =>
  (systolicArray ?rows ?cols
    ?a0 (access (transpose ?a1 (list 1 0)) 0))
where ?a0 is of shape ((x), (?rows)) for any x
  and ?a1 is of shape ((?cols), (?rows))
Figure 3. Our rewrite mapping matrix multiplication to a systolic array invocation.
?a => (reshape (flatten ?a) ?shape)
(cartProd (reshape ?a0 ?shape0) (reshape ?a1 ?shape1)) => (reshape (cartProd ?a0 ?a1) ?newShape)
(compute dotProd (reshape ?a ?shape)) => (reshape (compute dotProd ?a) ?newShape)
Figure 4. Rewrites enabling the discovery of the im2col transformation.

5.2. Mapping matMul to Accelerators

Glenside can be used to uncover opportunities to invoke accelerator components. Consider a weight-stationary systolic array, a common matrix multiplication architecture. A weight-stationary systolic array with r rows and c columns takes two lists of length-r vectors (the activations and weights, respectively), pairing each vector from one list with each vector from the other, and computes a dot product over each pair. The second list contains c vectors, while the first can be of any length.
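
The Rust function below is a purely behavioral model of that description (our own sketch, not a hardware description): every activation vector is paired with each of the cols weight vectors, and a dot product is computed per pair, which is exactly a matrix multiplication with the weights stored transposed.

// Behavioral model of a weight-stationary systolic array with `rows` x `cols` cells.
fn systolic_array(
    rows: usize, cols: usize,
    activations: &[Vec<f64>], weights: &[Vec<f64>],
) -> Vec<Vec<f64>> {
    assert!(activations.iter().all(|v| v.len() == rows));
    assert_eq!(weights.len(), cols);
    assert!(weights.iter().all(|v| v.len() == rows));
    activations.iter()
        .map(|a| weights.iter()
            .map(|w| a.iter().zip(w).map(|(x, y)| x * y).sum())
            .collect())
        .collect()
}

fn main() {
    // Two activation vectors of length 2 against three weight vectors: a 2x3 result.
    let acts = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let wgts = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    assert_eq!(
        systolic_array(2, 3, &acts, &wgts),
        vec![vec![1.0, 2.0, 3.0], vec![3.0, 4.0, 7.0]]
    );
}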

Glenside’s purity allows us to implement this hardware mapping task using a term rewriting system, in which we rewrite a matching program pattern to an invocation of our systolic array. Our rewrite is shown in Figure 3, mimicking egg’s rewrite syntax. Tokens starting with a question mark (such as ?a0 in Figure 3) are variables in the pattern, bound by the left-hand side (LHS), and then used on the right-hand side (RHS). egg also allows for conditions on rewrites, which we print below our rewrites.

To design our rewrite, we first must design the LHS to match program patterns that resemble the data access pattern and compute pattern of our systolic array. Glenside is eminently suitable for this task, as it can express exactly the data access and computation pattern we described for the systolic array. Pairing all vectors from one list with all vectors from another and computing the dot product of the pairs is represented as (compute dotProd (cartProd ?a0 ?a1)), binding ?a0 and ?a1 to the input access patterns. We encode the systolic array’s constraints on the input shapes as a condition on the rewrite. Patterns which match the LHS are mapped to the RHS; in this case, we introduce a new systolicArray construct to represent the functioning of our systolic array. The shape of the systolic array is given by the ?rows and ?cols parameters, and the inputs are given as access patterns. Note how we also transform the second access pattern to more accurately convey how the actual systolic array hardware accesses the weight tensor: it reads it all at once (hence, (access ... 0)), and expects it to be laid out in transposed form in memory. This added information—enabled by Glenside’s access patterns—provides richer data layout information, potentially helping future rewrites or code generation steps.
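
To make this concrete, the sketch below encodes a rewrite in the spirit of Figure 3 using egg (assuming egg 0.9). It is a simplified stand-in rather than Glenside's implementation: it uses egg's generic SymbolLang instead of Glenside's own language definition, omits the shape conditions, and drops the ?rows/?cols parameters, which in the real system come from a shape analysis.

use egg::{rewrite as rw, *};

fn main() {
    // A simplified version of Figure 3's rewrite (no shape condition, no ?rows/?cols).
    let rules: Vec<Rewrite<SymbolLang, ()>> = vec![rw!(
        "systolic-array";
        "(compute dotProd (cartProd ?a0 ?a1))" =>
        "(systolicArray ?a0 (access (transpose ?a1 (list 1 0)) 0))"
    )];

    // The matrix multiplication kernel from Figure 2.
    let start: RecExpr<SymbolLang> =
        "(compute dotProd (cartProd (access activations 1)
                                    (transpose (access weights 1) (list 1 0))))"
            .parse()
            .unwrap();

    // Equality saturation applies the rewrite non-destructively.
    let runner = Runner::default().with_expr(&start).run(&rules);

    // The e-graph now also contains a systolic array invocation.
    let target: Pattern<SymbolLang> = "(systolicArray ?x ?rest)".parse().unwrap();
    assert!(!target.search(&runner.egraph).is_empty());
    println!("matMul mapped to a systolic array invocation");
}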

5.3. Flexible Mapping: Discovering im2col

(transpose
 (squeeze
  (reshape           ; ((N, 1, H', W', O), ())
   (compute dotProd  ; ((N*1*H'*W', O), ())
    (cartProd
     (flatten        ; ((N*1*H'*W'), (C*Kh*Kw))
      (windows (access activations 1)
               (shape C Kh Kw) (shape 1 Sh Sw)))
     (flatten        ; ((O), (C*Kh*Kw))
      (access weights 1))))
   ?shape) 1) (list 0 3 1 2))
Figure 5. An im2col-transformed conv2d, after the application of the rewrites in Figure 4 and just before the application of the systolic array rewrite.

The im2col transformation is a data layout optimization which enables computing conv2d on matrix multiplication hardware. The transformation involves instantiating the convolutional windows over the input activations directly in memory (Chellapilla et al., 2006). This leads to data duplication, but the resulting speedup more than offsets that overhead. In this case study, we show how a few general rewrites within Glenside lead to the automatic rederivation of the im2col transformation.
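
For concreteness, the sketch below materializes the im2col matrix for an NCHW activation tensor (our own helper, not Glenside output): each row is one flattened (C, Kh, Kw) window, so conv2d becomes a single matrix multiplication against the flattened filters, at the cost of duplicated window data.

// Build the im2col matrix: one row per output location, one column per window element.
fn im2col(act: &[f64], n: usize, c: usize, h: usize, w: usize,
          kh: usize, kw: usize, sh: usize, sw: usize) -> Vec<Vec<f64>> {
    let (oh, ow) = ((h - kh) / sh + 1, (w - kw) / sw + 1);
    let mut rows = Vec::with_capacity(n * oh * ow);
    for bn in 0..n {
        for y in 0..oh {
            for x in 0..ow {
                // One row = one window, flattened over (channel, dy, dx).
                let mut row = Vec::with_capacity(c * kh * kw);
                for ic in 0..c {
                    for dy in 0..kh {
                        for dx in 0..kw {
                            row.push(act[((bn * c + ic) * h + y * sh + dy) * w + x * sw + dx]);
                        }
                    }
                }
                rows.push(row);
            }
        }
    }
    rows // shape: (N * H' * W') x (C * Kh * Kw); note the data duplication
}

fn main() {
    let act: Vec<f64> = (0..16).map(|v| v as f64).collect(); // a 1x1x4x4 image
    let m = im2col(&act, 1, 1, 4, 4, 3, 3, 1, 1);
    assert_eq!(m.len(), 4);    // 2x2 output locations
    assert_eq!(m[0].len(), 9); // each row is a flattened 3x3 window
    assert_eq!(m[0], vec![0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0]);
}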

Glenside's representation underscores the structural similarity between conv2d and matrix multiplication, reflected also by the shared (compute dotProd (cartProd ...)) between conv2d and the LHS of the systolic array rewrite in Figure 3. Using this rewrite on conv2d would permit mapping it to the systolic array; however, the restrictions on the shape of ?a0 and ?a1 prevent its application. The systolic array rewrite expects an activation access pattern of shape ((x), (rows)) and a weight access pattern of shape ((cols), (rows)), while conv2d operates over access patterns of shape ((N, 1, H', W'), (C, Kh, Kw)) and ((O), (C, Kh, Kw)), respectively. Transforming these access patterns into a lower-dimensional form would enable the systolic array rewrite.

Figure 4 shows the rewrites which enable this transformation. We call the first rewrite an exploratory rewrite as it optimistically matches any access pattern expression. It flattens an access pattern and immediately reshapes it back to its original shape, thus preserving equality (see Table 1). This exploratory rewrite introduces the flattening necessary to convert the higher-dimensional access patterns of conv2d into the access patterns matched by the systolic array rewrite. However, the reshape operator will still need to be moved before we can fire Figure 3's systolic array rewrite. The second and third rewrites in Figure 4 take care of this; they implement composition commutativity of reshape with cartProd and compute dotProd, which "bubble" reshape operators up and out of expressions. These rewrites express general properties of these operators and are not specific to this task.

These three rewrites work in concert to map conv2d to a systolic array. First (since equality saturation explores rewrites non-destructively, the rewriting order here is purely for explanatory purposes), the exploratory rewrite flattens and reshapes all access pattern expressions. This includes the inputs to conv2d's cartProd subexpression, which are flattened to shapes ((N*1*H'*W'), (C*Kh*Kw)) and ((O), (C*Kh*Kw)) and reshaped back to their original shapes. Next, the composition commutativity rewrites for cartProd and compute dotProd fire in sequence, bubbling the reshape up through the cartProd and dotProd expressions (shown in Figure 5). Finally, the systolic array rewrite completes the im2col transform. Glenside's equality saturation based rewrite engine discovers this sequence because the exploratory rewrite fires on every term and no rewrites are missed due to the absence of phase ordering.

This example highlights how, with straightforward, generally applicable rewrites defined over Glenside, equality saturation can emergently discover useful transformations that previously required expert insights to apply.

5.4. Flexible Mapping: matMul Blocking

?a => (concat (slice ?a ?dim ?b0 ?b1)
              (slice ?a ?dim ?b1 ?b2) ?dim)

(cartProd ?a (concat ?b0 ?b1 ?dim)) =>
 (concat (cartProd ?a ?b0) (cartProd ?a ?b1) ?newDim)
if ?dim is an access dimension

(cartProd (concat ?a0 ?a1 ?dim0)
          (concat ?a2 ?a3 ?dim1)) =>
 (concat (cartProd ?a0 ?a2)
         (cartProd ?a1 ?a3) ?newDim)
if ?dim0 and ?dim1 are the same shape dimension

(compute dotProd (concat ?a0 ?a1 ?dim)) =>
 (concat (compute dotProd ?a0)
         (compute dotProd ?a1) ?dim)
if ?dim is an access dimension

(compute dotProd (concat ?a0 ?a1 ?dim)) =>
 (compute reduceSum (pair (compute dotProd ?a0)
                          (compute dotProd ?a1)))
if ?dim is a shape dimension
Figure 6. Rewrites for blocking matMul.
(concat
 (concat
  (compute reduceSum
   (pair
    (compute dotProd
     (cartProd
      (slice
       (slice (access activations 1) 0 0 16) 1 0 16)
      (transpose
       (slice
        (slice (access weights 1) 0 0 16) 1 0 16)
       (list 1 0))))
    (compute dotProd
     (cartProd
      (slice
       (slice (access activations 1) 0 16 32) 1 0 16)
      (transpose
       (slice
        (slice (access weights 1) 0 16 32) 1 0 16)
       (list 1 0)))))) ...
Figure 7. A 32x32 matMul blocked into 16x16 matMuls. Only two of the eight total multiplications are shown.

Equality saturation can also be used with Glenside to emergently discover a matrix multiplication blocking scheme. Matrix multiplication blocking is the common strategy of breaking up a single, large matrix multiplication into smaller multiplications, by multiplying subsets of the input matrices and assembling the results to form the output matrix. This is essential in practice, as systolic arrays are small while matrices in ML and HPC applications can be much larger.
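
The identity being exploited is ordinary blocked matrix multiplication; the Rust sketch below (unrelated to Glenside's implementation) checks on a small 4x4 example that one output block equals the sum of two half-sized multiplications:

// Plain matrix multiplication over Vec-of-rows matrices.
fn mat_mul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, n, o) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; o]; m];
    for i in 0..m { for k in 0..n { for j in 0..o { out[i][j] += a[i][k] * b[k][j]; } } }
    out
}

// Take the sub-matrix with rows [r0, r1) and columns [c0, c1).
fn slice(a: &[Vec<f64>], r0: usize, r1: usize, c0: usize, c1: usize) -> Vec<Vec<f64>> {
    a[r0..r1].iter().map(|row| row[c0..c1].to_vec()).collect()
}

// Elementwise sum of two equally sized matrices.
fn add(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    a.iter().zip(b).map(|(x, y)| x.iter().zip(y).map(|(u, v)| u + v).collect()).collect()
}

fn main() {
    let a: Vec<Vec<f64>> = (0..4).map(|i| (0..4).map(|j| (4 * i + j) as f64).collect()).collect();
    let b: Vec<Vec<f64>> = (0..4).map(|i| (0..4).map(|j| (i * j) as f64).collect()).collect();

    // Top-left block of A*B: A[0..2, 0..2]*B[0..2, 0..2] + A[0..2, 2..4]*B[2..4, 0..2].
    let top_left = add(
        &mat_mul(&slice(&a, 0, 2, 0, 2), &slice(&b, 0, 2, 0, 2)),
        &mat_mul(&slice(&a, 0, 2, 2, 4), &slice(&b, 2, 4, 0, 2)),
    );
    let full = mat_mul(&a, &b);
    assert_eq!(top_left, vec![full[0][0..2].to_vec(), full[1][0..2].to_vec()]);
}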

As in Section 5.3, this transformation follows from an exploratory rewrite and associated "cleanup" rewrites. The exploratory rewrite used for blocking is shown at the top of Figure 6. Given an access pattern, this rewrite slices the access pattern into two pieces along a dimension and then concatenates them back together. The dimension as well as the division strategy are configurable. For this example, we assume for simplicity that we run this rewrite on every available dimension, that we divide each dimension perfectly in half, and that all dimensions are powers of 2 in size. Figure 6 gives rewrites for bubbling the introduced concat operators up through the expression, namely the compositional commutativity of concat with cartProd and compute dotProd. Starting from the matrix multiplication in Figure 2, assuming both inputs have shape (32, 32), the exploratory rewrite first slices and concatenates the access patterns at the input of cartProd. Then, using the commutativity rewrites, the resulting concats are bubbled up to produce the final expression in Figure 7. The effect of these rewrites is that the single matMul becomes eight separate matMuls, which are summed and concatenated to form the full output matrix. This case study demonstrates yet again that Glenside's expressiveness allows a small set of rewrites to produce interesting and useful emergent transformations.

6. Conclusion

In this paper, we proposed access patterns as an abstraction to enable equality saturation style term rewriting for low-level tensor program mapping to hardware accelerators. Crucially, access patterns support specifying and composing higher-order, arbitrary dimension tensor operators without the need for binding structures like anonymous functions or index notation. We demonstrated the potential utility of access patterns in the Glenside IR through case studies showing how rewrites in Glenside can automatically uncover common layout transformations like im2col used for accelerator mapping. We are excited for the community to join in further exploring the potential applications of access patterns and to build additional optimizations on Glenside’s foundations.

Acknowledgements.
This work was sponsored by the Real Time Machine Learning (RTML) DARPA project (https://www.darpa.mil/program/real-time-machine-learning), and the Applications Driving Architectures (ADA) (https://adacenter.org/) and Center for Research in Intelligent Storage and Processing in Memory (CRISP) (https://crisp.engineering.virginia.edu/) JUMP Research Centers, co-sponsored by SRC and DARPA. We thank Chandrakana Nandi for her extensive and enthusiastic editing. We also thank Sorawee Porncharoenwase, Luis Vega, and Max Willsey for their helpful comments.

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. External Links: Link Cited by: §1.
  • L. Anderson, A. Adams, K. Ma, T. Li, and J. Ragan-Kelley (2020) Learning to schedule halide pipelines for the gpu. External Links: 2012.07145 Cited by: §2.2.
  • F. Baader and T. Nipkow (1998) Term rewriting and all that. Cambridge University Press. External Links: Document Cited by: §1, §2.3.
  • R. Baghdadi, J. Ray, M. B. Romdhane, E. D. Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, and S. Amarasinghe (2019) Tiramisu: a polyhedral compiler for expressing fast and portable code. 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). External Links: ISBN 9781728114361, Link, Document Cited by: §2.2.
  • M. M. T. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover (2011) Accelerating Haskell array codes with multicore GPUs. Cited by: §2.2.
  • B. Charlier, J. Feydy, J. A. Glaunès, F. Collin, and G. Durif (2021) Kernel operations on the gpu, with autodiff, without memory overflows. Journal of Machine Learning Research 22 (74), pp. 1–6. External Links: Link Cited by: §2.2.
  • K. Chellapilla, S. Puri, and P. Simard (2006) High Performance Convolutional Neural Networks for Document Processing. La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §1, §2.1, §5.3.
  • T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy (2018a) TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 578–594. External Links: ISBN 978-1-931971-47-8, Link Cited by: §1.
  • T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. (2018b) tvm: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594. Cited by: §2.2.
  • T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy (2018c) Learning to optimize tensor programs. USA, pp. 3393–3404. Cited by: §2.2.
  • Z. Chen and C. Yu (2020) How to bring your own codegen to tvm. Apache Foundation. Note: https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm Cited by: §1.
  • Chen, Yu-Hsin and Krishna, Tushar and Emer, Joel and Sze, Vivienne (2016) Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. pp. 262–263. Cited by: §2.1.
  • H. Garavel, M. Tabikh, and I. Arrada (2018) Benchmarking Implementations of Term Rewriting and Pattern Matching in Algebraic, Functional, and Object-Oriented Languages - The 4th Rewrite Engines Competition. Thessaloniki, Greece. External Links: Link Cited by: §2.3.
  • B. Hagedorn, A. S. Elliott, H. Barthels, R. Bodik, and V. Grover (2020) Fireiron: a scheduling language for high-performance linear algebra on gpus. External Links: 2003.06324 Cited by: §2.2.
  • B. Hagedorn, J. Lenfers, T. Kœhler, X. Qin, S. Gorlatch, and M. Steuwer (2020) Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies. Proc. ACM Program. Lang. 4 (ICFP). External Links: Link, Document Cited by: §1.
  • Y. Jia (2014) Learning semantic image representations at a large scale. Ph.D. Thesis, EECS Department, University of California, Berkeley. External Links: Link Cited by: §2.1.
  • N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon (2017) In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News 45 (2), pp. 1–12. External Links: ISSN 0163-5964, Link, Document Cited by: §1, §1, §2.1.
  • F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe (2017) The tensor algebra compiler. Proc. ACM Program. Lang. 1 (OOPSLA), pp. 77:1–77:29. External Links: ISSN 2475-1421, Link, Document Cited by: §2.2.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Cited by: §1.
  • C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko (2020) MLIR: a compiler infrastructure for the end of moore’s law. External Links: 2002.11054 Cited by: §1.
  • S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter (2018) NVIDIA tensor core programmability, performance & precision. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). External Links: ISBN 9781538655559, Link, Document Cited by: §2.1.
  • T. Moreau, T. Chen, L. Vega, J. Roesch, L. Zheng, E. Yan, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy (2019) A hardware-software blueprint for flexible deep learning specialization. IEEE Micro (), pp. 1–1. External Links: Document, ISSN 0272-1732 Cited by: §2.1.
  • C. Nandi, M. Willsey, A. Anderson, J. R. Wilcox, E. Darulova, D. Grossman, and Z. Tatlock (2020) Synthesizing structured cad models with equality saturation and inverse transformations. New York, NY, USA, pp. 31–44. External Links: ISBN 9781450376136, Link, Document Cited by: §2.3.
  • J. L. Newcomb, A. Adams, S. Johnson, R. Bodik, and S. Kamil (2020) Verifying and improving halide’s term rewriting system with program synthesis. Proc. ACM Program. Lang. 4 (OOPSLA). External Links: Link, Document Cited by: §1, §2.3.
  • Nvidia (2018) The nvidia deep learning accelerator (nvdla). Note: http://nvdla.org/ Cited by: §2.1.
  • NVIDIA (2020) Convolutional layers user guide. Nvidia. Note: https://docs.nvidia.com/deeplearning/performance/dl-performance-convolutional/index.html Cited by: §1.
  • P. Panchekha, A. Sanchez-Stern, J. R. Wilcox, and Z. Tatlock (2015) Automatically improving accuracy for floating point expressions. 50 (6). External Links: ISSN 0362-1340, Link, Document Cited by: §2.3.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. External Links: 1912.01703, Link Cited by: §1.
  • S. Qin, L. Klaaßen, U. Gallersdörfer, C. Stoll, and D. Zhang (2020) Bitcoin’s future carbon footprint. External Links: 2011.02612 Cited by: Broader Impact Statement.
  • J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe (2013a) Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, New York, NY, USA, pp. 519–530. External Links: ISBN 978-1-4503-2014-6, Link, Document Cited by: §1.
  • J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe (2013b) Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices 48 (6), pp. 519–530. Cited by: §2.2.
  • A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner (2019) Survey and benchmarking of machine learning accelerators. 2019 IEEE High Performance Extreme Computing Conference (HPEC). External Links: ISBN 9781728150208, Link, Document Cited by: §1.
  • J. Roesch, S. Lyubomirsky, M. Kirisame, J. Pollock, L. Weber, Z. Jiang, T. Chen, T. Moreau, and Z. Tatlock (2019) Relay: A high-level IR for deep learning. CoRR abs/1904.08368. External Links: Link, 1904.08368 Cited by: §1.
  • A. Simbürger, S. Apel, A. Größlinger, and C. Lengauer (2013) The potential of polyhedral optimization: an empirical study. pp. 508–518. External Links: Document Cited by: §2.2.
  • M. Steuwer, T. Remmelg, and C. Dubach (2017) Lift: a functional data-parallel ir for high-performance gpu code generation. In Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO ’17, pp. 74–85. External Links: ISBN 9781509049318 Cited by: §2.2.
  • R. Tate, M. Stepp, Z. Tatlock, and S. Lerner (2009) Equality saturation: a new approach to optimization. pp. 264–276. External Links: ISBN 9781605583792, Link, Document Cited by: §2.3.
  • R. Tian, L. Guo, J. Li, B. Ren, and G. Kestor (2021) A high-performance sparse tensor algebra compiler in multi-level ir. External Links: 2102.05187 Cited by: §2.2.
  • A. VanHattum, R. Nigam, V. T. Lee, J. Bornholt, and A. Sampson (2021) Vectorization for digital signal processors via equality saturation. Cited by: §2.3.
  • N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen (2018) Tensor comprehensions: framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730. Cited by: §2.2.
  • Y. R. Wang, S. Hutchison, D. Suciu, B. Howe, and J. Leang (2020) SPORES: sum-product optimization via relational equality saturation for large scale linear algebra. Proc. VLDB Endow. 13 (11), pp. 1919–1932. External Links: Link Cited by: §2.3.
  • D. L. Whitfield and M. L. Soffa (1997) An approach for exploring code improving transformations. ACM Trans. Program. Lang. Syst. 19 (6), pp. 1053–1084. External Links: ISSN 0164-0925, Link, Document Cited by: §1, §2.2, §2.3.
  • M. Willsey, C. Nandi, Y. R. Wang, O. Flatt, Z. Tatlock, and P. Panchekha (2021) Egg: fast and extensible equality saturation. Proceedings of the ACM on Programming Languages 5 (POPL), pp. 1–29. Cited by: 4th item, §1, §2.3, §3.2.
  • X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, C. Kozyrakis, and M. Horowitz (2020) Interstellar: using halide’s scheduling language to analyze dnn accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’20, New York, NY, USA, pp. 369–383. External Links: ISBN 9781450371025, Link, Document Cited by: §1.
  • [44] Y. Yang, P. M. Phothilimtha, Y. R. Wang, M. Willsey, S. Roy, and J. Pienaar Equality saturation for tensor graph superoptimization. arXiv preprint arXiv:2101.01332. Cited by: §2.3.
  • L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, J. E. Gonzalez, and I. Stoica (2020) Ansor: generating high-performance tensor programs for deep learning. Banff, Canada, pp. 863–879. External Links: ISBN 978-1-939133-19-9, Link Cited by: §2.2.
  • S. Zuboff (2018) The age of surveillance capitalism: the fight for a human future at the new frontier of power. 1st edition. External Links: ISBN 1610395697 Cited by: Broader Impact Statement.

Broader Impact Statement

The ability to develop effective compiler support for specialized hardware accelerators in ML, and HPC more broadly, has generally been restricted to a handful of elite, well-resourced teams. This restriction slows hardware development and creates barriers to entry for teams in less privileged environments to contribute to and help guide the development of the field.

We believe that the access pattern abstraction and Glenside’s approach to term rewriting for improving compiler support for custom accelerators will help advance both near-term practical and longer-term principled approaches to building flexible compiler infrastructure. In turn, we hope that this infrastructure will help contribute to a broader, more diverse, and more inclusive community of folks working together to build efficient technologies for social good.

Of course, all technology is political and it can be difficult to anticipate how future researchers and practitioners may apply Glenside. While the most obvious consequence of more efficient hardware utilization is better performance for users and lower environmental impact via decreased power consumption, it is also possible that access patterns and Glenside would enable the rapid obsoleting of current hardware platforms and therefore contribute to harmful electronic waste. This work could also stimulate demand for hardware customization by removing compiler development–related overheads and ultimately lead to higher negative environmental impact similar to the situation with respect to custom ASICs for bitcoin mining (Qin et al., 2020).

Also, any improvement to ML efficiency or applicability may contribute to economic and privacy concerns arising from increased technology company monopolization as discussed in Zuboff’s The Age of Surveillance Capitalism (Zuboff, 2018).