Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

04/27/2018
by Riyadh Baghdadi, et al.
Google, Adobe, MIT, Politecnico di Milano

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and compare them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.



I Introduction

Generating efficient code for high performance systems is becoming increasingly difficult as these architectures grow in complexity and diversity. Obtaining the best performance requires complex code and data layout transformations, management of complex memory hierarchies, and efficient data communication and synchronization.

For example, consider generalized matrix multiplication (gemm), which computes C = αAB + βC and is a building block of numerous algorithms, including simulations and convolutional neural networks. Highly-tuned implementations require fusing the multiplication and addition loops, as well as applying two-level tiling, vectorization, loop unrolling, array packing [Goto:2008:AHM:1356052.1356053], register blocking, and data prefetching. Furthermore, tuned implementations separate partial tiles from full tiles, since partial tiles cannot fully benefit from the same optimizations. High performance GPU implementations require even more optimizations, including coalescing memory accesses, managing data movement between global, shared, and register memory, and inserting synchronization primitives. Automatically generating such complex code is still beyond the capabilities of state-of-the-art compilers. The importance of kernels such as gemm motivates vendors to release immensely complex hand-optimized libraries for these kernels. However, for most users, obtaining this level of performance for their own code is challenging, since the effort required to explore the space of possible implementations is intractable when hand-coding complicated code transformations.
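To make these optimizations concrete, the following is a minimal C++ sketch (not taken from any library) of the loop structure that a tuned single-precision gemm converges to after two-level tiling; the tile sizes are illustrative, and array packing, register blocking, prefetching, and the separation of partial tiles are omitted for brevity.

#include <algorithm>

// Computes C = alpha*A*B + beta*C for row-major N x N matrices.
// T1 and T2 are illustrative cache- and register-level tile sizes.
void sgemm_tiled(int N, float alpha, float beta,
                 const float *A, const float *B, float *C) {
  for (int i = 0; i < N * N; ++i) C[i] *= beta;                 // apply beta once
  const int T1 = 256, T2 = 32;
  for (int kk = 0; kk < N; kk += T1)                            // first level of tiling
    for (int jj = 0; jj < N; jj += T1)
      for (int k0 = kk; k0 < std::min(kk + T1, N); k0 += T2)    // second level of tiling
        for (int j0 = jj; j0 < std::min(jj + T1, N); j0 += T2)
          for (int i = 0; i < N; ++i)
            for (int k = k0; k < std::min(k0 + T2, N); ++k) {
              float a = alpha * A[i * N + k];
              for (int j = j0; j < std::min(j0 + T2, N); ++j)   // vectorizable inner loop
                C[i * N + j] += a * B[k * N + j];
            }
}

A production implementation would additionally copy tiles of A and B into contiguous packed buffers and handle the border (partial) tiles in separate loop nests, which is exactly the kind of code that is tedious to write and maintain by hand.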

Fig. 1: Normalized execution times of code generated for sgemm on CPU (left) and GPU (right).

Previous work using the polyhedral model has shown success in implementing complex iteration space transformations [wolf1991loop, bondhugula_practical_2008, trifunovic_graphite_2010, polly, Vasilache2018TensorCF, pouchet.11.popl], data locality optimizations [Iri88, tobias_hexagonal_cgo13], and memory management optimizations [feautrier_array_1988, thies_unified_2001, lefebvre_automatic_1998, Qui00, Darte_contraction_2005]. Although polyhedral compilers can represent these program and data transformations, they still do not successfully select transformations that result in the best performance. Currently, these compilers do not match the performance of hand-optimized kernels for algorithms such as gemm. The blue bars in Figure 1 show the performance of state-of-the-art polyhedral compilers for gemm compared to the Intel MKL [mkl] and Nvidia cuBLAS [cublas] libraries. Fully-automatic polyhedral compilers such as Polly [polly] and Pluto [bondhugula_practical_2008] improve productivity, but do not obtain the desired level of performance since their search techniques consider only a subset of the necessary optimizations and rely on less accurate machine models, leading the compiler to make suboptimal decisions. Other polyhedral frameworks, such as AlphaZ [yuki2012alphaz] and CHiLL [chill], eschew full automation and instead expose a scheduling language that enables users to productively explore the space of possible transformations. While these frameworks achieve better performance, their scheduling languages are not designed to target distributed systems. For example, they do not allow the user to partition computations, send data across nodes, or insert required synchronization.

In this paper, we introduce Tiramisu (http://tiramisu-compiler.org/), a polyhedral compiler with a scheduling language featuring novel commands for targeting multiple high performance architectures. Tiramisu is well-suited for implementing data parallel algorithms (loop nests manipulating arrays). It takes a high level representation of the program (a pure algorithm and a set of scheduling commands), applies the necessary code transformations, and generates highly-optimized code for the target architecture. In addition to scheduling commands for loop and data-layout transformations, the Tiramisu scheduling language introduces novel commands for explicit communication and synchronization, and for mapping buffers to different memory hierarchies. In order to simplify the implementation of the scheduling language, Tiramisu explicitly divides the intermediate representation into four layers designed to hide the complexity and large variety of execution platforms by separating the architecture-independent algorithm from code transformations, data layout, and communication. Tiramisu targets multicore CPUs, CUDA GPUs, distributed architectures, and FPGAs. This paper presents the first three backends, while Del Sozzo et al. [8445108] describe the FPGA backend.

The use of a scheduling language has been shown effective for generating efficient code by multiple compilers, including CHiLL, AlphaZ, and Halide [halide_12, DBLP:conf/pldi/Ragan-KelleyBAPDA13]. In comparison with Halide in particular, not only does Tiramisu introduce novel scheduling extensions, it also fundamentally differs in that it relies on the expressive polyhedral representation instead of the interval-based representation used by Halide. This allows Tiramisu to naturally express non-rectangular iteration spaces, to support programs with cyclic data-flow graphs, and to apply any affine transformation (including iteration space skewing), none of which is naturally expressible in Halide.

This paper makes the following contributions:

  • We introduce a polyhedral compiler with a scheduling language that features novel commands for controlling data communication and synchronization and for mapping buffers to different memory hierarchies. These extensions enable targeting multiple high-performance architectures, including multicore CPUs, GPUs, and distributed machines.

  • We explicitly divide the intermediate representation into four layers to simplify the implementation of the scheduling language. The four-layer IR separates the algorithm from code transformations and data-layout transformations, allowing for portability and simplifying the composition of architecture-specific lowering transformations.

  • We evaluate Tiramisu on a set of deep learning and linear algebra kernels and show that Tiramisu generates efficient code that outperforms Intel MKL. We also evaluate Tiramisu on a set of image processing benchmarks and show that Tiramisu matches or outperforms state-of-the-art compilers on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.

II Related Work

Polyhedral compilers with automatic scheduling

Polyhedral compilers such as PENCIL [pencil, pencil_paper], Pluto [bondhugula_practical_2008], Polly [polly], Tensor Comprehensions [Vasilache2018TensorCF], and PolyMage [Mullapudi:2015:PAO:2786763.2694364] are fully automatic. Some of them are designed for specific domains (such as Tensor Comprehensions and PolyMage), while Pluto, PENCIL, and Polly are more general. While fully automatic compilers provide productivity, they may not always obtain the best performance. This suboptimal performance is due to several reasons: first, these compilers do not implement some key optimizations such as array packing [Goto:2008:AHM:1356052.1356053], register blocking, data prefetching, and asynchronous communication (which are all supported by Tiramisu); second, they do not have a precise cost-model to decide which optimizations are profitable. For example, the Pluto [bondhugula_practical_2008] automatic scheduling algorithm (used in Pluto, PENCIL and Polly) tries to minimize the distance between producer and consumer statements while maximizing outermost parallelism, but it does not consider data layout, redundant computations, or the complexity of the control of the generated code. Instead of fully automatic scheduling, Tiramisu relies on a set of scheduling commands, giving the user full control over scheduling.

Polyhedral frameworks proposed by Amarasinghe et al. [Amarasinghe:1993:COC:173262.155102] and Bondhugula et al. [6877466] address the problem of automatic code generation for distributed systems. Instead of being fully automatic, Tiramisu relies on the user to provide scheduling commands to control choices in the generated code (synchronous/asynchronous communication, the granularity of communication, buffer sizes, when to send and receive, cost of communication versus re-computation, etc.).

Polyhedral compilers with a scheduling language

AlphaZ [yuki2012alphaz], CHiLL [chill, Hall2010] and URUK [Girbal2006] are polyhedral frameworks developed to allow users to express high-level transformations using scheduling commands. Since these frameworks are polyhedral, they can express any affine transformation. However, their scheduling languages do not target distributed architectures. In contrast, Tiramisu features scheduling commands for partitioning computations (for distributed systems) and for synchronizing and distributing data across nodes. The first four columns of Table I compare Tiramisu with three representative polyhedral frameworks.

Feature Tiramisu AlphaZ PENCIL Pluto Halide
CPU code generation Yes Yes Yes Yes Yes
GPU code generation Yes No Yes Yes Yes
Distributed CPU code generation Yes No No Yes Yes
Distributed GPU code generation Yes No No No No
Support all affine loop transformations Yes Yes Yes Yes No
Commands for loop transformations Yes Yes No No Yes
Commands for optimizing data accesses Yes Yes No No Yes
Commands for communication Yes No No No No
Commands for memory hierarchies Yes No No No Limited
Expressing cyclic data-flow graphs Yes Yes Yes Yes No
Non-rectangular iteration spaces Yes Yes Yes Yes Limited
Exact dependence analysis Yes Yes Yes Yes No
Compile-time set emptiness check Yes Yes Yes Yes No
Implement parametric tiling No Yes No No Yes
TABLE I: Comparison between different frameworks.
Non-polyhedral compilers with a scheduling language

Halide [halide_12] is an image processing DSL with a scheduling language that uses intervals to represent iteration spaces instead of the polyhedral model. This limits the expressiveness of Halide. For example, unlike Tiramisu, Halide cannot naturally represent non-rectangular iteration spaces, and this is the reason why distributed Halide [denniston2016distributed] over-approximates the amount of data to communicate (send and receive) when generating distributed code. This also makes some Halide passes over-approximate non-rectangular iteration spaces, potentially leading to less efficient code (for example, it prevents Halide from performing precise bounds inference for non-rectangular iteration spaces). The use of intervals also prevents Halide from performing many complex affine transformations, such as iteration space skewing.

Halide does not have dependence analysis and thus relies on conservative rules to determine whether a schedule is legal. For example, Halide does not allow the fusion of two loops (using the compute_with command) if the second loop reads a value produced by the first loop. While this rule avoids illegal fusion, it prevents fusing many legal cases, which may lead to suboptimal performance. Halide also assumes the program has an acyclic dataflow graph in order to simplify checking the legality of a schedule. This prevents users from expressing many programs with cyclic dataflow. It is possible in some cases to work around the above restrictions, but such work-around methods are not general. Tiramisu avoids over-conservative constraints by relying on dependence analysis to check for the correctness of code transformations, enabling more possible schedules. Table I summarizes the comparison between Tiramisu and Halide.

Vocke et al. [Vocke:2017:EHI:3132652.3106343] extend Halide to target DSPs, and add scheduling commands such as store_in to specify in which memory hierarchy data should be stored. TVM [tvm] is another system that shares many similarities with Halide. It uses a modified form of the Halide IR internally. Since TVM is also a non-polyhedral compiler, the differences between Halide and Tiramisu that are due to the use of polyhedral model also apply to TVM.

POET [Yi:2007ay] is a system that uses an XML-based description of code and transformation behavior to parametrize loop transformations. It uses syntactic transformations, which are less general than the polyhedral transformations used in Tiramisu. GraphIt [Zhang:2018:GHG:3288538.3276491] is another compiler that has a scheduling language but that is mainly designed for the area of graph applications.

Other Compilers

Delite [chafi_domain-specific_2011] is a generic framework for building DSL compilers. It exposes several parallel computation patterns that DSLs can use to express parallelism. NOVA [Collins:2014:NFL:2627373.2627375] and Lift [Steuwer:2017:LFD:3049832.3049841] are IRs for DSL compilers. They are functional languages that rely on a suite of higher-order functions such as map, reduce, and scan to express parallelism. Tiramisu is complementary to these frameworks as Tiramisu allows complex affine transformations that are easier to express in the polyhedral model.

III The Tiramisu Embedded DSL

Tiramisu is a domain-specific language (DSL) embedded in C++. It provides a C++ API that allows users to write a high level, architecture-independent algorithm and a set of scheduling commands that guide code generation. Input Tiramisu code can either be written directly by a programmer, or generated by a different DSL compiler. Tiramisu then constructs a high level intermediate representation (IR), applies the user-specified loop and data-layout transformations, and generates optimized backend code that takes advantage of target hardware features (LLVM IR for multicores and distributed machines and LLVM IR + CUDA for GPUs).

III-A Scope of Tiramisu

Tiramisu is designed for expressing data parallel algorithms, especially those that operate over dense arrays using loop nests and sequences of statements. These algorithms are often found in the areas of image processing, deep learning, dense linear algebra, tensor operations and stencil computations.

1 // Declare the iterators i, j and c.
2 Var i(0, N-2), j(0, M-2), c(0, 3);
3
4 Computation bx(i, j, c), by(i, j, c);
5
6 // Algorithm.
7 bx(i,j,c) = (in(i,j,c)+in(i,j+1,c)+in(i,j+2,c))/3;
8 by(i,j,c) = (bx(i,j,c)+bx(i+1,j,c)+bx(i+2,j,c))/3;
Fig. 2: Blur algorithm without scheduling commands.

III-B Specifying the Algorithm

The first part of a Tiramisu program specifies the algorithm without specifying loop optimizations (when and where the computations occur), data layout (how data should be stored in memory), or communication. At this level there is no notion of data location; rather, values are communicated via explicit producer-consumer relationships.

The algorithm is a pure function that has inputs, outputs, and is composed of a sequence of computations. A computation is used to represent a statement in Tiramisu. Flow-control around computations is restricted to for loops and conditionals. While loops, early exits, and GOTOs cannot be expressed. To declare a computation, the user provides both the iteration domain of the computation and the expression to compute.

Figure 2 shows a blur algorithm written in Tiramisu. This algorithm declares two computations, bx and by. The first computation, bx, computes a horizontal blur of the input, while the second computation, by, computes the final blur by averaging the output of the first stage. The iterators i, j, and c in line 2 define the iteration domain of bx and by (for brevity we ignore boundary conditions). The algorithm is semantically equivalent to the following code.

for (i in 0..N-2)
 for (j in 0..M-2)
  for (c in 0..3)
   bx[i][j][c] =
        (in[i][j][c]+in[i][j+1][c]+in[i][j+2][c])/3
for (i in 0..N-2)
 for (j in 0..M-2)
  for (c in 0..3)
   by[i][j][c] =
        (bx[i][j][c]+bx[i+1][j][c]+bx[i+2][j][c])/3

III-C Scheduling Commands

Tiramisu provides a set of high-level scheduling commands for common optimizations; Table II shows some examples. There are four types of scheduling commands (a short sketch composing them follows the list):

  • Commands for loop nest transformations: these commands include common affine transformations such as loop tiling, splitting, shifting, etc. For example, applying 32×32 loop tiling to a computation C can be done by calling
    C.tile(i,j,32,32,i0,j0,i1,j1) where i and j are the original loop iterators and i0, j0, i1, and j1 are the names of the loop iterators after tiling.

  • Commands for mapping loop levels to hardware: examples of these include loop parallelization, vectorization, and mapping loop levels to GPU block or thread dimensions. For example, calling C.vectorize(j, 4) splits the j loop by a factor of 4 and maps the inner loop to vector lanes.

  • Commands for manipulating data: these include (1) allocating arrays; (2) setting array properties including whether the array is stored in host, device, shared, or local memory (GPU); (3) copying data (between levels of memory hierarchies or between nodes); and (4) setting array accesses. In most cases, users need only to use high level commands for data manipulation. If the high level commands are not expressive enough, the user can use the more expressive low level commands.

  • Commands for adding synchronization operations: the user can either declare a barrier or use the send and receive functions for point-to-point synchronization.
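The following sketch, which assumes the command signatures listed in Table II and the blur algorithm of Figure 2, shows how commands from the four categories compose; the tile sizes, the vector length, and the p_float32 element-type tag are illustrative assumptions rather than values prescribed by the algorithm:

// (1) Loop nest transformation: tile by with 32x32 tiles.
Var i0, j0, i1, j1;
by.tile(i, j, 32, 32, i0, j0, i1, j1);

// (2) Mapping loop levels to hardware: parallelize the outer tile
//     loop and vectorize the innermost loop by a factor of 8.
by.parallelize(i0);
by.vectorize(j1, 8);

// (3) Data manipulation: store the result of by(i,j,c) in b[i][j][c].
Buffer b({N-2, M-2, 3}, p_float32);
by.store_in(b, {i, j, c});

// (4) Synchronization (purely illustrative here): an operation that
//     places a barrier at loop level j0 of by; like any operation,
//     it can then be scheduled.
barrier_at(by, j0);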

We assume that C and P are computations, b is a buffer, and i and j are loop iterators.
Commands for loop nest transformations
Command Description
C.tile( i,j,t1,t2,
i0,j0,i1,j1)
Tile the loop levels (i, j) of the computation C
by t1 × t2. The names of the new loop levels
are (i0, j0, i1, j1) where i0 is the outermost
loop level and j1 is the innermost.
C.interchange(i, j) Interchange the i and j loop levels of C.
C.shift(i, s) Loop shifting (shift the loop level i by s
iterations).
C.split(i, s, i0, i1) Split the loop level i by s. (i0, i1) are the new
loop levels.
P.compute_at(C, j) Compute the computation P in the loop nest of C
at loop level j. This might introduce redundant
computations.
C.unroll(i, v) Unroll the loop level i by a factor v.
C.after(B, i) Indicate that C should be ordered after B at the
loop level i (they have the same order in all
the loop levels above i).
C.inline() Inline C in all of its consumers.
C.set_schedule() Transform the iteration domain of C using an
affine relation (a map to transform Layer I to
II expressed in the ISL syntax).
Commands for mapping loop levels to hardware
C.parallelize(i) Parallelize the i loop level for execution on a
shared memory system.
C.vectorize(i, v) Vectorize the loop level i by a vector size v.
C.gpu(i0, i1, i2, i3) Mark the loop levels i0, i1, i2 and i3 to be
executed on GPU. (i0, i1) are mapped to block
IDs and (i2, i3) to thread IDs.
C.tile_gpu(i0,i1,t1,t2) Tile the loops i0 and i1 by t1 × t2 and map
them to GPU.
C.distribute(i) Parallelize the loop level i for execution on a
distributed memory system.
High level commands for data manipulation
C.store_in(b,{i, j}) Store the result of the computation C(i,j) in b[i,j].
C.cache_shared_at(P,i) Cache (copy) the buffer of C in shared memory.
Copying from global to shared GPU memory
will be done at loop level i of the computation P.
The amount of data to copy, the access functions,
and synchronization are computed automatically.
C.cache_local_at(P, i) Similar to cache_shared_at but stores in
local GPU memory.
send(d, src, s, q, p) Create a send operation. d: vector of iterators
to represent the iteration domain of the send;
src: source buffer; s: size; q: destination node;
p: properties (synchronous, asynchronous,
blocking, …).
receive(d,dst,s,q,p) Create a receive operation. Arguments similar
to send except q, which is the source node.
Low level commands for data manipulation
Buffer b(sizes, type) Declare a buffer (sizes: a vector of dimension
sizes).
b.allocate_at(p, i) Return an operation that allocates b at the loop
i of p. An operation can be scheduled like
any computation.
C.buffer() Return the buffer associated to the computation
C.
b.set_size(sizes) Set the size of a buffer. sizes: a vector of
dimension sizes.
b.tag_gpu_global() Tag buffer to be stored in global GPU memory.
b.tag_gpu_shared() Tag buffer to be stored in shared GPU memory.
b.tag_gpu_local() Tag buffer to be stored in local GPU memory.
b.tag_gpu_constant() Tag buffer to be stored in constant GPU memory.
C.host_to_device() Return an operation that copies C.buffer() from
host to device.
C.device_to_host() Return an operation that copies C.buffer() from
device to host.
copy_at(p, i, bs, bd) Return an operation that copies the buffer bs
to the buffer bd at the loop i of p. Used for
copies between global, shared and local.
Commands for synchronization
barrier_at(p, i) Create a barrier at the loop level i of p.
TABLE II: Examples of Tiramisu Scheduling Commands

Novel commands introduced by Tiramisu are highlighted in bold in Table II. They include array allocation, copying data between memory hierarchies, sending and receiving data between nodes, and synchronization. Calls to cache_shared_at(), cache_local_at(), allocate_at(), copy_at(), and barrier_at() return an operation that can be scheduled like any other computation (an operation in Tiramisu is a special type of computation that does not return any value). The operations cache_shared_at() and cache_local_at() can be used to create a cache for a buffer (GPU only). They automatically compute the amount of data that needs to be cached, perform the data copy, and insert any necessary synchronization.

                                   Tiramisu Scheduling Commands                       Pseudocode Representing Code Generated by Tiramisu
(a)
1 // Scheduling commands for targeting
2 // a multicore architecture.
3
4 // Tiling and parallelization.
5 Var i0, j0, i1, j1;
6 by.tile(i, j, 32, 32, i0, j0, i1, j1);
7 by.parallelize(i0);
8 bx.compute_at(by, j0);
1
2 Parallel for(i0 in 0..floor((N-2)/32))
3   for(j0 in 0..floor((M-2)/32))
4    bx[32,34,3];
5    // Tiling with redundancy
6    for(i1 in 0..min((N-2)%32,32)+2)
7     for(j1 in 0..min((M-2)%32,32)+2)
8     int i = i0*32+i1
9     int j = j0*32+j1
10     for (c in 0..3)
11      bx[i1][j1][c]=
12        (in[i][j][c] + in[i][j+1][c]
13                     + in[i][j+2][c])/3
14
15    for(i1 in 0..min(N-2,32))
16     for(j1 in 0..min(M-2,32))
17     int i = i0*32+i1
18     int j = j0*32+j1
19     for (c in 0..3)
20      by[i][j][c]=
21        (bx[i][j][c] + bx[i+1][j][c]
22                     + bx[i+2][j][c])/3
(b)
1 // Scheduling commands for targeting GPU.
2
3 // Tile i and j and map the resulting dimensions
4 // to GPU
5 Var i0, j0, i1, j1;
6 by.tile_gpu(i, j, 32, 32, i0, j0, i1, j1);
7 bx.compute_at(by, j0);
8 bx.cache_shared_at(by, j0);
9
10 // Use struct-of-array data layout
11 // for bx and by.
12 bx.store_in({c,i,j});
13 by.store_in({c,i,j});
14
15 // Create data copy operations
16 operation cp1 = in.host_to_device();
17 operation cp2 = by.device_to_host();
18
19 // Specify the order of execution of copies
20 cp1.before(bx, root);
21 cp2.after(by, root);
1
2 host_to_device_copy(in_host, in);
3
4 GPUBlock for(i0 in 0..floor((N-2)/32))
5  GPUBlock for(j0 in 0..floor((M-2)/32))
6   shared bx[3,32,34];
7   // Tiling with redundancy
8   GPUThread for(i1 in 0..min((N-2)%32,32)+2)
9    GPUThread for(j1 in 0..min((M-2)%32,32)+2)
10     int i = i0*32+i1
11     int j = j0*32+j1
12     for (c in 0..3)
13      bx[c][i1][j1]=
14        (in[i][j][c] + in[i][j+1][c]
15                     + in[i][j+2][c])/3
16
17   GPUThread for(i1 in 0..min(N-2,32))
18    GPUThread for(j1 in 0..min(M-2,32))
19     int i = i0*32+i1
20     int j = j0*32+j1
21     for (c in 0..3)
22      by[c][i][j]=
23        (bx[c][i][j] + bx[c][i+1][j]
24                     + bx[c][i+2][j])/3
25
26 device_to_host_copy(by, by_host);
(c)
1 // Scheduling commands for targeting
2 // a distributed system
3
4 // Declare additional iterators
5 Var is(1, Nodes), ir(0, Nodes-1), i0, i1;
6
7 // Split loop i into loops i0 and i1 and
8 // parallelize i1
9 bx.split(i, N/Nodes, i0, i1); bx.parallelize(i1);
10 by.split(i, N/Nodes, i0, i1); by.parallelize(i1);
11
12 // Communicate the border rows where necessary
13 send s =
14   send({is}, lin(0,0,0), M*2*3, is-1, {ASYNC});
15 recv r =
16   receive({ir}, lin(N,0,0), M*2*3, ir+1, {SYNC}, s);
17
18 // Order execution
19 s.before(r, root);
20 r.before(bx, root);
21
22 // Distribute the outermost loops
23 bx.distribute(i0); by.distribute(i0);
24 s.distribute(is); r.distribute(ir);
1 // We assume that in[][][] is initially
2 // distributed across nodes.  Each node
3 // has a chunk of the original
4 // in[][][] that we call lin[][][].
5
6 // Start by exchanging border rows of
7 // lin[][][]
8 distributed for (is in 1..Nodes) 
9   send(lin(0,0,0), M*2*3, is-1,{ASYNC})
10 distributed for (ir in 0..Nodes-1)
11   recv(lin(N,0,0), M*2*3, ir+1, {SYNC})
12
13 distributed for (i0 in 0..Nodes)
14   parallel for (i1 in 0..(N-2)/Nodes)
15    int i = i0*((N-2)/Nodes) + i1
16    for (j in 0..M-2)
17      for (c in 0..3)
18        bx[i][j][c] =
19          (lin[i][j][c] + lin[i][j+1][c]
20                        + lin[i][j+2][c])/3
21
22 distributed for (i0 in 0..Nodes)
23   parallel for (i1 in 0..(N-2)/Nodes)
24    int i = i0*((N-2)/Nodes) + i1
25    for (j in 0..M-2)
26      for (c in 0..3)
27        by[i][j][c] =
28            (bx[i][j][c] + bx[i+1][j][c]
29                         + bx[i+2][j][c])/3
30
31 // We assume that no gather operation on
32 // by[][][] is needed
Fig. 3: Three examples illustrating Tiramisu scheduling commands (left) and the corresponding generated code (right). (a) shows scheduling commands for mapping to a multicore architecture; (b) shows scheduling commands for mapping to GPU; (c) uses commands to map to a distributed CPU machine.

The use of allocate_at(), copy_at(), and barrier_at() allows Tiramisu to automatically compute iteration domains for the data copy, allocation, and synchronization operations. This is important because it relieves the user from guessing or computing the iteration domain manually, especially when exploring different possible schedules. For example, consider copying a buffer from global memory to shared memory in a loop nest executing on a GPU. The size of the area to copy and the iteration domain of the copy operation itself (which is a simple assignment in this case) depends on whether the loop is tiled, the tile size, and whether any other loop transformation has already been applied. Tiramisu simplifies this step by automatically computing the iteration domain and the area of data to copy from the schedule.
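For illustration, the following is a sketch (assuming the Table II command signatures; the buffer sizes, the p_float32 element-type tag, and the exact ordering calls are illustrative assumptions) of staging the buffer of bx into GPU shared memory at loop level j0 of by using the low level commands; Tiramisu derives the copied region and the copy's iteration domain from the schedule:

// Staging buffer in GPU shared memory (one 32x34 tile for 3 channels).
Buffer bx_shared({3, 32, 34}, p_float32);
bx_shared.tag_gpu_shared();

// Allocate the staging buffer and copy into it from the global-memory
// buffer of bx, both at loop level j0 of by; then synchronize.
operation alloc = bx_shared.allocate_at(by, j0);
operation cp    = copy_at(by, j0, bx.buffer(), bx_shared);
operation sync  = barrier_at(by, j0);

// Operations are scheduled like computations.
cp.after(alloc, j0);
sync.after(cp, j0);
by.after(sync, j0);

The higher level cache_shared_at() command used in Figure 3-(b) collapses these steps, computing the amount of data to copy and the required synchronization automatically.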

To illustrate more Tiramisu scheduling commands, let us take the blur example again from Figure 2 and map it for execution on a multicore architecture. The necessary scheduling commands are shown in Figure 3-(a) (left). The tile() command tiles the computation by. The compute_at() command computes the tile of bx that needs to be consumed by by at the loop level j0. This transformation introduces redundant computations (in this case) and is known as overlapped tiling [Krishnamoorthy:2007:EAP:1273442.1250761]. The parallelize() command parallelizes the i0 loop.

Now let us take the same example but map the two outermost loops of bx and by to GPU. The necessary scheduling commands are shown in Figure 3-(b) (left). The tile_gpu() command tiles the computation by then maps the new loops to GPU block and thread dimensions. The compute_at() command computes the tile of bx needed by by at the loop level j0 (this introduces redundant computations). cache_shared_at() instructs Tiramisu to store the results of the bx computation in shared memory. Copying from global to shared memory will be done at the loop level j0 of by. The subsequent store_in() command specifies the access functions for bx and by. In this case, it indicates that these computations are stored in a SOA (struct-of-array) data layout (to allow for coalesced accesses). The final commands create data copy operations (host-to-device and device-to-host) and schedule them.

Suppose we want to run the blur example on a distributed system with a number of multicore CPU nodes equal to Nodes. Figure 3-(c) (left) shows the scheduling commands to use in this case. We assume that the array in[][][] is initially distributed across nodes such that node n has the chunk of data represented by in[n*((N-2)/Nodes)..(n+1)*((N-2)/Nodes),*,*]. In other words, this corresponds to row n*(N-2)/Nodes through row (n+1)*((N-2)/Nodes). This chunk is stored in the local array lin[][][].

send() and recv() define communication for the border regions, assuming that each node has a chunk of in. The blur computation for a chunk stored in node n requires the first two rows of data from the chunk stored in node n+1. These two rows are referred to as the border region. The send() will send 2 rows (M*2*3 contiguous data elements) from node is to node is-1 starting from lin(0,0,0), which corresponds to the first two rows of the chunk on node is. In response, the recv for node ir will receive 2 rows (M*2*3 contiguous data elements) from node ir+1, which corresponds to ir receiving the first two rows from node ir+1. The receive for node ir places these elements starting at the end of its local chunk by starting at lin(N,0,0). Additionally, {ASYNC} defines an asynchronous send and {SYNC} defines a synchronous receive. Finally, we tag the appropriate loops (the outer loops of bx, by, s, and r) to be distributed (i.e., we tag each iteration to run on a different node).

All other scheduling commands in Tiramisu can be composed with sends, recvs, and distributed loops, as long as the composition is semantically correct.

IV The Tiramisu IR

The main goal of Tiramisu’s multi-layer intermediate representation is to simplify the implementation of scheduling commands by applying them in a specific order. This section illustrates why, and describes the layers of the Tiramisu IR.

IV-A Rationale for a Multi-layer IR

In this section we provide examples showing why current intermediate representations are not adequate for Tiramisu and why we need a multi-layer IR.

Most current intermediate representations use memory to communicate between program statements. This creates memory-based dependencies in the program, and forces compilers to choose data layout before deciding on optimizations and mapping to hardware. Optimizing a program for different hardware architectures usually requires modifying the data layout and eliminating memory-based dependencies since they restrict optimizations [maydan1992data]. Thus, any data layout specified before scheduling must be undone to allow more freedom for scheduling, and the code must be adapted to use the data-layout best-suited for the target hardware. Applying these data-layout transformations and the elimination of memory-based dependencies is challenging [gupta1997privatization, autoPrivatPeng, li_array_1992, feautrier_array_1988, midkiff_automatic_2012, maydan_array-data_1993, lefebvre_automatic_1998, Qui00, Darte_contraction_2005].

Another example that demonstrates the complexity of code generation is mapping buffers to shared and local memory on GPU. The amount of data that needs to be copied to shared memory and when to perform synchronization both depend on how the code is optimized (for example, whether the code has two-level tiling or not). The same applies to deciding the amount of data to send or receive when generating distributed code. Therefore, buffer mapping to memory hierarchies, communication management, and synchronization should not occur before scheduling.

Tiramisu addresses these complexities in code generation by using a multi-layer IR that fully separates the architecture-independent algorithm from loop transformations, data layout and communication. The first layer representation describes the pure algorithm using producer-consumer relationships without memory locations. The second layer specifies the order of computation, along with which processor computes each value; this layer is suitable for performing a vast number of optimizations without dealing with concrete memory layouts. The third layer specifies where to store intermediate data before they are consumed. The fourth layer adds all the necessary communication and synchronization operations.

The separation of layers defines a specific order for applying optimizations and ensures that compiler passes in a given layer need not worry about modifying or undoing a decision made in an earlier layer. For example, the phase that specifies the order of computations and where they occur can safely assume that no data-layout transformations are required. This simple assumption allows Tiramisu to avoid relying on the large body of research that focuses on data-layout transformations to enable scheduling [gupta1997privatization, autoPrivatPeng, li_array_1992, feautrier_array_1988, midkiff_automatic_2012, maydan_array-data_1993, lefebvre_automatic_1998, Qui00, Darte_contraction_2005].

IV-B Background

In this section, we provide an overview of two main concepts used in the polyhedral model: integer sets and maps. These two concepts will be used in later sections to define the different IR layers.

Integer sets represent iteration domains while maps are used to represent memory accesses and to transform iteration domains and memory accesses (apply loop nest and memory access transformations). More details and formal definitions for these concepts are provided in [verdoolaege_isl:_2010, pencil_pact, polyhedral].

An integer set is a set of integer tuples described using affine constraints. An example of a set of integer tuples is

{(1,1); (2,1); (3,1); (1,2); (2,2); (3,2)}

Instead of listing all the tuples as we do in the previous set, we can describe the set using affine constraints over loop iterators and symbolic constants as follows:

{S(i,j) : 1 ≤ i ≤ 3 ∧ 1 ≤ j ≤ 2}

where i and j are the dimensions of the tuples in the set.

A map is a relation between two integer sets. For example

{S1(i,j) → S2(i+2,j+2) : 1 ≤ i ≤ 3 ∧ 1 ≤ j ≤ 2}

is a map between tuples in the set S1 and tuples in the set S2 (e.g., the tuple S1(2,1) maps to the tuple S2(4,3)).

All sets and maps in Tiramisu are implemented using the Integer Set Library (ISL) [verdoolaege_isl:_2010]. We also use the ISL library notation for sets and maps throughout the paper.
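As a standalone illustration of these two concepts (a sketch using the ISL C API directly, independent of Tiramisu itself), the following program parses an integer set and a map written in ISL notation and applies the map to the set:

#include <cstdio>
#include <cstdlib>
#include <isl/ctx.h>
#include <isl/set.h>
#include <isl/map.h>

int main() {
  isl_ctx *ctx = isl_ctx_alloc();

  // Iteration domain of a two-deep loop nest with symbolic bounds N and M.
  isl_set *domain = isl_set_read_from_str(
      ctx, "[N, M] -> { S[i, j] : 0 <= i < N and 0 <= j < M }");

  // An affine map that shifts both dimensions by two.
  isl_map *shift = isl_map_read_from_str(
      ctx, "{ S[i, j] -> S[i + 2, j + 2] }");

  // Apply the map to the set (both arguments are consumed by the call).
  isl_set *shifted = isl_set_apply(domain, shift);

  char *str = isl_set_to_str(shifted);
  printf("%s\n", str);  // e.g. [N, M] -> { S[i, j] : 2 <= i < N + 2 and ... }
  free(str);

  isl_set_free(shifted);
  isl_ctx_free(ctx);
  return 0;
}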

IV-C The Multi-Layer IR

Fig. 4: Tiramisu overview

A typical workflow for using Tiramisu is illustrated in Figure 4. The user writes the pure algorithm and provides a set of scheduling commands. The first layer of the IR is then transformed into lower layers, and finally Tiramisu generates LLVM or other appropriate low-level IR. Tiramisu uses integer sets to represent each of the four IR layers and uses maps to represent transformations on the iteration domain and data layout. The remainder of this section describes the four layers of the Tiramisu IR.

IV-C1 Layer I (Abstract Algorithm)

Layer I of Tiramisu specifies the algorithm without specifying when and where computations occur, how data should be stored in memory, or communication. Values are communicated via explicit producer-consumer relationships.

For example, the Layer I representation of the code in Figure 2 for the computation by is as follows:

{by(i,j,c) : 0 ≤ i < N-2 ∧ 0 ≤ j < M-2 ∧ 0 ≤ c < 3} : (bx(i,j,c) + bx(i+1,j,c) + bx(i+2,j,c))/3

The first part, {by(i,j,c) : 0 ≤ i < N-2 ∧ 0 ≤ j < M-2 ∧ 0 ≤ c < 3}, specifies the iteration domain of the computation by, while the second part is the computed expression. The iteration domain is the set of tuples by(i,j,c) such that 0 ≤ i < N-2 ∧ 0 ≤ j < M-2 ∧ 0 ≤ c < 3. Computations in Layer I are not ordered; declaration order does not affect the order of execution, which is specified in Layer II.

IV-C2 Layer II (Computation Management)

Layer II of Tiramisu specifies the order of execution of computations and the processor on which they execute. This layer does not specify how intermediate values are stored in memory; this simplifies optimization passes since these transformations do not need to perform complicated data-layout transformations. The transformation of Layer I into Layer II is done automatically using scheduling commands.

Figure 3-(b) (right) shows the GPU-optimized version of the code, produced by the scheduling and data-layout commands on the left side. The corresponding Layer II representation for the by computation is shown below:

{by(i0(gpuB), j0(gpuB), i1(gpuT), j1(gpuT), c) : i = i0*32+i1 ∧ j = j0*32+j1 ∧ 0 ≤ i1 < 32 ∧ 0 ≤ j1 < 32 ∧ 0 ≤ i < N-2 ∧ 0 ≤ j < M-2 ∧ 0 ≤ c < 3} : (bx(i,j,c) + bx(i+1,j,c) + bx(i+2,j,c))/3

Computations in Layer II are ordered based on their lexicographical order (for example, the computation S(0,i,j) is lexicographically before the computation S(1,i,j), and the computations S(i,0,j) are lexicographically before the computations S(i,1,j)). The set before the colon in the representation is an ordered set of computations. The gpuB tags on the dimensions i0 and j0 indicate that each iteration (i0, j0) is mapped to the GPU block (i0, j0), and similarly the gpuT tags map (i1, j1) to GPU threads. In Layer II, the total ordering of these tuples determines execution order.

Computations in this layer are ordered and assigned to a particular processor; the order is dictated by time dimensions and space dimensions. Time dimensions specify the order of execution relative to other computations while space dimensions specify on which processor each computation executes. Space dimensions are distinguished from time dimensions using tags, which consist of a processor type. Currently, Tiramisu supports the following space tags:

cpu the dimension runs on a CPU in a shared memory system
node the dimension maps to nodes in a distributed system
gpuT the dimension maps to a gpu thread dimension.
gpuB the dimension maps to a gpu block dimension.

Tagging a dimension with a processor type indicates that the dimension will be distributed over processors of that type; for example, tagging a dimension with cpu will execute each iteration of that loop dimension on a separate CPU.

Other tags that transform a dimension include:

vec(s) vectorize the dimension (s is the vector length)
unroll unroll the dimension

Computations mapped to the same processor are ordered by projecting the computation set onto the time dimensions and comparing their lexicographical order.

IV-C3 Layer III (Data Management)

Layer III makes the data layout concrete by specifying where intermediate values are stored. Any necessary buffer allocations/deallocations are also constructed in this level. Tiramisu generates this layer automatically from Layer II by applying the scheduling commands for data mapping.

The data management layer specifies memory locations for storing computed values. It consists of the Layer II representation along with allocation/deallocation statements, and a set of access relations, which map a computation from Layer II to array elements read or written by that computation. Scalars are treated as single-element arrays. For each buffer, an allocation statement is created, specifying the type of the buffer and its size. Similarly, a deallocation statement is also added.

Possible data mappings in Tiramisu include mapping computations to structures-of-arrays, arrays-of-structures, and contraction of multidimensional arrays into arrays with fewer dimensions or into scalars. It is also possible to specify more complicated accesses, such as storing a computation c(i,j) into the array element c(i%2, j%2) or into c(j, i).

In the example of Figure 3-(b) (left), setting the data access using by.store_in({c,i,j}) indicates that the result of the computation by(i,j,c) is stored in the array element by[c,i,j]. This command generates the following map in Layer III:

{by(i0(gpuB), j0(gpuB), i1(gpuT), j1(gpuT), c) → by[c, i0*32+i1, j0*32+j1]}

Data mapping in Tiramisu is an affine relation that maps each computation to a buffer element. Tiramisu allows any data-layout mapping expressible as an affine relation.
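As a further hypothetical illustration (not one of the paper's benchmarks), the contraction mentioned above can go all the way down to a scalar: a reduction computation sum(i) whose intermediate values are never reused can be stored with the relation {sum(i) → s[0]}, where s is a single-element buffer, while relaxing it to {sum(i) → s[i]} would keep every partial result. Switching between the two layouts requires changing only this relation, not the algorithm.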

IV-C4 Layer IV (Communication Management)

Layer IV adds synchronization and communication operations to the representation, mapping them to the time-space domain, and concretizes when statements for buffer allocation/deallocation occur. This layer is generated automatically from Layer III by applying user-specified commands. Any allocation or deallocation operation added in Layer III is also mapped to the time-space domain in this layer.

V Compiler Implementation

Since the main contribution of this paper is not in introducing new techniques for code generation, we only provide a high level overview of how Tiramisu generates the IR layers and target code. Throughout the section, we refer the reader to the appropriate literature for more details.

In the rest of this section we describe how scheduling commands transform Layers I, II, III and IV. We also describe how target code is generated from Layer IV.

Transforming Layer I into Layer II

Transforming Layer I into Layer II is done using two types of scheduling commands: (1) commands for loop nest transformations (such as tile(), split(), shift(), interchange()); and (2) commands for mapping loop levels to hardware (including
parallelize(), vectorize(), gpu()).

The first type of scheduling command applies a map that transforms the iteration domain. For example, when a tiling command is applied on the by computation in Figure 2, it gets translated into the following map:

{by(i,j,c) → by(i0,j0,i1,j1,c) : i0 = floor(i/32) ∧ j0 = floor(j/32) ∧ i1 = i%32 ∧ j1 = j%32}

This map is then applied to the Layer I representation, producing the Layer II representation. Composing transformations is done by composing different maps, since the composition of two affine maps is an affine map.
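For instance (an illustrative composition, again using a tile size of 32), splitting the i loop of a computation C(i,j) and then interchanging the two new dimensions corresponds to composing the two maps

{C(i,j) → C(i0,i1,j) : i0 = floor(i/32) ∧ i1 = i%32}
{C(i0,i1,j) → C(i0,j,i1)}

whose composition is the single affine map {C(i,j) → C(i0,j,i1) : i0 = floor(i/32) ∧ i1 = i%32}.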

The second type of command adds space tags to dimensions to indicate which loop levels to parallelize, vectorize, map to GPU blocks, and so on.

Transforming Layer II into Layer III

This is done by augmenting Layer II with access relations. By default, Tiramisu uses identity access relations (i.e., access relations that store a computation C(i,j) into a buffer C[i,j]). If the store_in() command is used, the access relation is deduced from that command instead. Buffer allocations are also added while transforming Layer II into Layer III. The scheduling command b.allocate_at(C, i) creates a new statement that allocates the buffer b in the same loop nest of the computation C but at loop level i.

Transforming Layer III into Layer IV

Scheduling commands for data communication (send and receive), synchronization, and for copying data between global, shared and local memory are all translated into statements. For example, the send() and receive() commands are translated into function calls that will be translated into MPI calls during code generation.

V-A Code Generation

Generating code from the set of computations in Layer IV amounts to generating nested loops that visit each computation in the set, once and only once, while following the lexicographical ordering between the computations [Bas04, Iri88, Qui00]. Tiramisu relies on an implementation of the Cloog [Bas04] code generation algorithm provided by the ISL library [verdoolaege_isl:_2010]. The Tiramisu code generator takes Layer IV IR and generates an abstract syntax tree (AST). The AST is then traversed to generate lower level code for specific hardware architectures (depending on the target backend).

The multicore CPU code generator generates LLVM IR from the AST. In order to generate LLVM IR, we use Halide as a library: we first generate Halide IR and then lower it to LLVM IR using Halide. We do not use Halide to perform any high level code optimization; all code optimizations are performed by Tiramisu before generating the Halide IR. The Halide compiler then lowers the Halide IR loops into LLVM IR.

The GPU code generator generates LLVM IR for the host code and CUDA for the kernel code. Data copy commands and information about where to store buffers (shared, constant, or global memory) are all provided in Layer IV. Tiramisu translates these into the equivalent CUDA data copy calls and buffer allocations in the generated code. Computation dimensions tagged with GPU thread or GPU block tags are translated into the appropriate GPU thread and block IDs in the lowered code. The Tiramisu code generator can generate coalesced array accesses and can use shared and constant memories. It can also avoid thread divergence by separating full tiles (loop nests with a size that is a multiple of the tile size) from partial tiles (the remaining part of a loop).

The code generator for distributed memory systems utilizes MPI. During code generation, all the function calls for data copying are translated to the equivalent MPI function calls. The generated code is postprocessed and each distributed loop is converted into a conditional based on the MPI rank of the executing process. For example:

for(q in 1..N-1) {...} // distribute on q

becomes:

q = get_rank(); if (q >= 1 and q < N-1) {...}

V-B Support for Non-Affine Iteration Spaces

Tiramisu represents non-affine array accesses, non-affine loop bounds, and non-affine conditionals in a way similar to Benabderrahmane et al. [Benabderrahmane]. For example, a conditional is transformed into a predicate and attached to the computation. The list of accesses of the computation is the union of the accesses of the computation in the two branches of the conditional; this is an over-approximation. During code generation, a preprocessing step inserts the conditional back into the generated code. The efficiency of these techniques was demonstrated by Benabderrahmane et al. [Benabderrahmane] and was confirmed in the PENCIL compiler [pencil]. Our experiences in general, as well as the experiments in this paper, show that these approximations do not hamper performance.
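For instance, a hypothetical statement of the form

if (x[i] > 0)
  A[i] = B[i] + C[i];

is represented as the computation A(i) = B(i) + C(i) with the predicate x(i) > 0 attached to it; its recorded accesses are the union of the accesses of the two branches of the conditional (here the reads of x(i), B(i), and C(i) and the write to A(i), with an empty else branch), and the conditional itself is re-inserted around the statement during code generation.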

VI Evaluation

We evaluate Tiramisu on two sets of benchmarks. The first is a set of deep learning and linear algebra benchmarks. The second is a set of image processing benchmarks.

We performed the evaluation on a cluster of 16 nodes. Each node is a dual-socket machine with two 12-core Intel Xeon E5-2680v3 CPUs, 128 GB RAM, Ubuntu 14.04, and an Infiniband interconnect. We use the MVAPICH2 2.0 [mvapich2] implementation of MPI for the distributed tests. The multicore experiments (CPU) are performed on one of these nodes. GPU experiments are performed on an NVIDIA Tesla K40 with 12 GB of RAM. Each experiment is repeated and the median time is reported.

VI-A Deep Learning and Linear Algebra Benchmarks

Fig. 5: Normalized Execution Times for Deep Learning, Linear and Tensor Algebra Benchmarks.

We evaluated Tiramisu by implementing a set of deep learning and linear algebra benchmarks, including Conv (a direct implementation of a neural network convolution layer), VGG (a block of a VGG neural network), sgemm (matrix multiplication used to implement convolutions), HPCG (a benchmark for multigrid preconditioned conjugate gradient; http://www.hpcg-benchmark.org/), and Baryon (a dense tensor contraction code for constructing Baryon Building Blocks [detmold2013nuclear]). For all of these benchmarks, we compare the Tiramisu implementation with Intel MKL, except for HPCG and Baryon, where we compare Tiramisu with reference implementations. Figure 5 shows a comparison between the performance of CPU code generated by Tiramisu and reference code. For sgemm and HPCG we use matrices of size 1060×1060 and vectors of size 1060, while for Conv and VGG we use 16 input/output features and a batch size of 32. For Baryon, we use the same tensor sizes as in the reference code.

For sgemm, Tiramisu matches the performance of Intel MKL. sgemm is interesting in particular because the Intel MKL implementation of this kernel is well-known for its hand-optimized performance. We used a large set of optimizations to match Intel MKL. These optimizations include two-level blocking of the three-dimensional sgemm loop, vectorization, unrolling, array packing, register blocking, and separation of full and partial tiles (which is crucial to enable vectorization, unrolling, and reducing control overhead). We also used auto-tuning to find the best tile size and unrolling factor for the machine on which we run our experiments.
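A sketch of how these optimizations compose in the scheduling language is shown below (assuming the Table II signatures; the computation name gemm, its iterators i, j, k, and the tile, vector, and unroll factors are illustrative, not the auto-tuned values used in the experiments):

Var i0, j0, i1, j1, i00, j00, i11, j11, k0, k1;
gemm.tile(i, j, 64, 64, i0, j0, i1, j1);      // first level of blocking
gemm.tile(i1, j1, 8, 8, i00, j00, i11, j11);  // second level of blocking
gemm.split(k, 4, k0, k1);
gemm.parallelize(i0);
gemm.vectorize(j11, 8);                       // vectorize the innermost j loop
gemm.unroll(k1, 4);                           // unroll the reduction loop

Array packing, register blocking, and the separation of full from partial tiles are applied on top of this loop structure.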

For the Conv kernel, Tiramisu outperforms the Intel MKL implementation because the Tiramisu-generated code uses a fixed size for the convolution filter. We generate specialized versions for common convolution filter sizes. This allows the Tiramisu compiler to apply optimizations that Intel MKL does not perform; for example, this allows Tiramisu to unroll the innermost (convolution filter) loops since their size is known at compile time. In VGG, Tiramisu fuses the two convolution loops of the VGG block, which improves data locality. In addition, we generate code with fixed sizes for convolution filters (as we did in Conv). This provides a speedup over Intel MKL. The Tiramisu speedup over the Baryon reference code is achieved through vectorization, but this vectorization is not trivial since it requires the application of array expansion and then the use of scatter/gather operations, neither of which is implemented in the reference Baryon code.

VI-B Image Processing Benchmarks

We used the following image processing benchmarks in our evaluation: edgeDetector, a ring blur followed by Roberts edge detection [roberts65]; cvtColor, which converts an RGB image to grayscale; conv2D, a simple 2D convolution; warpAffine, which does affine warping on an image; gaussian, which performs a gaussian blur; nb, a synthetic pipeline composed of 4 stages that computes a negative and a brightened image from the same input image; and ticket #2373, a code snippet from a bug filed against Halide. This code simply has a loop that assigns a value to an array, but the iteration space is not rectangular (it tests if x >= r, where x and r are loop iterators). The inferred bounds in this code are over-approximated, causing the generated code to fail due to an assertion during execution. Four of these benchmarks have non-affine array accesses and non-affine conditionals for clamping (to handle boundary cases): edgeDetector, conv2D, warpAffine and gaussian. We used an RGB input image for the experiments.
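A minimal sketch of the ticket #2373 pattern described above (the variable names are illustrative):

for (int r = 0; r < N; r++)
  for (int x = 0; x < N; x++)
    if (x >= r)          // triangular (non-rectangular) iteration space
      out[x] = value;

Interval-based bounds inference has to approximate the range of x by a full rectangular interval, whereas a polyhedral representation keeps the constraint x >= r as part of the iteration domain.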

Fig. 6: A heatmap comparing the normalized execution times of code generated by Tiramisu with other frameworks (lower is better). Comparison is performed on three architectures: single-node multicore, GPU, distributed (16 nodes). ”-” indicates unsupported benchmarks.

We compare Tiramisu with two other compilers: Halide [halide_12], an industrial-quality DSL for image processing that has a scheduling language, and PENCIL [pencil_paper], a state-of-the-art fully automatic polyhedral compiler.

Figure 6 compares the normalized execution time of code generated by Tiramisu to other state-of-the-art frameworks on three architectures: single-node multicore, GPU and distributed (16 nodes). For the single-node multicore and GPU we compare Tiramisu to Halide and PENCIL. For the distributed architecture, we compare to distributed Halide [denniston2016distributed].

Single-node multicore

In four of the benchmarks, the performance of the code generated by Tiramisu matches the performance of Halide. We use the same schedule for both implementations; these schedules were hand-written by Halide experts. The results for edgeDetector, conv2D, warpAffine and gaussian, which have non-affine array accesses and conditionals, show that Tiramisu handles such access patterns efficiently.

Two of the other benchmarks, edgeDetector and ticket #2373, cannot be implemented in Halide. The following code snippet shows edgeDetector:

/* Ring Blur Filter */
R(i,j) =(Img(i-1,j-1) + Img(i-1,j) + Img(i-1,j+1)+
         Img(i,j-1)   +              Img(i,j+1)  +
         Img(i+1,j-1) + Img(i+1,j) + Img(i+1,j+1))/8
/* Roberts Edge Detection Filter */
Img(i,j) = abs(R(i,j)  - R(i+1,j-1)) +
           abs(R(i+1,j)- R(i,j-1))

edgeDetector creates a cyclic dependence graph with a cycle of length two (R is written in the first statement and read in the second, while Img is written in the second and read in the first), but Halide can only express programs with an acyclic dependence graph, with some exceptions; this restriction is imposed by the Halide language and compiler to avoid the need to prove the legality of some optimizations (since proving the legality of certain optimizations is difficult in the Halide interval-based representation). Tiramisu does not have this restriction since it checks transformation legality using dependence analysis [feautrier_dataflow_1991].

In ticket #2373, which exhibits a triangular iteration domain, Halide’s bounds inference over-approximates the computed bounds, which leads the generated code to fail in execution. This over-approximation in Halide is due to the use of intervals to represent iteration domains, which prevents Halide from performing precise bounds inference for non-rectangular iteration spaces. Tiramisu can handle this case naturally since it relies on the polyhedral model where sets can include any affine constraint in addition to loop bounds. These examples show that the model exposed by Tiramisu naturally supports more complicated code patterns than an advanced, mature DSL compiler.

For nb, the code generated from Tiramisu achieves a speedup over the Halide-generated code. This is primarily due to loop fusion. In this code, Tiramisu enhances data locality by fusing the loops of the pipeline stages into one loop; this is not possible in Halide, which cannot fuse loops if they update the same buffer. Halide makes this conservative assumption because otherwise it cannot prove that the fusion is legal. This is not the case for Tiramisu, which uses dependence analysis to prove correctness.

The slowdown of the PENCIL compiler in gaussian is due to a suboptimal decision made by PENCIL. The gaussian kernel is composed of two successive loop nests (each of them contains three loop levels). PENCIL decides to interchange the two innermost loop levels in order to enable the fusion of the two successive loop nests. This decision minimizes the distance between producer and consumer statements (first and second loop nests), but it also reduces spatial locality because it leads to non-contiguous memory accesses. The right decision in this case is a trade-off. Such a trade-off is not captured by the Pluto automatic scheduling algorithm used within PENCIL. For the other kernels, both Tiramisu and Halide apply vectorization and unrolling on the innermost loops, while PENCIL does not since the multicore code generator of PENCIL does not implement these two optimizations. For warpAffine, both Tiramisu and Halide have a high speedup over PENCIL because the unique loop nest in this benchmark has 25 statements, and vectorizing the innermost loop transforms all of these statements to their vector equivalent while unrolling increases register reuse and instruction level parallelism on the cores of the test machine.

GPU

For the GPU backend, the reported times are the total execution times (data copy and kernel execution). Code generated by Tiramisu for conv2D and gaussian is faster than that of Halide because code generated by Tiramisu uses constant memory to store the weights array, while the current version of Halide does not use constant memory for its PTX backend. The only difference between the schedules of Tiramisu and Halide in these benchmarks is the use of tag_gpu_constant() in Tiramisu. Data copy times, for all the filters, are the same for Tiramisu and Halide. For nb, the code generated by Tiramisu achieves a speedup over that generated by Halide because Tiramisu is able to apply loop fusion, which Halide cannot apply.

Compared to PENCIL, the speedup in conv2D and gaussian is due to the fact that PENCIL generates unnecessarily complicated control flow within the CUDA kernel, which leads to thread divergence.

Distributed

We assume the data are already distributed across the nodes by rows. Of these benchmarks, nb, cvtColor and ticket #2373 do not require any communication; the other four require communication due to overlapping boundary regions in the distributed data.

Figure 6 compares the execution time of distributed Tiramisu and distributed Halide. Tiramisu is faster than distributed Halide in each case, with its largest speedup on conv2D. For the kernels involving communication, code generated by distributed Halide has two problems compared to Tiramisu: distributed Halide overestimates the amount of data it needs to send, and it unnecessarily packs contiguous data into a separate buffer before sending.

Fig. 7: Speedup of code generated by distributed Tiramisu for 2, 4, 8, and 16 nodes. The baseline is the execution time on 2 nodes.

Distributed Halide overestimates the amount of data it needs to send because the benchmarks have array accesses that cannot be analyzed statically (the array accesses are clamped to handle boundary cases, where clamp(i, 0, N) returns 0 if i < 0, N if i > N, and i otherwise); therefore distributed Halide cannot compute the exact amount of data to send. To avoid this problem, Tiramisu uses explicit communication through the send() and receive() scheduling commands. The use of these two commands is the only difference between the Tiramisu and distributed Halide implementations. These commands allow the user to specify exactly the amount of data to send and also allow the compiler to avoid unnecessary packing.

Figure 7 shows the speedup of the kernels with distributed Tiramisu when running on 2, 4, 8, and 16 nodes. This graph shows that distributed code generated from Tiramisu scales well as the number of nodes increases (strong scaling).

VII Conclusion

This paper introduces Tiramisu, a polyhedral compiler framework that features a scheduling language with commands for targeting multicore CPUs, GPUs, and distributed systems. The compiler is built on a four-layer intermediate representation that separates the algorithm, when and where computations occur, the data layout, and the communication. We evaluate Tiramisu by targeting a variety of backends and demonstrate that it generates code matching or outperforming state-of-the-art frameworks and hand-tuned code.

Acknowledgements

This work was supported by the ADA Research Center, a JUMP Center co-sponsored by SRC and DARPA.

References