Domain-Specific Multi-Level IR Rewriting for GPU

05/26/2020
by   Tobias Gysi, et al.
Google
ETH Zurich
0

Traditional compilers operate on a single generic intermediate representation (IR). These IRs are usually low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

01/10/2022

RISE Shine: Language-Oriented Compiler Design

The trend towards specialization of software and hardware - fuelled by t...
08/23/2021

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

This report presents some early results on code generation targeting ten...
04/07/2020

LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

Modern Hardware Description Languages (HDLs) such as SystemVerilog or VH...
09/06/2021

QSSA: An SSA-based IR for Quantum Computing

Quantum computing hardware has progressed rapidly. Simultaneously, there...
04/17/2019

Relay: A High-Level Compiler for Deep Learning

Frameworks for writing, compiling, and optimizing deep learning (DL) mod...
04/12/2021

AI Powered Compiler Techniques for DL Code Optimization

Creating high performance implementations of deep learning primitives on...
05/21/2018

The sufficiently smart compiler is a theorem prover

That the Haskell Compiler GHC is capable of proving non-trivial equaliti...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Domain-specific approaches are revolutionizing the generation of high-performance device-specific code and sparked the development of powerful domain-specific language (DSL) frameworks, often achieving performance numbers unattainable for general-purpose compilers [mullapudi2015polymage, tianqi2018tvm, rawat2015SDSLc, yask2016, sourouri2017panda, tang2011pochoir]. For example, Halide [ragan2013halide] automated the generation of high-performance code for image processing, XLA [leary2017xla]

exploited domain-specific compilation to accelerate deep learning, and Stella 

[gysi2015stella] was the first to move the weather and climate simulation to GPUs leading to 2.9 speedup [fuhrer2014stella].

The broad success of domain-specific compilers—over time—also exposed their largest weakness: their one-off implementations mostly separated from general-purpose production compiler pipelines. Halide, XLA, Stella, and others are specialized solutions for their respective domains that are not designed with reusability in mind. The small number of reusable compiler infrastructures, research-oriented such as ROSE [quinlan2000rose] or production such as LLVM [lattner2004llvm], evidences of a significant effort required to design and maintain the infrastructure compared to implementing domain-specific functionality. As a result, the ongoing trend of designing standalone DSL compilers compartmentalizes the developer communities, spreads the efforts, hinders innovation transfer, and leads us to ask: “how can we design a domain-specific compiler that (a) is cleanly decoupled from user-facing front-ends, (b) makes it easy to implement domain-transformations, and (c) clearly separates potentially generic components?”

Fig. 1: The Open Earth Compiler.

We take a practical case study-based approach to addressing this question by designing and implementing a domain-specific compiler for weather and climate modeling. While this domain uses the stencil computational pattern found in image processing [ragan2013halide] or seismic imaging [seismic], it often requires radically different optimization strategies to reach maximum performance [gysi2015stella]. Weather and climate models [harris2013fv3dycore, cosmo] operate on 3D domains and execute bandwidth-limited low-order stencils containing control flow. In comparison, image processing pipelines apply regular stencils to 2D data structures, and seismic imaging stencils are high-order and compute-intensive. On the other hand, the underlying abstraction of loops over multi-dimensional arrays, the arithmetic optimizations, and the conversion to device-specific GPU code is mostly identical across these domains, yet often reimplemented [sujeeth2014delite].

We propose to design DSL compilers using multi-level IR rewriting. This approach is a combination of (a) intermediate representations (IR) based on Static Single Assignment form (SSA) [rosen1988global], (b) operations with high-level semantics, and (c) progressive lowering, which provides an effective framework for reusable domain-specific high-performance code generation. SSA-based IRs allow us to reuse optimizations from general-purpose compilers [muchnick1998compilers]. High-level operations concisely encode domain properties and make them readily available as, e.g., SSA data flow without a need for costly analyses. Progressive lowering makes it natural to preserve domain information, to express transformations as high-level peephole optimizations [mckeeman1965peephole], and to introduce reusable lower-level abstractions. The recently introduced MLIR compiler infrastructure [lattner2020mlir] allows us to instantiate production-quality compiler IRs that follow the practice-proven IR design principles developed in LLVM [lattner2004llvm] over the last 15 years.

The Open Earth Compiler111https://github.com/spcl/open-earth-compiler/ we implemented (Fig. 1) is the first end-to-end compilation flow that leverages multi-level IR rewriting for high-performance code generation. Its core consists of a set of MLIR dialects, i.e., collections of domain-specific operations and transformations, and conversions between them. The Open Earth Compiler optimizes programs by progressively converting them from higher-level domain-specific dialects to lower-level platform-specific ones, using peephole-style rewrite patterns. Each dialect defines an abstraction that makes relevant analyses inexpensive and transformations convenient to implement.

The compilation process starts from the stencil dialect (Sec. IV) designed as a target for various user-facing DSLs as well as a data structure for domain-specific transformations such as stencil inlining (Sec. V-A). Stencils are then lowered through a series of IRs featuring index operations (Affine), structured control flow (SCF), and arithmetic operations (Standard), all of which are readily available in MLIR [lattner2020mlir], together with loop- and value-level transformations such as unrolling or common subexpression elimination. These IRs let us target a structured loop abstraction instead of low-level “goto”-based SSA IR commonly found in compiler backends. We use this structure to design a generic GPU kernel dialect (Sec. VI) and to implement loop-to-kernel conversion using simple patterns based on the parallelism information preserved from the stencil level, thus avoiding expensive GPU mapping algorithms [grosser2016pollyacc, verdoolaege2013ppcg]. The complete pipeline transforms our high-level climate-code into a fast target-specific binary.

While the stencil dialect is generic enough to cover a range of applications (e.g., image processing or seismic imaging), our focus is excellent performance for the climate domain. The semantics of our stencil operations enable us to replace complex sequences of loop transformations with generic instruction-level transformations, e.g., redundancy elimination, requiring little analysis to ensure validity. Other domains can adapt our stencil dialect to their transformation needs, or reuse only the mid- and low-level abstractions. We demonstrate that thanks to the multi-level IR rewriting, developing a domain-specific compiler with reusable components is surprisingly simple provided a sufficiently expressive infrastructure.


Our contributions are:

  • An approach to designing a modular domain-specific compiler using multi-level IR rewriting (Sec. II).

  • A stencil language expressed as an MLIR dialect, which encodes the high-level data flow of a stencil program as SSA def-use chains (Sec. IV).

  • A set of transformations to tune stencil programs at a high level using simple peephole optimizations instead of conventional loop transformations (Sec. V).

  • A separate-compilation scheme for GPUs based on a platform-neutral GPU dialect convertible to vendor-specific GPU code (Sec. VI).

  • An evaluation on benchmarks relevant to real climate models: COSMO (Europe) and FV3 (US) (Sec. VII).

Ii Multi-Level IR Rewriting

Multi-level IR rewriting promises to simplify the development of domain-specific compilers by defining a stack of reusable abstractions and by implementing the transformation at the most relevant level. The goal is to minimize the complexity of each level and to reduce the cost of analysis by encoding and preserving transformation validity preconditions directly in the IR.

Identifying pertinent domain-specific abstractions is paramount to the multi-level rewriting. Each new abstraction increases risks of excessive complexity or, on the contrary, of incompleteness where some workloads cannot be represented. We instantiate the Open Earth Compiler using the MLIR infrastructure [lattner2020mlir], carefully considering the abstractions it provides and introducing new ones when necessary. Our objective is to facilitate performance extraction from one of the two primary sources: parallelism and data locality, which often require conflicting transformations yielding complex optimization problems [zinenko2018scheduling]. Instead of attacking these problems frontally, we design abstractions so as to extract parallelism and locality information from the domain knowledge, using the following principles.

P1 Transformation-Driven Semantics. The domain abstractions for different levels of our pipeline, e.g., stencils or GPU kernels, should favor transformation-readiness over programmer-friendliness. Our objective is to build a stack of intermediate representations that enable the compiler to reason about domain-specific programs without resorting to complex analyses such as loop extraction [grosser2012polly] or dependence analysis [vasilache2006violated]. Each IR in the stack is focused on a specific set of domain transformations and designed to make all necessary information readily available. End-user usability aspects are deliberately deferred to DSL front-ends.

P2 Progressive Lowering. We aim for an effective and streamlined transformation pipeline where programs are progressively lowered [lattner2020mlir] from a high-level domain IR to a low-level target IR. The different IR abstractions should be designed to maintain high-level semantic information as long as necessary, such that a potentially complex recovery of high-level concepts can be avoided. An important additional aspect of progressive lowering in larger domain-specific compilers is that abstractions should seamlessly compose with each other to coexist in a single module while the lowering is applied selectively.

P3 Explicit Separation. Given the abstraction composability mandated by the previous principles, it is easier to combine individual pieces of the abstraction than to disentangle a complex representation. Our incarnation of the ubiquitous separation-of-concerns approach relies on the domain-relevant separation being explicit in at least some intermediate abstraction in our stack. In particular, performance-related aspects of the abstractions, such as the degree of parallelism or the memory footprints, should be present in the IR and should be modifiable separately from each other. Similarly, compile- and run-time aspects of the abstraction should be separated. In the longer term, such representations are better amenable to modern search techniques [beaugnon2017optimization, adams2019halideopt, gysi2019absinthe].

Level Concepts Transformations Sec.
- parallel stencil evaluation
- value semantic
- explicit data flow
- compile-time access offset
Stencil - compile-time domains - inlining (+CSE) - unrolling (+CSE) IV V-A V-B
Standard & - multi-dimensional storage
Affine & - affine index computation
SCF - parallel loop - loop mapping - loop to GPU V-C
- host/device code - GPU outlining
GPU - SIMT parallelism - host/device comp. VI
TABLE I: Domain-specific to device-specific abstractions.

The abstractions we use enable progressive lowering (P2) from domain-specific to device-specific concepts providing clear separation (P3) between levels. As listed in Table I, each level makes specific transformations easy to implement (P1). This multi-level representation also helps us separate optimizing transformations from the lowering between the levels that constitute a large portion of DSL compilers.

Iii The MLIR Infrastructure

MLIR is a recent production compiler infrastructure that is particularly well-suited for multi-level IR rewriting thanks to its extensibility through dialects and its built-in support for declarative rewrite patterns [lattner2020mlir]. The Open Earth Compiler can be thus implemented as a set of MLIR dialects, and transformations as rewrite patterns. Furthermore, we can readily reuse the Standard, Structured Control Flow, Affine, and LLVM IR dialects if we design our abstractions so that they compose with these.

Core MLIR concepts include operations, values, types, attributes, (basic) blocks, and regions. An operation is an atomic unit of program description. A value represents data at runtime and is always associated with a type known at compile time. Operations use values (but do not consume them) and define new values. Values can only be defined once, making the IR obey SSA form. A type holds compile-time information about a value, while attributes provide a way to attach compile-time information to operations. A block is a sequence of operations that, together with other blocks, connects to regions. A region is attached to an operation that defines its semantics. Non-trivial control flow is only allowed between an operation and the regions attached to it, and between the entry and exit points of blocks that belong to the same region. Specific operations define the structure of the control flow, for example, the last operation in a block (a terminator) can conditionally or unconditionally transfer the control flow to another block.

Fig. 2 illustrates the syntax for an example operation from our Stencil dialect. The stencil.apply operation uses a value %use and defines a value %def. Types and attributes annotate the operation with compile-time information such as the iteration domain. The nested region consist of a single basic block that implements computation performed by the operation using the basic block argument %arg. This hierarchical organization into blocks and regions enables infinite nesting.

Fig. 2: Example MLIR operation that sets 64x64x64 elements of the defined value %def to the negative of the value %use.

There is no fixed set of operations, attributes, or types. Instead, each MLIR user can define their own or reuse those defined by others. Even MLIR’s built-in functionality heavily relies on this extensibility. For example, a function is a regular operation with a region containing the function body instead of a concept on its own. Thus stencil computations can be made first-class in a stencil compiler by providing custom types, attributes, and operations. The same holds for control flow operations such as loops or data types such as multi-dimensional arrays. From an infrastructure point of view, custom operations and types are indistinguishable from common scalar operations and types.

A dialect is a set operations, attributes, and types designed to work together. There is no formal or technical restriction on how dialects are structured. Unless prescribed otherwise by the semantics of the operation, a region can contain operations from different dialects, and an operation can reference types and attributes defined by a different dialect. Therefore, new abstractions can be introduced into the MLIR ecosystem as new dialects.

Iv The Stencil Dialect

The Open Earth Compiler operates on weather and climate models. These models integrate partial differential equations forward in time commonly using either finite difference or finite volume discretization. They comprise dozens of

stencil programs (Sec. IV-D) consisting of multiple dependent stencil operators (Sec. IV-C) applied across regular or irregular grids. In this work, we consider regular three-dimensional grids that partition the space into cells, each of which with six neighbors. This regularity allows a cell to be addressed via a three-component index.

In stencil programs, optimizing individual stencils is often insufficient. Instead, chains of dependent stencils or entire programs must be optimized [gysi2015modesto], e.g., using producer-consumer fusion to obtain maximum performance. We design the Stencil dialect to represent stencil programs consisting of stencil operations connected between them and with input/output data structures through data flow, with optional control flow (Sec. IV-E), following the multi-level IR rewriting principles defined in Sec. II. Our dialect is explicitly decomposed (P3) into the high level, where we model the data flow between operators, and the low level, where we model the parallel execution of individual operators. The former enables flow rerouting transformations where a stencil operator can be seen as a unit (as opposed to lower-level IRs where a stencil is a collection of arithmetic instructions), while the latter supports parallelism exploitation (P1). The level separation also participates in progressive lowering (P2).

The dialect is not designed as a user-facing DSL, but as a compiler IR that supports transformations (P1&P3) by (a) keeping the stencil concepts high-level so they can be moved as a unit, (b) imposing no specific execution order so as to expose parallelism, and (c) using value-semantics instead of allocating storage objects to avoid costly buffer analysis.

Fig. 3: Example stencil program that evaluates a simple stencil on the array %in and stores the result to the array %out.

Iv-a Dialect Overview

The Stencil dialect focuses on concepts specific to stencils and relies on MLIR’s Standard dialect to express the actual computation (P2). Fig. 3 shows a stencil program that, for every point of a 64x64x64-element domain, adds the left and the right neighbor of the input array %in and stores the result to the output array %out. The “stencil” prefix identifies the operations and types from this dialect.

The dialect defines two types. A !stencil.field is a multi-dimensional array that stores an element for all points of the regular grid. Inputs and outputs of a stencil program have this type. A !stencil.temp is a multi-dimensional collection of elements on a hyper-rectangular subdomain of the regular grid. Temporaries have value semantics and are initially not backed by storage. Values of this type either point to a subdomain of an input array or keep the results computed by a stencil operator. Both types store single- or double-precision floating-point elements (f32 or f64) on a one-, two-, or three-dimensional domain.

The Stencil dialect also defines six operations. In the example, the stencil.assert operations specify the static shape of the arrays. The stencil.load operation takes the input array and returns a temporary that points to the input elements consumed by the subsequent stencil. The stencil.apply operation executes the stencil and defines a value that keeps the results of the computation. Its nested region implements the stencil operator for one point of the iteration domain. The stencil.access operations read the input temporary at a constant offset relative to the current position in the iteration domain, while the stencil.return operation sets the output at the current position. In between, we use the Standard dialect to sum the left and the right neighbor elements. The stencil.store operation finally stores the result of the computation to the output array. Its range attribute thereby specifies the domain written by the stencil program. Our compiler utilizes the output ranges to automatically infer the access ranges and iteration domains for the entire stencil program (cf. Sec. V-B).

Fig. 4: Example range (left) defined by an inclusive lower and an exclusive upper bound and stencil accesses (right) expressed relative to the current position (i=1, j=1).

Iv-B Shapes & Domains

The range notation is essential to specify stencil iteration domains and access ranges, especially given that stencils may be accessing inputs with indices that are outside of their computation domain, e.g., on boundaries.

Fig. 4 shows our range notation (left) for a two-dimensional domain. The origin denotes the lower bound of the computation domain and has all coordinates set to zero. Ranges are specified given the absolute coordinates of an inclusive lower bound and an exclusive upper bound separated by a colon.

On GPUs, integer index computations are a significant performance bottleneck. As real-world stencil programs often execute stencils repeatedly for the same problem size, it is desirable for the stencil dialect to support size specialization for just-in-time compilation. It does so by defining storage shapes and iteration domains as numeric compile-time attributes (P3).

Iv-C Stencil Operators

A stencil operator performs element-wise computations on all elements of a regular grid except for some constant-width boundary. It accesses the elements of input arrays at constant offsets relative to the coordinates of the output element.

The stencil.apply operation contains a region that implements the stencil operator in terms of scalar operations on domain elements. The scalar operations are applied to all domain elements as in a loop nest. Stencil operator inputs and outputs correspond to values used and defined by this operation. Inputs are assumed to not alias, and element-wise computations are assumed to be independent (P3).

Individual elements of inputs are accessed using the stencil.access operation that reads an element at a constant offset. The lowering of the Stencil dialect later adds the constant access offset to the index of the current iteration (cf. Sec. V-C). Fig. 4 shows the offset computation (right) for a two-dimensional stencil iteration domain. The region of the stencil operator must be terminated by a single stencil.return operation that accepts the value of the output element as an argument. Together, the stencil.access and stencil.return operations specify the memory access pattern of the stencil operator. Both of them are only valid as part of the stencil operator definition.

Real-world stencil programs from the weather and climate domain often implement dozens of dependent stencil operators. A stencil program thus needs additional means to orchestrate them.

Iv-D Stencil Programs

A stencil program executes a sequence of dependent stencil operators. It loads the data from the input arrays, implements the stencil operators inline, and stores the results to the output arrays. The SSA def-use graph of the program thus specifies the high-level data flow between the stencil operators (P2). Having both the high-level data flow and the inlined stencil operators in a single function facilitates code transformations across multiple stencil operators, eliminating any need for complex interprocedural analysis (P1).

Three additional operations are part of the program definition. The stencil.assert operation specifies the index range for an input or output array. A valid stencil program needs to define the index range for all input and output arrays. The stencil.load operation returns a temporary value that contains all input array elements accessed by dependent stencils. Conversely, the stencil.store operation stores the output of a stencil operator to the output array elements denoted by its range attribute.

Fig. 5: Stencil program that evaluates two dependent stencils.

Fig. 5 shows a stencil program that executes two dependent stencils. The stencil.load operation returns a temporary holding the accessed %in array elements. The second stencil operator uses the result of the first. In the end, the stencil.store operation stores the values computed by the second stencil to the %out array.

All stencil program parameters have to be alias-free and are either loaded from or stored to as a unit. Intermediate results are kept in values of type !stencil.temp that are not initially backed by storage and are thus also alias-free. Given the value semantics of !stencil.temp, the def-use graph encodes the data dependencies between the stencil operators (P1&P3).

Iv-E Control Flow

Real-world stencil applications are not limited to pure data flow semantics. We can use eager execution and just-in-time compilation to handle most of the control-flow at the program level and resort to MLIR’s built-in Structured Control Flow (SCF) dialect to implement dynamic control flow inside the stencil operators.

Fig. 6: The SCF dialect enables the implementation of control flow inside the stencil operator.

Fig. 6 shows a stencil that, depending on a flag, accesses one of two arguments. The scf.if operation conditionally executes either the “then” or the “else” region. In contrast to a regular if-else, the operation returns a result value that is set by the scf.yield operations. This representation makes the data flow explicit and maintains a single stencil.return operation per stencil. An alternative to the scf.if operation, is the select operation that chooses a value based on a condition. Supporting the scf.if operation requires no adaptation of our compiler (P2).

Our choice of the built-in MLIR SCF dialect exemplifies how progressive lowering, explicit separation, and composable abstractions enable reuse of compiler components in the multi-level IR rewriting scheme.

V Stencil Transformations

We distinguish three categories of transformations that work on the Stencil dialect: 1) performance optimizations, 2) transformations to prepare the lowering, and 3) the actual lowering.

V-a Optimizing Transformations

All optimizing transformations implemented for the Stencil dialect operate at a high-level and neither introduce explicit loops nor storage allocations (P1).

Fig. 7: Two patterns enable the iterative producer-consumer fusion for entire stencil programs. The def-use edges represent the data flow between the stencil operations.

The stencil inlining pass applies fusion on the def-use graph of the stencil program. In particular, we repeatedly apply a stencil specific variant of producer-consumer fusion that replaces all accesses to producer results by inline computation. If the consumer accesses the producer at multiple offsets, we thus perform redundant computation for every point in the iteration domain. Inlining stencils in an arbitrary order may introduce circular dependencies. An input of the consumer may, for example, depend on another stencil that transitively depends on an output of the producer stencil. Instead of developing an algorithm to fuse the stencils in a valid order, we implement patterns that match and rewrite small subgraphs and use MLIR’s greedy rewriter to apply them step-by-step.

Fig. 7 shows our inlining patterns. The inliningpattern matches a producer P and a consumer C if the producer has a single consumer. If the pattern matches, we remove the producer stencil and inline the computation into the consumer. Additionally, we update the argument and result lists of the fused stencils. The reroute pattern matches a producer P and its consumers C1 to CN. If the pattern matches, we route all outputs of the producer through the consumer that executes next. The red (dashed) arrows mark the rerouted data dependencies. The former pattern implements the actual inlining, while the latter pattern prepares an inlining step.

Our inlining implementation introduces redundant computation even if the consumer accesses the same offset multiple times and always inlines the entire producer even if only one of its outputs is accessed. Dead code elimination and common subexpression elimination later clean up the code. These transformations rely on the stencil accesses being side-effect free (the stencil inputs are immutable and do not alias with the outputs). Our compiler currently implements no fusion heuristic and continuous inlining as long as one of the patterns applies.

Fig. 8: Unrolling two iterations of the example stencil along the j-dimension.

The stencil unrolling pass replicates a stencil operator multiple times to update more than one grid point at once. Fig. 8 shows an unrolled version of our example program.

Unrolling is another example of a classical loop transformation implemented by our high-level dialect. Instead of transforming loops, our implementation annotates the high-level Stencil dialect and directly lowers to unrolled loops. In particular, we only modify the nested region attached to the stencil.apply operation but not its interface. Initially, we replicate the stencil computation once for every unrolled loop iteration and adjusts the access offsets. We also adapt the stencil.return operation to return the results of all unrolled loop iterations and annotate the unroll factor and dimension using an optional attribute.

The unrolling pass supports all unroll dimensions and unroll factors. Yet, the lowering is currently limited to unroll factors that divide the domain size evenly.

Inlining and unrolling improve the performance of stencil programs. Especially inlining reduces the off-chip data movement at the cost of introducing redundant computation. Unrolling can eliminate parts of the redundant computation since the unrolled stencil operator often evaluates the producer several times at the same offset. Instead of removing the redundant computation ourselves, we run the existing common subexpression elimination pass.

V-B Preparing the Lowering

After optimizing the stencil program, we infer all access ranges and iteration domains to prepare the lowering (P2).

Fig. 9: Conversion of the Stencil dialect to the MLIR SCF+Affine+Standard dialects that further lower to GPU abstractions.
Fig. 10: Lowering of an example kernel: 1) the inline form enables host device code motion and other transformations, 2) the function form isolates the device code in a distinct module that enables device-specific optimizations and separate host/device compilation, and 3) the binary embeds the kernel as constant data.

The shape inference pass derives the access ranges for the input arrays and stencil operators of the program. It is necessary since a stencil program only defines the output ranges written by the program. The pass then starts from these output ranges and follows the use-def chains that define the dependencies of the stencil program and transitively extends the access ranges.

Our algorithm walks all operations of the stencil program in reverse order and annotates the access ranges using optional range attributes (Sec. V-C shows the lowering of the annotated example program). We compute these ranges as the minimal bounding box that contains all access extents of operations that consume a value defined by the current operation. If the consumer is a stencil.load operation, its access extent is equal to the output range attribute. If the consumer is a stencil.apply operation, the access extent is equal to the iteration domain extended by a minimal bounding box that contains all stencil accesses of the consumed values. Once the access range of a stencil.load operation is known, we also verify the input array is large enough.

Although the access extent analysis seemingly contradicts the progressive lowering idea (P2), it does not aim at recovering information that has been there before. Instead, it automates the error-prone manual access range specification.

Shape inference enables the lowering and has no performance impact.

V-C Lowering to Explicit Loops

The stencil lowering applies conversion patterns to translate the individual stencil operations to their MLIR counterparts (P2). It is the last domain-specific part of our compilation pipeline, outlined in Fig. 1, that lowers our high-level stencil programs towards executable code.

Even at the Standard dialect level, MLIR provides rather high-level abstractions. The memref is a structured multi-dimensional buffer abstraction. It can have static or dynamic sizes, and an optional layout attribute defines the index computation if the layout diverges from the row-major format. This layout attribute also allows one to define strided hyper-rectangular views into a memory buffer, for example, with offsets and non-unit steps along each of the dimensions. Another example is the scf.

parallel operation that models a parallel multi-dimensional loop.

Fig. 9 illustrates the lowering from the stencil dialect level to the MLIR Standard dialect level for the example introduced in Sec. IV-A. We define six conversion patterns that introduce explicit loops, index computations, memory accesses, and temporary storage. After this lowering, detecting stencil operators or access offsets requires analysis. Implementing domain-specific transformations consequently becomes harder. In turn, by introducing loops and temporary storage, we settle to program execution order but still maintain the parallel semantics needed for the subsequent GPU lowering (P2).

Vi The GPU Dialect

Since GPUs remain a platform of choice to achieve high performance, we construct our multi-level compiler to target these devices. We designed following the principles defined in Sec. II and implemented the GPU dialect for MLIR to this end with the goal of abstracting the GPU execution model in a vendor-independent way. In particular, it generalizes MLIR’s NVVM, ROCm, and SPIR-V representations and thus separates unified platform-independent device mapping (P1) from platform-specific code generation (P3). The GPU dialect is not intended as a generic SIMT execution model (P3), nor as a raising target from lower-level abstractions (P2). The dialect exposes a set of GPU-specific concepts: hierarchical thread structure (blocks, threads, warps); synchronization through barriers; memory hierarchy (global, shared, private, constant memory); standard computational primitives such as parallel reductions. It is also designed to support separate host/device compilation in a single module (P3). The latter is made possible by MLIR modules recursively containing other modules that can be processed differently.

Fig. 10 shows the two forms of a kernel launch during the GPU lowering. The inline form uses the gpu.launch operation to define the kernel inline. A nested region implements the kernel, and basic block arguments provide access to the thread and block identifiers. Explicit parameter handling is not needed since the values defined outside of the nested region remain visible. The function form uses the gpu.func operation to implement the kernel as a separate function in a dedicated module launched by the gpu.launch_func operation that represents the kernel invocation. Special operations provide access to the thread and block identifiers. All non-constant kernel arguments are passed in explicitly, while the constants are propagated into the kernel functions. Both the inline and the function form accept a GPU grid configuration and support the declarative allocation of buffers in the different levels of the GPU memory hierarchy. The kernel code expresses the computation for a single thread, following the SIMT model, and specialized mechanisms provide access to the thread and block identifiers. Thereby GPU-specific primitives such as barrier synchronization, shuffles, and ballots are only available inside a kernel launch.

Additionally, Fig. 10 illustrates the main steps of the GPU lowering, starting from the inline form (left), through the function form (middle), down to the compiled binary (right). A parallel loop nest can be converted in-place to the inline form, using loop bounds as GPU grid configuration. After the conversion, we apply common subexpresison and dead code elimination, canonicalization and propagate constants inside the GPU kernel to minimize host/device memory traffic. Common SSA-based transformations apply seamlessly across the host/device boundary thanks to the kernel being inlined with no visibility restrictions. The kernel is then outlined into a separate function in a dedicated GPU module. Functions called by the kernel are copied into the module, and values defined outside the kernel are passed in as function arguments. This results in kernels living in a separate module to enable the separate host/device optimization and compilation. The kernel bodies are no longer visible to intra-procedural optimizations on the host code. The GPU module is finally converted through a dedicated dialect to a platform-specific representation (e.g., PTX), and using the vendor compiler (e.g., ptxas) compiles further down to a binary. The resulting binary is embedded as a global constant into the original module. This approach enables, e.g., multi-versioning to support multiple architectures or kernel specialization for different-sized workloads. The original module extended with the binary constants then becomes a regular host module, that can be optimized, compiled, and executed. The kernel invocations thereby lower into calls to the device driver library or runtime environment.

Vii Evaluation

We evaluate the Open Earth Compiler on real-world stencil programs derived from the most performance critical parts of the COSMO and FV3 climate models and compare its performance to state-of-the-art code generation frameworks.

Vii-a Experimental Setup

We run our experiments on an NVIDIA Tesla V100-SXM2 with a memory bandwidth of 900 GB/s. We use the CUDA toolkit 10.1 with driver version 435.21 and benchmark two domain sizes 128x128x60 (small) and 256x256x60 (large) for single-precision (f32) and double-precision (f64) floating-point numbers. We further measure the execution times on an AMD Radeon RX 5700 with a memory bandwidth of 448 GB/s. Due to the low double-precision peak performance of the gaming-oriented hardware, we omit the double-precision results for the AMD system. For all benchmarks, we report the median runtime of 100 measurements, and red error bars show the quartile runtime to quantify the measurement error. We do not time the initial kernel executions to avoid startup overheads such as host to device data copies. Additionally, we use the nvprof profiler to collect detailed performance data on the NVIDIA system. We guarantee correctness by comparing the outputs of all optimized kernel variants to naive C versions and ensure the results are within a relative error of

for single-precision (f32) and for double-precision (f64) floating-point numbers.

Name Dims Apply Inputs/ Arith. Access Control
Ops Outputs Ops Ops Flow
COSMO
hdiffsa 2 4 4 / 1 21 22 min
hdiffsmag 2 6 8 / 2 56 38 min/max
hadvuv 2 8 6 / 2 80 45 if
hadvuv5th 2 8 6 / 2 112 53 if
fastwavesuv 3 6 9 / 2 43 32 -
FV3
p_grad_c 3 3 7 / 2 24 25 -
nh_p_grad 3 5 8 / 2 47 48 -
uvbke 2 2 4 / 2 12 12 -
fvtp2d_qi 2 5 5 / 2 27 23 if
fvtp2d_qj 2 8 6 / 3 49 39 if
fvtp2d_flux 2 5 7 / 2 28 22 if
TABLE II: Characteristics of our benchmarks.

Vii-B Benchmark Kernels

We evaluate our compiler for a set of representative benchmarks222https://github.com/spcl/open-earth-benchmarks derived from the dynamical cores of two popular climate and weather models. COSMO [cosmoorg] is a regional numerical weather prediction model used by seven national weather services in Europe. FV3 [fv3] is the dynamical core of the CM4 and GEOS-5 global climate models and the global weather prediction system of the US National Weather Service. The dominant algorithmic motif of both codes are stencil computations on regular grids. Both models implement dozens of stencil operators to perform the numerical forward integration in time. Due to the explicit time integration, most stencils are purely horizontal with bounded domains of dependence. Some use the Thomas algorithm to perform implicit integration in the vertical direction.

All our benchmarks are part of the explicit integration. The hadvuv and hadvuv5th kernels implement third- and fifth-order horizontal advection, while the fvtp2d kernels implement a monotone two-dimensional finite volume advection operator. The hdiffsa and hdiffsmag kernels perform horizontal diffusion. The fastwavesuv kernel contains parts of the sound wave forward integration, while the p_grad_c and nh_p_grad kernels compute the three-dimensional pressure gradient. Finally, the uvbke kernel is a preprocessing step for the kinetic energy computation in FV3.

Each benchmark executes an entire stencil program consisting of multiple stencil operators being applied on the three-dimensional domain. The stencil operators have different dimensionality (from one- to three-dimensional), have different width (two- to five-point), and some of them contain dynamic control flow. Table II list core characteristics of our benchmarks such as the dimensionality of the stencil access patterns or the number of stencil operators, input/output arrays, arithmetic operations, and stencil.access operations. We observe that all kernels have a low arithmetic intensity (arithmetic operations per memory access), which explains why our compiler focuses on transformations to increase the data-locality.

Fig. 11: Runtimes at different optimization levels for f32 (top) and f64 (bottom) floating-point values for 256x256x60 on the NVIDIA system.
Fig. 12: Runtimes at different optimization levels for f32 floating-point values for 128x128x60 (top) and 256x256x60 (bottom) on the AMD system.

Vii-C Effectiveness of our Code Transformations

We first evaluate the effectiveness of the code transformations discussed in Sec. V-A. In total, we compare different optimization levels: 1) original, 2) inline, 3) inline+unroll(2), and 4) inline+unroll(4). Optimization level one applies no optimizing transformations (cf. Sec. V-A). Starting from optimization level two we apply stencil inlining, and the optimization levels three and four additionally perform stencil unrolling by factor two and four, respectively.

Fig. 11 and Fig. 12 compare the runtime for all benchmarks at different optimization levels on the NVIDIA and the AMD system, respectively. We choose these platforms to demonstrate our device-specific code generation for two entirely different instruction set architectures and runtime environments. We show data for both problem sizes (AMD) and f32 and f64 arithmetic (NVIDIA) and observe significant speedups for stencil inlining independent of benchmark and system. In comparison, stencil unrolling has a smaller effect and sometimes is even detrimental to performance. In the plots, we annotate the speedup and the runtime of the best performing version.

Fig. 13: Number of float instructions (top), integer instructions (middle), and the device memory transactions (bottom) executed at different optimization levels in comparison to COSMO and FV3 reference implementations (cf. Sec. VII-D) for f64 floating-point values and 256x256x60 (these measurements are profiling results collected using nvprof).
Fig. 14: Utilization of the peak compute throughput (top) and the peak memory bandwidth (bottom) for the best performing kernel variants in percent (these measurements are profiling results collected using nvprof).

Inlining all stencil operators eliminates the accesses of temporary buffers, and the program inputs are loaded precisely once. We thus expect stencil inlining to have a strong effect on performance due to the resulting bandwidth reduction. Fig. 13 confirms that stencil inlining reduces the data movement between the device memory and the L2 cache (bottom). At the same time, inlining introduces redundant computation. Fig. 13 quantifies this extra computation (top) but also shows that stencil inlining eliminates parts of the index computation (middle). Overall the bandwidth reduction and the eliminated index computation overcompensate the additional redundant computation, especially considering the low arithmetic intensity of our kernels.

Stencil unrolling removes redundant computations and improves the data-locality in case the data accesses of the unrolled loop iterations overlap. Fig. 13 confirms that stencil unrolling reduces both redundant computation (top) and index computation (middle). We also observe less device memory transactions (bottom) for the three-dimensional fastwavesuv, nh_p_grad, and p_grad_c kernels due to improved data-locality. At the same time, stencil unrolling increases the register pressure and reduces the available parallelism. As a consequence, we do not expect all stencil programs to benefit from stencil unrolling. Instead, the optimization’s effectiveness depends on the complexity of the stencil program (register pressure), the available parallelism, and the potential bandwidth reduction.

Fig. 14 illustrates the memory bandwidths and compute throughputs achieved by the best performing kernel versions. We observe very high memory bandwidth utilization up to 86% and low compute utilization below 22%. These results demonstrate the importance of aggressive stencil inlining and confirm the redundant computation is less of a concern.

In summary, we show that our code transformations yield significant speedups. We attain comparable speedups and bandwidth rectified performance on both the NIVIDA and the AMD system despite their entirely different instruction set architecture and runtime environment, demonstrating the effectiveness of our vendor-independent GPU execution model introduced in Sec. VI. Selecting the optimal unroll factor or finding good fusion choices for a specific benchmark and target system combination is not the scope of our work. Instead, we employ empirical tuning to find the best unroll factor and always fuse all stencil operators (optimizing larger stencil programs will require a fusion heuristic).

Vii-D Comparison to State-of-The-Art Solutions

We now compare the runtime of the kernels optimized by our compiler to Stella [gysi2015stella] and Dawn [osuna2020dawn] generated CUDA [cuda2008] implementations. Stella is a C++ embedded DSL used to run the GPU version of COSMO in production. Dawn is a research compiler that lowers high-level stencil programs to efficient CUDA code. Both implement overlapped tiling [holewinski2012overlapped] using shared memory and stream data [rawat2016streaming] in registers along the k-dimension. This execution model limits the redundant computation to the tile boundaries. In comparison, our stencil inlining plus unrolling performs additional redundant computation at the thread level but requires no thread synchronization during the kernel execution.

Fig. 15: Speedup of our compiler over Stella (COSMO) and Dawn (FV3) for 128x128x60 (top) and 256x256x60 (bottom) on the NVIDIA system.

Fig. 15 compares for all benchmarks the best performing variant generated by the Open Earth Compiler to the Stella (COSMO) and Dawn (FV3) counterparts. We outperform the state-of-the-art and obtain speedups that range from 1.4x to 1.6x depending on problem size and precision. We thus perform better for all configurations especially for the smaller problem size and single-precision arithmetic. The speedups are similar, with the exception of the fastwavesuv kernel. Stella implements the fastwavesuv kernel using a sequential loop in the k-dimension, which results in suboptimal performance for the smaller problem sizes.

We attribute the performance of our compiler to the simple execution model. It fuses all stencil operators and stores temporaries in registers to limit the data movement, and it introduces redundant computation instead of thread synchronizations to avoid parallelization overheads. Its only disadvantage is the redundant computation, which due to the low arithmetic intensity of our kernels shown in Fig. 14, is less critical. In Fig. 13, we indeed observe lower device memory bandwidth requirements (bottom) and notable amounts of redundant computation (top) compared to Stella (COSMO) and Dawn (FV3) reference implementations. Having compile-time information about storage shapes and iteration domains in return eliminates parts of the index computation (middle). Avoiding synchronizations at the cost of additional computation and leveraging compile-time information thus turn out to be beneficial compared to the Stella and Dawn execution models.

Our compiler, despite its simplicity, outperforms Stella and Dawn on raw stencil programs. The results demonstrate the quality of our code generation and the potential of stencil inlining plus unrolling compared to overlapped tiling and streaming, the standard solution in the field.

Fig. 16: GT4Py version of the example stencil program.

Vii-E Lowering User-facing Code to the Stencil Dialect

We design our Stencil dialect as a target for user-facing domain-specific languages. To study the feasibility of such a lowering, we integrate our compiler with GridTools for Python (GT4Py) [gt4pygithub], an embedded domain-specific language for weather and climate.

In Fig. 16, we show a GT4Py version of the example stencil program introduced in Sec. IV-A. GT4Py lowers the user code to an internal IR, consisting of a compute domain, data field descriptors, and a set of computations in the form of an abstract syntax tree (AST). We traverse the AST to emit MLIR, compile the resulting MLIR program to a binary, and produce Python bindings to link the generated binary to the calling program.

The functional lowering demonstrates our compiler’s applicability in the context of an end-to-end solution for the weather and climate domain.

Viii Related Work

Accelerated systems made programming model innovations inevitable. Kokkos [edwards2013kokkos] and Raja [rajagithub] are C++ performance portability layers. PENCIL [baghdadi2015pencil] and Polly-ACC [grosser2016pollyacc] automate the accelerator mapping using the polyhedral model. DaCe [bennun2019dace] allows performance engineers to select and develop target-specific transformations. All approaches are generic and, for the same level of performance and automation, solve a more complex problem than a domain-specific compiler.

Machine learning today drives the development of domain-specific compilers [leary2017xla, tianqi2018tvm]. The development of stencil compilers started even earlier: Halide [ragan2013halide] and Polymage [mullapudi2015polymage] tune image processing pipelines, Pochoir [tang2011pochoir] implements cache-oblivious tiling, SDSLc [rawat2015SDSLc] supports many targets (SIMD, GPU, and FPGA), Panda [sourouri2017panda] supports distributed memory, and YASK [yask2016] specifically targets Intel processors. Lift [hagedorn2018liftstencils] has also been shown effective for stencil codes. The variety of different solutions demonstrates the importance of a shared compiler infrastructure.

Multiple projects work on solutions for weather and climate. The CLIMA [climagithub] effort develops a novel earth system model using the Julia language. The LFRic [adams2019lfric] climate modeling system relies on the Python-based PSyclone compiler. Stella [gysi2015stella] and GridTools [gridtoolsgithub] use C++ template metaprogramming to support CPU and GPU systems. CLAW [clement2018claw] and Hybrid Fortran [mueller2018hybridfortran] extend Fortran to achieve performance portability. Despite their heterogeneity, all of these approaches could benefit from a shared compiler infrastructure.

Several frameworks support the development of domain-specific compilers. AnyDSL [leissa2018anydsl] supports partial evaluation using minimal annotations in the Impala front end language. Lightweight modular staging [rompf2010staging] is a technique that uses Scala’s type system to transform codes before their execution. It forms the basis of the Delite [sujeeth2014delite] compiler framework. Lua script similarly supports staging via the Terra [devito2013terra] low-level language. Lift [steuwer2017lift] finally combines a functional language and rewrite rules to generate performance portable code. MLIR [lattner2020mlir] is the only full-fledged compiler infrastructure among these contenders, not limited in terms of optimizations, and not tied to a particular front end language.

Stencil optimizations for GPU targets are a well-researched topic. Tiling [nguyen2010blocking, holewinski2012overlapped, maruyama2014optimizing, grosser2014hybrid, grosser2014split, matsumura2020an5d] and fusion [rawat2018fusion, gysi2015modesto, wahib2014fusion] are the core optimizations for bandwidth-limited low-order stencils as they appear in weather and climate. Other works optimize the resource utilization [rawat2018regopt, rawat2016streaming] or discuss the optimization of high-order stencils [zhao2019stencil, rawat2015SDSLc]. Our compiler implements a variant of overlapped tiling [holewinski2012overlapped] that introduces redundant computation for every thread.

Ix Conclusion

We introduced multi-level IR rewriting, an approach to building reusable components for domain-specific compilers. This approach is illustrated through the design and implementation of the Open Earth Compiler, which provides a high-performance compilation flow for weather and climate modeling. We demonstrated that thanks to multi-level IR rewriting, a small yet self-consistent set of high-level operations specifically designed for stencil computations is sufficient to achieve better performance than state-of-the-art DSL compilers. Contrary to the latter, the Open Earth Compiler relies on existing and new reusable compiler abstractions, including the GPU kernel abstraction we introduced, by decoupling domain-specific and target-specific code transformations. Our evaluation of eleven stencil programs relevant to existing climate models, COSMO and FV3, demonstrates that the Open Earth Compiler generates code that is up to 3.2x faster than state-of-the-art solutions and delivers a geomean speedup between 1.4x and 1.6x across problem sizes and precisions. We suggest that multi-level IR rewriting and the associated design principles is a promising approach to rapidly design and deploy domain-specific compilers that can take advantage of reusable components of the MLIR ecosystem.

Acknowledgements

We thank Jean-Michel Gorius for his foundational stencil compiler work and the continuous support of our project. We appreciate the assistance of Carlos Osuna and the MeteoSwiss compiler team throughout the project. Finally, we thank Hannes Vogt for providing the initial out-of-tree development setup and Felix Thaler for sharing performance results that inspired our work. This project has received funding from a Huawei donation and through the Swiss National Science Foundation under the Ambizione programme (grant PZ00P2168016) as well as ARM Holdings plc and Xilinx Inc in the context of Polly Labs.

References