High-Performance Code Generation through Fusion and Vectorization

10/24/2017
by Jason Sewall, et al.
Intel

Abstract

We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.

We introduce representations for the abstract relationships and data dependencies of kernels in loop nests and algorithms for manipulating them into more efficient form; we similarly introduce techniques for determining data access patterns for stencil-like array accesses and show how this can be used to elide storage and improve vectorization.

We discuss our prototype implementation of these ideas—named HFAV—and its use of a declarative, inference-based front-end to drive transformations, and we present results for some prominent codes in HPC.

1 Introduction

Hardware evolves, and languages can evolve with it, but codes remain static unless rewritten. This article presents a method for automatically generating stencil-like codes—based on existing code and simple rules—that run well on modern hardware.

1.1 Complex solutions, simple languages

The longevity of many programming languages has been both a boon and a hindrance to computing. Fortran, famously the first high-level programming language, has been in continuous use since 1957. C and C++ first appeared in 1972 and 1983, respectively. Billions of lines of code have been written in these languages, providing a huge library of code for posterity, and their longevity means that developers have had many years in which to become familiar with them.

These languages were developed on and for hardware long deemed obsolete, in computing environments that would be nearly unrecognizable to most programmers today. New ideas about programming have come (and some have gone) during their lifetimes. All of this is to say that the sort of programs that run best on modern hardware, in modern computing environments, can be difficult or impossible to express in old languages. To address this, older languages are revised and updated; Fortran has seen major revisions roughly every decade, as has C, and C++ seems to be on a three- or four-year cadence.

The modernization of these languages can help address many problems, but it largely cannot retroactively, automatically modernize older codes. That mass of ‘legacy’ code has inertia and—if the analogy may be stretched a little more—perhaps even a gravitational field; new codes related to a long-lived project may be developed in an older variant of a language to maintain compatibility.

Of course, writing a code from scratch in a modern language doesn’t guarantee a straightforward, simple solution. The complexity of modern hardware is such that complex software solutions are sometimes necessary to achieve the highest level of performance. Compilers have become extremely powerful and remain essential tools for developing efficient codes; even so, they are not always capable of transforming user code into the most efficient implementation for a given machine. This is where intermediate tools—optimization frameworks, libraries, domain-specific languages (DSLs) and source-to-source techniques like the one we present here—can be most useful.

1.2 Contributions

We present a method for generating efficient computations on regular grids based on user-specified code and rules. At its core, our technique aggressively fuses user-provided loops—potentially nested—to maximize locality and re-use of data; it is also able to reduce or eliminate storage of intermediate values, and it is capable of further transformations to enable higher-performance vectorization.

We have developed a prototype tool named High-performance Fusion And Vectorization (HFAV) that implements the techniques described in this article that accepts declarative input in the form of initial ‘axioms’, terminal ‘goals’, and a collection of production rules that describe individual kernels in the code and their data dependencies. HFAV is then able to perform inference on this logical system to discover the dimensionality and dataflow of the program, to which the above techniques are applied to produce generated code that invokes the user-specified kernels. This is amenable to integration with an outer level of thread-level or SPMD-level parallelism, and our prototype has complete backends in C and C++ with preliminary support for Fortran.

This prototype has been open-sourced and released; see Section 4 for details.

2 Related Work

Transforming user code into a form that is highly tuned for a specific hardware architecture is an active area of research. This is especially true in the domain of stencil codes, where operators and data access patterns are simple to define and characterize, but difficult to map to hardware such that they run efficiently (i.e. accounting for node/thread/vector parallelism and all levels of the memory hierarchy). As a result, the developers of stencil codes have a variety of actively developed DSLs and tools to choose from—such as STELLA [8], Pochoir [15], PATUS [4], OPS [12], PLuTo [3], and YASK [16]—which are capable of automating parallelization, cache blocking, vectorization and changes in data layout. Our technique employs optimizations that are orthogonal to those implemented in existing stencil frameworks, and assumes that something else (i.e. the developer or another tool) is responsible for parallelization and data layout; we believe that the optimizations presented in this article are compatible with existing frameworks, but exploration of this is beyond the scope of this work.

The loop fusion and storage/bandwidth reduction optimizations from our technique are similar to some that can be found in other research, but they differ in their implementation. Compiler passes that perform these optimizations automatically have been implemented [9, 7], but their usage is limited to cases in which the compiler is able to recognize that the optimization is both valid and beneficial. Both the OPS framework [13] and the loop chain directives developed by Bertolacci et al. [2] sidestep this problem by fusing loops based on user-provided data dependency information, but neither specifically targets the optimization or removal of intermediate storage locations. The “storage folding” optimization in Halide [11] is most similar to that discussed here, with the key difference being our technique’s specific consideration for efficient vectorization.

To the best of our knowledge, the use of an inference system to discover and satisfy data dependencies that is found in our prototype implementation is novel. However, the use of inference systems in general is not novel in this space. One example is the inference system used by PetaBricks [1] to construct algorithms from device- and context-specific building blocks. HFAV’s inference system is currently much simpler than PetaBricks’—allowing only one function to produce a given output—and would likely need to be extended if our implementation were to support a wider range of architectures.

3 Design

Some terminology we use in this article is informed by the declarative, rule-based front-end found in our prototype implementation. While this front-end is useful, the fusion and transformation techniques at the core of this method do not depend on it; we describe our prototype and its front-end, as well as other front-ends in progress, in Section 4.

To help convey the key concepts of our technique, we employ various example codes and demonstrate their transformation process through the code generator. The 5-point Laplace stencil is one such example; this is a common finite-difference discretization of the Laplace operator in two spatial dimensions. Figure 1 shows the Laplace stencil used in a successive over-relaxation (SOR) method.

/* om: the SOR relaxation factor. It is assumed to be defined
   elsewhere in the original code; we declare it here so the
   listing is self-contained. */
static const float om = 1.0f;

void laplace5(float n, float e, float s, float w,
              float c, float* out)
{
  *out = (1.0f-om) * c + om/4.0f*(n + e + s + w);
}
void apply_laplace5(int N, const float in[][N+2],
                    float out[][N+2])
{
  for (int j = 1; j < N+1; ++j)
    for (int i = 1; i < N+1; ++i)
      laplace5(in[j-1][i], in[j][i+1], in[j+1][i],
               in[j][i-1], in[j][i], &out[j][i]);
}
Figure 1: Simple code for the 5-point Laplace stencil

We also use a code that performs one-dimensional flux differences on a system of equations that needs to normalize each computed flux vector. This example is greatly simplified from real-world hydrodynamics codes, but shows important data flow concepts.

3.1 Fundamentals

Our technique operates on codes that apply homogeneous computations over iteration spaces—multidimensional outer products of series of equally-spaced integers. In imperative terms, we operate on codes that call kernels in successive nested ‘simple’ loops. (‘Simple’ here means that each loop advances a single integral counter by a fixed integral stride, and that the number of trips must not depend on other enclosing or enclosed loops or kernel computations.) We reference the dimensions of this iteration space with familiar loop labels such as i, j, and k.

While the computations found in these nested loops need not be independent of one another—either within a loop nest or across them—our transformations require that the code be agnostic to the order in which a given kernel is applied; this freedom to re-order computations is critical in transforming the code. This implies that we do not allow kernels to have ‘side-effects’ (i.e. effects on global program state).

Kernels are described in terms of their data accesses/outputs against a canonical frame of reference. So, while the Laplace code shown above accesses a 2D data grid of size (N+2)×(N+2), it is convenient to talk about it in a translation-free sense as an operator using relative offsets. This is in fact very much the way the invocation of laplace5 appears in Figure 1.

When performing fusion analysis, we impose a user-selected global loop ordering on the code: while kernels can be expressed in an order-free listing of iteration space depth (e.g., j, i or i, j), ultimately they are arranged to match the global order to enable fusion.

Our transformation technique follows the following steps:

  1. Form dataflow graph

  2. Build iteration nests

  3. Build iteration nest graph

  4. Fuse iteration nest graph

  5. Analyze variables
    (in-outs, reuse, contraction, vectorization)

  6. Emit code

These steps are described in detail below.

3.2 Dataflow

We begin with a dataflow graph provided by the front-end; this is a directed acyclic graph (DAG) with kernel callsites as vertices and intermediate variables forming the edges. Such a graph for the Laplace example is shown in Figure 2. Note that we have added pseudo-kernels such as ‘load’ and ‘store’ to handle terminal references.

Figure 2: Dataflow DAG for the 5-point Laplace stencil.
Figure 3: Dataflow DAG for the normalization example.

A significant amount of information may be gleaned from this graph; in addition to informing the eventual code generation (through a topological traversal of this graph), we can determine the iteration space for each kernel callsite by taking the union of all iteration spaces found on incident variables.

To enable fusion and other transformations, we will use this graph to build a data structure more suited to the task. First, we introduce a building block.

3.2.1 Iteration nests

An iteration space describes all trips taken by a set of tightly nested loops, such as the loops found in Figure 1 where there is no code between the two for statements. This is a useful construct, but it cannot describe some iteration patterns that are useful to manipulate.

A common pattern in performing iteration is to perform some work before a loop (the prologue), the main loop body itself (the steady-state), and to perform some work after the loop (the epilogue). We refer to these steps as the phases of the iteration nest. A single iteration nest has an associated identifier, range, and stride that are used to generate code and to determine compatibility during fusion.

The prologue can be omitted if the kernel invocations and iteration nests therein are the same as those in the steady-state; the same applies to the epilogue; and when all phases are identical, the ‘perfect’ iteration nest directly corresponds to an iteration space. Note that it is possible to permute the nesting of a perfect iteration nest, which is equivalent to permuting the order of loops; this is useful for imposing a global loop nesting order on kernel callsites.

The general form of an iteration nest with distinct phases and recursive structure (i.e. other iteration nests in one or more of those bodies) takes the form of a 3-ary tree. It is sometimes useful to refer to the maximum depth of a given tree when performing fusion.
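A minimal sketch of such a node (our naming, not HFAV’s internal representation) illustrates the 3-ary tree structure:

#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Callsite {                       // a kernel invocation (leaf work item)
  std::string kernel;
  std::vector<std::string> args;
};

struct IterationNest {
  std::string ident;                    // loop identifier, e.g. "j"
  long lo, hi, stride;                  // range and stride of the loop
  struct Phase {                        // prologue / steady-state / epilogue
    std::vector<Callsite> callsites;                    // work in this phase
    std::vector<std::unique_ptr<IterationNest>> nests;  // recursive structure
  } prologue, steady_state, epilogue;

  int depth() const {                   // maximum depth, used during fusion
    int d = 0;
    for (const Phase* p : {&prologue, &steady_state, &epilogue})
      for (const auto& n : p->nests) d = std::max(d, n->depth());
    return 1 + d;
  }
};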

3.2.2 Iteration nest DAG

The dataflow graph’s structure determines the iteration space for each kernel callsite (see Section 3.2); this can be used to initialize a perfect iteration nest that we will manipulate in later stages. We form the initial iteration nest DAG by first aggregating certain kernel callsites (see ‘Grouping’, below) by merging vertices, and then creating the aforementioned perfect iteration nests from those groups, placing the callsites in the steady-states of the innermost nests. See Figure 4 for the initial iteration nest DAG from the normalization example.

Grouping

While it is possible to construct the iteration nest DAG from the dataflow DAG directly (without merging any vertices) it makes sense to combine related kernel callsites. Specifically, we examine the kernel callsites and group them based on: (1) matching kernel names, and (2) matching parameter lists. A parameter list is said to match if the parameters are identical in value and position except for the spatial displacements. So {load(cell[i][j]), load(cell[i+1][j]), load(cell[i-1][j]), load(cell[i][j-1]), load(cell[i][j+1])}—differing only in the displacements in i and j—are grouped together, while {load(cell[i][j]), load(aux[i+1][j])} would not be.
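A sketch of this grouping rule follows; the regex-based normalization is our illustration of the matching criterion, not HFAV’s actual mechanism:

#include <regex>
#include <string>
#include <vector>

// Build a grouping key from the kernel name and a parameter list with
// spatial displacements erased, so load(cell[i+1][j]) and
// load(cell[i-1][j]) collide while load(aux[i+1][j]) does not.
std::string group_key(const std::string& kernel,
                      const std::vector<std::string>& params) {
  static const std::regex disp("([A-Za-z_]\\w*)\\s*[+-]\\s*\\d+");
  std::string key = kernel;
  for (const auto& p : params)
    key += "|" + std::regex_replace(p, disp, "$1");   // [i+1] -> [i]
  return key;    // callsites with equal keys are merged into one vertex
}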

Figure 4: Initial iteration nest for the normalization example.

3.3 Fusion

Once the iteration nest DAG is constructed, we can begin to fuse the iteration nests located at vertices. Fusion has two processes: (1) an outer level where certain pairs of vertices of the iteration nest DAG are examined to determine if they may be fused; and (2) an inner level where, if fusion is to occur, the two iteration nests at said vertices are fused.

3.3.1 Fusing the iteration nest DAG

The outer level of this process is a traversal of the iteration nest DAG that maintains a ‘fusing’ vertex (which may be the fused aggregate of many of the vertices of the original iteration nest DAG); fusion is attempted across each incoming edge, which eventually results in the exhaustion of incoming edges or an unfusable split (see below). Pseudocode for this procedure is given in Figure 5.

procedure fuse_inest_dag(inest_dag)
  for vert in topo_sort_vertices(inest_dag)
    for iedge in incoming_edge_list(vert)
      fuse_result = fuse_inest(vert,
                               incoming_term(iedge))
      if fuse_result == unfusable
        for edge in sep_edges(incoming_term(iedge),
                              vert)
          split(edge)
      else
        vert = fuse_result
Figure 5: Algorithm for iteration nest DAG fusion

The traversal must occur in a topological order to ensure that we do not introduce cycles when fusing vertices. When an edge is identified as unfusable, we identify all vertices reachable from the candidate vertex and cut the iteration nest DAG along the separating edges between that subgraph and its complement.

The results of fusion applied to the iteration nest DAG from Figure 4 are shown in Figure 6. Here, the entire iteration nest DAG fused into a single iteration nest.

Figure 6: Fused iteration nest for the normalization example.

3.3.2 Fusing iteration nests

The fuse_inest() procedure invoked from Figure 5 handles the details of fusing iteration nests.

Rank ordering

First, recall (see Section 3.1) that we have established a global ranking of loop identifiers (e.g., a triple-nested loop might have identifiers ordered as k, j, i: k is rank 2 and outermost, j is rank 1, and i is rank 0 and innermost). By construction, the identifiers within iteration nests obey this order, and each iteration nest has an irank equal to the rank of its outermost identifier—e.g., following the previous example, a double-nested iteration nest over j and i would have irank 1, and a double-nested iteration nest over k and j would have irank 2.

Dataflow ordering

Second, recall that each iteration nest contains—in its leaves—nodes from the dataflow DAG. For any two iteration nests, or phases within iteration nests, we can examine the relative ordering of the dataflow DAG subgraphs induced by these iteration nests. Given two such subgraphs A and B of the dataflow DAG G, we can ask if A ≤ B, which asks whether each node of A can be topologically ordered before each node of B in G. For two such subgraphs, there are four possibilities (a sketch of the ≤ test follows the list):

  1. A ≤ B = true and B ≤ A = true: A and B are independent of one another in G and may be relatively ordered in any fashion.

  2. A ≤ B = true and B ≤ A = false: A must be ordered before B.

  3. A ≤ B = false and B ≤ A = true: B must be ordered before A.

  4. A ≤ B = false and B ≤ A = false: A and B have a cycle between them and cannot be ordered; even though G is acyclic by definition, subgraphs of it need not maintain that property.
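A reachability-based sketch of the ≤ test (the graph representation is ours; A and B are assumed to be disjoint node sets):

#include <queue>
#include <unordered_set>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists; edges point downstream

// dataflow_le(g, A, B): true iff every node of A can be topologically
// ordered before every node of B, i.e. no node of A is reachable from
// any node of B.
bool dataflow_le(const Graph& g, const std::vector<int>& A,
                 const std::vector<int>& B) {
  std::unordered_set<int> in_a(A.begin(), A.end());
  std::unordered_set<int> seen;
  std::queue<int> work;
  for (int b : B) work.push(b);
  while (!work.empty()) {                   // BFS downstream from B
    int v = work.front(); work.pop();
    if (!seen.insert(v).second) continue;   // already visited
    if (in_a.count(v)) return false;        // B reaches A: A cannot come first
    for (int w : g[v]) work.push(w);
  }
  return true;
}

Comparing dataflow_le(g, A, B) with dataflow_le(g, B, A) distinguishes the four cases above.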

Fusion operates on pairs of iteration nests (we shall say A and B) using these concepts of rank ordering and dataflow ordering, and is applied recursively—a natural consequence of the recursive nature of iteration nests. First, we compare the rank order: irank(A) and irank(B); this determines how kernels in the phases of A and B may be combined. When the ranks match (i.e., the outermost identifiers are the same), we recursively fuse each phase of A and B directly (i.e. we fuse the prologue of A to the prologue of B, we fuse the steady-state of A to the steady-state of B, and we fuse the epilogue of A to the epilogue of B), subject to the ability to find a compatible dataflow order, as described above.

When the ranks of A and B differ, we fuse the lower-ranked iteration nest’s phases into the higher-ranked iteration nest’s prologue or epilogue, depending on the dataflow ordering. In any case, if we are unable to establish an order between iteration nests, we have identified the need to split. Pseudocode for iteration nest fusion is given in Figure 7.

procedure fuse_inest(A, B)
  // irank() gives the rank of the outermost loop
  // Here, it is used to (try to) determine a nesting
  // order
  diff = irank(A) - irank(B)
  if diff == 0
    // dataflow_le() returns 'true' if the second
    //  argument can be topologically ordered after
    //  the first
    // {prlg,eplg}_only() return the kernel callsites
    //  of all children in the {prologue,epilogue}
    //  minus those in the steady_state()
    if dataflow_le(prlg_only(A), steady(B)) and
       dataflow_le(prlg_only(B), steady(A)) and
       dataflow_le(steady(A), eplg_only(B)) and
       dataflow_le(steady(B), eplg_only(A))
       // inest(iter_ident, [prlg, steady, eplg])
       // creates a new iteration nest with the given
       // identifier and phases
       return inest(iident(A),
                    [fuse_inest(prlg(A), prlg(B)),
                     fuse_inest(steady(A), steady(B)),
                     fuse_inest(eplg(A), eplg(B))])
    else return unfusable
  else if diff < 0:
    swap(A, B)
  before? = dataflow_le(all_phases(A), steady(B))
  after?  = dataflow_le(eplg(B), all_phases(A))
  if before?
    // if after? is also true, fusion order is ambiguous
    return inest(iident(A),
                 [fuse_inest(all_phases(A), prlg(B)),
                  steady(B),
                  eplg(B)])
  else if after?
    return inest(iident(A),
                 [prlg(B),
                  steady(B),
                  fuse_inest(all_phases(A), eplg(B))])
  else return unfusable
Figure 7: Algorithm for recursively fusing two iteration nests. Invoked from fuse_inest_dag in Figure 5.

3.4 Reductions & broadcasts & splits

Most real codes are relatively simple to fuse: for example, a 7-point Laplace stencil might consume three-dimensional data from different offsets and produce three-dimensional data. Flux differencing codes consume adjacent pairs of three-dimensional data to compute fluxes, then those fluxes are used in pairs or triples for limiting and integration into a three-dimensional output. When the inputs and outputs to kernels are all of the same dimension, their loops don’t require special handling.

However, there are real use cases where a kernel might consume one-dimensional data to produce three-dimensional data, or vice versa. These require special consideration. We refer to kernels that transform lower-dimensional data to higher-dimensional data as broadcasts and to those that transform higher-dimensional data to lower-dimensional data as reductions.

Broadcasts can be handled by fusing the producer of the lower-dimensional data into the prologue of one of the consumers’ iteration nests.

Reductions require more consideration because they involve many writes to the same data; because of our iteration-order-agnostic assumptions, valid reduction kernels must implement associative operators. They are typically found in a three-step pattern spread across three kernels: first, an initialization kernel sets the accumulation data to the identity of the associative operator, then the associative kernel accumulates over that data, and last a finalization kernel is run to finish the result. For example, in computation of the mean across a one-dimensional array, you would first set an accumulator to zero, then iterate over the array, adding entries to the accumulator and counting steps, and finally you would divide the accumulator by the number of steps taken. This generalizes to averages across columns or rows of two-dimensional data and so on.

These triples fit nicely into the phase scheme outlined earlier: initialization forms the prologue of the fused iteration nest, the associative kernel forms the steady-state, and the finalization forms the epilogue.
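As a minimal sketch (our function names, for illustration), the mean example maps onto the three phases like this:

#include <cstddef>

static void mean_init(float* acc, std::size_t* n) {            // prologue
  *acc = 0.0f;              // identity of the associative operator (+)
  *n = 0;
}
static void mean_accum(float x, float* acc, std::size_t* n) {  // steady-state
  *acc += x;                // associative, so any iteration order is valid
  ++*n;
}
static void mean_finalize(float* acc, std::size_t n) {         // epilogue
  *acc /= (float)n;
}

static float mean(const float* a, std::size_t len) {
  float acc; std::size_t n;
  mean_init(&acc, &n);                                // prologue
  for (std::size_t i = 0; i < len; ++i)
    mean_accum(a[i], &acc, &n);                       // steady-state
  mean_finalize(&acc, n);                             // epilogue
  return acc;
}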

Splits

Naturally, it is desirable to fuse iteration nests containing broadcasts and reductions as much as possible to promote locality of reference and reuse. For most cases, iteration nests containing reductions and broadcasts can be handled like any other; the exception is when a broadcast consumes the result of a reduction. Such regions have dataflow from higher dimensions to a lower one (the reduction), and then back up to a higher dimension (the broadcast). We refer to this as concave dataflow, and the need to handle the lower-dimensional data properly prevents the two higher-dimensional iteration nests (which need not be of the same depth) from being fused. The result is a split.

The identification of a split not only prevents the two iteration nests that gave rise to the split from being fused, but requires that we cut the dataflow graph to separate the two subgraphs—one upstream, one downstream—induced by the identified split. Those kernels downstream of the split depend on the results therein, and therefore we must separate the graph completely.

3.5 Variables and storage

The variable references from the dataflow DAG can be used to determine the storage necessary to execute each kernel correctly. The size of each iteration nest, combined with the total offsets used to reference each variable, can be used to determine the size of each dimension of access. (This is similar to the Minkowski sum found in geometry.)
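A minimal sketch of this sizing computation (our naming; offsets are the relative displacements used in one dimension):

#include <algorithm>
#include <vector>

// Storage needed along one dimension: the trip count of the iteration
// range, extended by the span of relative offsets (the Minkowski-sum-
// like extension noted above).
long dim_storage(long lo, long hi, long stride,
                 const std::vector<long>& offsets /* non-empty */) {
  long trips = (hi - lo) / stride + 1;    // iterations in this dimension
  auto [mn, mx] = std::minmax_element(offsets.begin(), offsets.end());
  return trips + (*mx - *mn);  // e.g. N trips with offsets {-1,0,+1} -> N+2
}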

One advantage of fusion is that it allows for storage to be reduced because of improved locality of reference; we can compute the precise number of times a value is used and schedule loops to minimize intermediate storage.

Enclosing

In most cases, variables end up having limited scope due to their producers and consumers being fused into the same iteration nest. However, when fusion is not possible (because of splits, see above) data must be kept live across distinct iteration nests.

To effect this, we analyze dataflow in the fused iteration nest to find the narrowest region of liveness to use for each variable; we refer to this as the enclosing region for the variable. Most such regions are trivial, being internal to a single iteration nest, but others span multiple iteration nests.

Grouping

The first step in reducing storage of intermediate values is to identify reuse patterns. This can be done by examining all variable references used as inputs and aggregating those going to the same identifier but with distinct references. For example, a 5-point Laplace stencil accesses 5 offsets in a single array: q[j-1][i], q[j][i+1], q[j+1][i], q[j][i-1], and q[j][i].

With references aggregated, we are then able to use the iteration nest’s iterator order to find reuse patterns for each reference. In the Laplace example, if we proceed with iterating in j, i order in linear fashion, we know that when we access q[j+1][i], it is the first time we have visited that value as input. The next time we access that value will be at q[j][i+1], it will be accessed next by q[j][i] and q[j][i-1], and finally by q[j-1][i] before it is not needed again.

Figure 8: Reuse diagram for the 5-point Laplace stencil. Arrows show reuse patterns for a j, i progression; the orange arrows show the Hamiltonian path of reuse.

Figure 8 shows this reuse pattern. This pattern can be produced procedurally in graph form for each identifier by a three-step process: (1) creating a vertex for each reference to the identifier; (2) adding an edge from reference u to reference v where u appears before v in the order induced by the iteration ordering; and finally, (3) computing the longest path in the graph, which will produce a Hamiltonian path covering all nodes in the order that they will be reused.
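For a linear iteration order, the same result can be obtained without the graph; a sketch (our helper, assuming an ascending j-then-i order):

#include <algorithm>
#include <utility>
#include <vector>

using Offset = std::pair<int, int>;   // (dj, di) relative displacement

// A value at (J, I) is read as offset (dj, di) at iteration
// (J - dj, I - di), so larger displacements are touched earlier;
// sorting in descending lexicographic order therefore yields the
// Hamiltonian reuse path directly, equivalent to the longest-path
// construction described above.
std::vector<Offset> reuse_order(std::vector<Offset> offs) {
  std::sort(offs.begin(), offs.end(),
            [](const Offset& a, const Offset& b) { return a > b; });
  return offs;   // 5-point Laplace: (1,0) (0,1) (0,0) (0,-1) (-1,0)
}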

Contraction

With reuse patterns computed, it is possible to contract the storage needed for each input variable by determining, for each identifier in an iteration nest, the distance in iterations between the first and last reference for each variable.

For a 1D, 3-point Laplace example that references the array q at q[i-1], q[i], and q[i+1], grouping analysis shows us that the first time we see a value in the steady-state is at q[i+1]. It is reused by q[i] and then q[i-1], and then isn’t needed again.

Naïve analysis would suggest that we need a buffer for q that covers the entire iteration space for the Laplace operator, plus 2 for the i-1 and i+1 offsets. Distance computations from the contraction analysis show that storage for only 3 values is necessary, and that rotation of the values or use of a circular buffer can manage the updates to storage. See Figure 9(a) for a diagram of this pattern.

(a) Rotation scheme for data in the innermost dimension.
(b) Rotation scheme for data in a dimension that is not the innermost.
(c) Vectorized rotation scheme for data in the innermost dimension, for a vector length of 4.
Figure 9: Rotation schemes for storage contraction.

The same principle can be applied to multidimensional references. For the 5-point Laplace stencil in 2D, the reuse distance for the inputs to the stencil on a variable q given j, i iteration order spans from q[j+1][i] to q[j-1][i]—essentially 2 times the storage necessary for a single i-row. We can produce a circular buffer or rotation scheme to cover this, although it is generally most practical to simply allocate 3 times the storage needed for a single row and to quickly rotate those rows through pointer manipulation. Figure 9(b) shows this ‘outer’ rotation scheme, and a sketch follows.
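A sketch of the ‘outer’ scheme, with stand-in producer and consumer kernels in place of fused HFAV-generated bodies:

#include <cstddef>

// Placeholder producer/consumer standing in for fused kernel bodies.
static void produce_row(int j, int W, float* row) {
  for (int i = 0; i < W; ++i) row[i] = (float)(j + i);
}
static void consume_rows(int W, const float* jm1, const float* j0,
                         const float* jp1, float* out) {
  for (int i = 1; i < W - 1; ++i)      // 5-point combination of three rows
    out[i] = 0.25f * (jm1[i] + jp1[i] + j0[i - 1] + j0[i + 1]);
}

// Sweep over H rows with only 3 rows of intermediate storage, rotated
// by pointer swap rather than by copying data.
static void sweep(int H, int W, float* buf /* 3*W floats */, float* out) {
  float* rows[3] = { buf, buf + W, buf + 2 * W };
  produce_row(0, W, rows[0]);          // prologue: prime two rows
  produce_row(1, W, rows[1]);
  for (int j = 1; j < H - 1; ++j) {    // steady-state
    produce_row(j + 1, W, rows[2]);
    consume_rows(W, rows[0], rows[1], rows[2], out + (std::size_t)j * W);
    float* t = rows[0];                // 'outer' rotation: swap row pointers
    rows[0] = rows[1]; rows[1] = rows[2]; rows[2] = t;
  }
}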

In/out chaining

In most cases, the dataflow graph contains terminal data references that refer to storage external to the analysis and transformation process described here; these are effectively the inputs and outputs to the transformed code. It is possible for inputs to alias outputs (for example, in an in-place stencil update)—the user must report this as part of the transformation input, since our default assumption is that variables do not alias one another.

To preserve correct behavior in the presence of aliasing, we use the dataflow graph to chain from each terminal input across kernel inputs and outputs to find which terminal outputs depend on which terminal inputs. Then, if there is aliasing in interdependent terminals, we determine which input offsets must be copied to temporaries before output can be written back safely.

Vectorization

Vectorization is a powerful tool for improving performance on modern processors, and the assumptions that we impose upon inputs to our transformation technique (order-independent execution, no aliasing, no side effects) generally are very compatible with vectorization.

However, one challenge is that while vectorization is possible for these codes, it is often not helpful when performance is bound by memory bandwidth; this is typical of codes with disparate loops with multiple streams and long reuse distances—precisely the kind of code that our technique seeks to transform.

Fusing loops, reducing storage footprint, and reducing reuse distance can improve the arithmetic intensity of code to the point where vectorization is beneficial, and generally doesn’t impede the process of vectorization.

One hiccup here is that contracting storage (see above) into circular buffers results in data access patterns incompatible with vectorization; in this case, the circular buffers are expanded by a target vector length, and vector code to perform an in-place rotation of the buffer can be emitted. See Figure 9(c) for a visual depiction of how circular buffers expand.
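A sketch of such an expanded buffer, assuming a vector length of 4 as in Figure 9(c); a real backend would emit an intrinsic sequence for the rotation:

#include <cstring>

enum { VL = 4 };                 // assumed target vector length

// A 3-slot circular buffer expanded by VL. Each slot is a contiguous
// VL-wide lane, so vector loads of a slot are unit-stride; instead of
// modular indexing, an in-place shift by VL retires the oldest slot.
struct VecRing3 {
  float lanes[3 * VL];                        // [oldest | middle | newest]
  void rotate() { std::memmove(lanes, lanes + VL, 2 * VL * sizeof(float)); }
  float* newest() { return lanes + 2 * VL; }  // space for incoming values
  const float* slot(int s) const { return lanes + s * VL; }  // s in [0, 2]
};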

3.6 Code Generation

With the fused iteration nest DAG and full variable analysis completed, code generation can begin.

We visit the iteration nest DAG in topological order, emitting enclosing regions and their variable definitions as we encounter them. We emit a loop body for each iteration nest phase and recursively traverse children in an in-order fashion. Any kernel invocations in a phase are emitted according to the topological order found in the original dataflow graph, with appropriate variable references substituted.

When leaving iteration nests, the loops are closed and any necessary variable rotations are emitted.
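For a one-dimensional pipeline of two kernels, the emitted code has roughly the following shape (an illustrative sketch of the structure, not literal HFAV output):

// Two fused kernels -- a producer and a 3-point consumer -- with the
// intermediate contracted to three rotating scalars.
void fused(int N /* >= 3 */, const float* x, float* y) {
  float a0, a1, a2;                  // contracted intermediate storage
  a0 = x[0] * x[0];                  // prologue: prime two values
  a1 = x[1] * x[1];
  for (int i = 1; i < N - 1; ++i) {  // steady-state
    a2 = x[i + 1] * x[i + 1];        // producer kernel invocation
    y[i] = a0 + a1 + a2;             // consumer kernel invocation
    a0 = a1; a1 = a2;                // variable rotation on loop exit
  }                                  // (no epilogue needed here)
}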

4 Implementation

Our prototype implementation of the methods described in this paper—named HFAV—is a research-quality tool that nonetheless is able to produce compelling and even surprising results. HFAV is written in Python for quick development and access to external libraries, but could easily be imagined in a lower-level, more performant language in a production environment.

We use a custom YAML format for handling user input. The front-end of our tool may appear to accept C-like input, and that may be a convenience for users coming from C, but it is foremost simply a way of declaring types, functions, and dependencies, and the mechanics of the way code is handled in the front-end are purely based on substitution. HFAV only needs to know the positions of arguments and the function name to emit compilable code.

The code emitted by HFAV can be included directly into programs or treated as stand-alone source files. The backend languages supported by our tool are C++ and C99, with some work on a Fortran 90 backend begun.

kernels:
  laplace:
    declaration: laplace5(float n, float e, float s,
                          float w, float c, float &o);
    inputs: |
      n : q?[j?-1][i?]
      e : q?[j?][i?+1]
      s : q?[j?+1][i?]
      w : q?[j?][i?-1]
      c : q?[j?][i?]
    outputs: |
      o : laplace(q?[j?][i?])
globals:
  inputs: |
    float g_cell[j?][i?] => cell[j?][i?]
  outputs: |
    laplace(cell[j][i]) => float g_cell[j][i]
Figure 10: YAML inference system for the 5-point Laplace stencil

4.1 Inference

HFAV analysis begins with a dataflow graph we refer to as the ‘inference DAG’, or IDAG. This is a directed graph with individual data accesses—concrete terms—as vertices, and function calls—rule applications, or RAPs—as edges. Input terms form the roots of the IDAG, and output terms form the leaves.

RAP duals

It is possible to define a dual of the IDAG that has RAPs as vertices and the terms interchanged between them as edges; we refer to this as the RAP dual. This is in fact the same graph referred to as the ‘dataflow DAG’ in Figure 2.

Emitting code

The availability of auto-vectorizing compilers and directives (like the omp simd support found in OpenMP 4.0 [10]) means that our transformation can emit scalar loops that the compiler is able to vectorize. For the cases where special rotations are required (see Section 3.5), we provide snippets of the appropriate language with intrinsic sequences for various instruction sets.
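For example, an emitted scalar loop might carry the directive as follows (an illustrative example, not literal HFAV output):

// Scalar loop annotated with the OpenMP 4.0 simd directive, which
// requests vectorization of the steady-state from the compiler.
void scale(int n, const float* x, float* y, float s) {
#pragma omp simd
  for (int i = 0; i < n; ++i)
    y[i] = s * x[i];
}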

Debugging output

One advantage of a Python implementation is that it is possible to easily emit various graphical outputs showing reuse patterns and dataflow; HFAV is capable of displaying these graphs at the user’s request, and they are the basis for many of the figures in this article.

Availability

We have released this prototype as an open-source project under the Apache License 2.0: https://github.com/intel/HFAV.

5 Performance

To demonstrate the utility of HFAV in improving performance for bandwidth-bound codes, we present performance results and analysis for three different applications: the simple normalization example discussed throughout this article; a diffusion stencil; and a shock hydrodynamics benchmark.

5.1 Experimental Setup

                          SKX                 KNL
Ref. Clock (GHz)          2.1                 1.4
L1 / L2 / L3 Cache (KB)   32 / 1024 / 32768   32 / 512 / -
Peak DP GFLOP/s           3225                3046
STREAM Bandwidth (GB/s)   200                 90 (DDR), 490 (MCDRAM)
Table 1: System configuration. L1 and L2 sizes are per-core; L3 is per-socket.

The system configuration used for all of our experiments is given in Table 1.

We use the Intel® Compiler version 18.0, using the -O3 -xHost flags to request optimizations specific to the host on which the code is compiled. We use all cores available on SKX and 64 cores on KNL (leaving 4 cores free for the operating system), launching two threads per core on both systems. Thread affinities are set with KMP_AFFINITY=compact,granularity=fine (to control thread placement) and KMP_HW_SUBSET=${cores},2t (which limits thread placement to specific subsets of logical cores).

The KNL is configured in ‘Quad/Flat’ mode: its DDR and MCDRAM memories are exposed to software as two separate NUMA nodes, and the user must explicitly request allocations backed by MCDRAM (e.g. through the numactl utility or the libnuma library). We present results using MCDRAM only, thus highlighting that HFAV’s data locality and bandwidth usage optimizations remain relevant even in the presence of the high-bandwidth memories available on some architectures.

5.2 Normalization Example

After fusion (see Figure 6), the normalization example requires two loop nests: one containing the flux computation, norm accumulation, and norm root; and another containing the final divisions and normalization operation. The split between these two nests (required because of the reduction) prevents HFAV from performing array contraction optimizations, since the data consumed by the second nest is produced by the first.

The performance improvement arising from HFAV in this instance is therefore a direct result of loop fusion, which reduces the number of times that the entire space is visited from five to two. As shown by the graph in Figure 12, this reduction in bandwidth requirements is reflected by the performance improvement that we see for large problems that do not fit in cache.

Figure 11: Performance of different implementations of the COSMO micro-kernels for the Intel® Xeon Phi™ 7250. ‘STELLA’ is the code from Gysi et al., ‘HFAV’ is the (otherwise unmodified) output of our prototype generator, and ‘HFAV + Tuning’ is the result of manual improvements to the output of ‘HFAV’.

5.3 COSMO Micro-Kernels

In [8], Gysi et al. use a two-dimensional fourth-order diffusion stencil—a proxy for operations arising in the dynamical core of the COSMO atmospheric model [6]—to demonstrate the functionality of the STELLA embedded DSL. The stencil is applied over three-dimensional data (with no dependencies in one of the dimensions) and consists of four kernels performing very few floating-point operations per cell.

The four kernels are as follows:

  • ulapstage: 5-point Laplace operator, computed using the value for a cell and its neighbors in x and y

  • flux_x: flux computation in x, using the values and the results of ulapstage for two neighboring cells in x

  • flux_y: flux computation in y (as above, but in y)

  • ustage: integration, using the value for a cell and the four neighboring fluxes in x and y

The graph in Figure 11 shows performance results for several variants of the COSMO micro-kernels executing on KNL. An optimized ‘STELLA’ version provided by Gysi et al. fuses the final three kernels, with the fluxes computed redundantly for each cell. The ‘HFAV’ version merges all four kernels, using rolling buffers of sizes 2 and 3 for the fluxes and Laplacians respectively; in addition to reducing bandwidth requirements, the memory footprint of the intermediates is sharply reduced. Introducing an additional rolling buffer for the input values would permit the entire operation of this benchmark to be performed in-place, further reducing the memory footprint; however, we omit this optimization from our study since we do not know whether it can be safely applied to COSMO.

(a) Throughput on 2 sockets of Intel® Xeon® Platinum 8160.
(b) Throughput on the Intel® Xeon Phi™ 7250.
Figure 12: Performance of different implementations of the normalization example code for various platforms. ‘autovec’ is the unmodified code (subject to whatever auto-vectorization optimizations the compiler performs), while ‘HFAV’ is the output of our prototype generator.

As before, the performance impact of HFAV is largest for larger problem sizes, where intermediate results do not fit in cache. In order to remain representative of COSMO, all problem sizes considered here are quite small—relative to the degree of parallelism available in modern machines—and therefore overheads introduced by core-to-core communication (i.e. synchronization) and the software pipeline are exposed, limiting performance. Our prototype assumes that the amount of work per loop iteration and the number of pipeline stages are both sufficiently large to amortize the costs of priming the software pipeline and accessing intermediate buffers. This is not true for the COSMO micro-kernels when run on small problems, and we were able to improve upon our results by manually tuning the generated code (’HFAV + Tuning’)—specifically, by using intrinsics to force alignment and instruction masking to fold the prologue/epilogue into the steady-state. The manual nature of these optimizations reflects the immaturity of our prototype, and its current reliance on the compiler to auto-vectorize generated loops efficiently.

5.4 Hydro2D

Hydro2D is a two-dimensional shock-hydrodynamics benchmark developed by CEA [5]. The rolling optimizations performed by HFAV were first manually developed and tested in the context of Hydro2D, and we refer the reader to our previous work [14] for more details of that effort.

Hydro2D implements nine kernels:

  • make_boundary: Set the boundary conditions.

  • constoprim: Convert unknowns from ‘conservation form’ to ‘primitive form’.

  • equation_of_state: Complete the system of primitive equations.

  • slope: Approximate derivatives of the averaged solution.

  • trace: Limit slopes appropriately.

  • qleftright: Split components according to cell boundaries.

  • riemann: Solve Riemann problem at each boundary.

  • cmpflx: Compute conservative fluxes based on Riemann solution.

  • update_cons_vars: Integrate conservative variables in time.

The operator is dimensionally split (i.e. it is applied in one dimension at a time), and each of its kernels therefore has dependencies in only one dimension. Since the dimension in which a dependency occurs differs between the passes, HFAV effectively requires the user to specify the dependency information twice.

HFAV is able to fuse all nine of Hydro2D’s kernels into a single loop nest, removing a significant number of intermediate buffers via array contraction in the process. Storage for all but the four primitive variables (ρ, u, v, and p) can be replaced by rolling buffers with a maximum of 5 stages, sharply reducing the memory footprint in both the x and y passes.

(a) Throughput on 2 sockets of Intel® Xeon® Platinum 8160.
(b) Throughput on the Intel® Xeon Phi™ 7250.
Figure 13: Performance of different implementations of the Hydro2D code for various platforms. ‘autovec’ is the unmodified code (subject to whatever auto-vectorization optimizations the compiler performs), ‘handvec’ is a manual vectorization using intrinsics, and ‘HFAV’ is the output of our prototype generator.

The graphs in Figure 13 show that HFAV is able to achieve performance similar to that of our manual efforts (from [14]); for problems where a significant number of kernels can be fused, the instruction overheads that affected the COSMO micro-kernels can be amortized. Not reflected in the graph is the shift in architectural bottleneck after fusion: fusing all of the kernels and contracting the vast majority of the arrays moves Hydro2D from being bandwidth-bound to compute-bound—indeed, our generated code has almost identical performance on KNL whether using DDR or MCDRAM memory (results not shown).

6 Summary

We have presented a technique for transforming input code consisting of kernels and loops into fused, vectorized kernels that can have more compact footprints and better data reuse properties. Our technique integrates with outer-level parallelization schemes such as OpenMP or MPI, and we have shown sizable speedups for important proxy codes on state-of-the-art hardware.

We hope our technique lays the groundwork for expanded and more powerful code generation/transformation tools.

Limitations & Future Work

Some of the assumptions underlying our technique can be restrictive, and we would like to see future efforts address these restrictions. In particular, we would like to see direct support for automatic domain decomposition and threading, treatment of intermixed loop orderings, and the ability to fuse codes on unstructured grids.

We believe that the data dependency information required by HFAV could be expressed via directives and integrated into existing codes—similar to the approach in [2]—and are interested in exploring this approach.

References

  • [1] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: A Language and Compiler for Algorithmic Choice. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, pages 38–49, New York, NY, USA, 2009. ACM.
  • [2] I. J. Bertolacci, M. M. Strout, S. Guzik, J. Riley, and C. Olschanowsky. Identifying and scheduling loop chains using directives. In 2016 Third Workshop on Accelerator Programming Using Directives (WACCPD), pages 57–67, Nov 2016.
  • [3] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral program optimization system. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008.
  • [4] M. Christen, O. Schenk, and H. Burkhart. PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 676–687, May 2011.
  • [5] Guillaume Colin de Verdière. A 2D hydro code for benchmarking purposes, 2017. Accessed September 2017.
  • [6] Consortium for Small-Scale Modelling, 2017. Accessed September 2017.
  • [7] Tobias Grosser, Armin Groesslinger, and Christian Lengauer. Polly–Performing Polyhedral Optimizations on a Low-level Intermediate Representation. Parallel Processing Letters, 22(04):1250010, 2012.
  • [8] Tobias Gysi, Carlos Osuna, Oliver Fuhrer, Mauro Bianco, and Thomas C. Schulthess. STELLA: A Domain-specific Tool for Structured Grid Methods in Weather and Climate Models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 41:1–41:12, New York, NY, USA, 2015. ACM.
  • [9] J. Ng, Dattatraya Kulkarni, W. Li, R. Cox, and S. Bobholz. Inter-procedural loop fusion, array contraction and rotation. In 2003 12th International Conference on Parallel Architectures and Compilation Techniques, pages 114–124, Sept 2003.
  • [10] OpenMP Architecture Review Board. OpenMP Application Program Interface Version 4.0, July 2013. Accessed September 2017.
  • [11] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 519–530, New York, NY, USA, 2013. ACM.
  • [12] István Z Reguly, Gihan R Mudalige, Michael B Giles, Dan Curran, and Simon McIntosh-Smith. The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations. In Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), 2014 Fourth International Workshop on, pages 58–67. IEEE, 2014.
  • [13] István Z. Reguly, Gihan R. Mudalige, and Mike B. Giles. Loop tiling in large-scale stencil codes at run-time with OPS. CoRR, abs/1704.00693, 2017.
  • [14] Jason D. Sewall and Guillaume Colin de Verdière. From ’correct’ to ’correct & efficient’: A hydro2d case study with godunov’s scheme. In High Performance Parallelism Pearls, chapter 2, pages 7–42. 2015.
  • [15] Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. The Pochoir Stencil Compiler. In Proceedings of the Twenty-third Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’11, pages 117–128, New York, NY, USA, 2011. ACM.
  • [16] C. Yount, J. Tobin, A. Breuer, and A. Duran. YASK–Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), pages 30–39, Nov 2016.

7 Disclaimers

Intel, the Intel logo, Intel Xeon, Intel Xeon Phi and Intel VTune are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel does not control or audit third-party benchmark data or the other papers referenced in this document. You should visit the referenced documents and confirm whether referenced data are accurate.

Acknowledgements

The authors would like to thank the members of the HEAT team at Intel for their support of this project, as well as Guillaume Colin de Verdière of CEA and Grzegorz Kwasniewski, Tobias Gysi, and Torsten Hoefler of ETH Zurich for their help with application codes.