HFAV
None
view repo
We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware. We introduce representations for the abstract relationships and data dependencies of kernels in loop nests and algorithms for manipulating them into more efficient form; we similarly introduce techniques for determining data access patterns for stencil-like array accesses and show how this can be used to elide storage and improve vectorization. We discuss our prototype implementation of these ideas---named HFAV---and its use of a declarative, inference-based front-end to drive transformations, and we present results for some prominent codes in HPC.
READ FULL TEXT VIEW PDFNone
We present a technique for automatically transforming kernel-based computations in disparate, nested loops into a fused, vectorized form that can reduce intermediate storage needs and lead to improved performance on contemporary hardware.
We introduce representations for the abstract relationships and data dependencies of kernels in loop nests and algorithms for manipulating them into more efficient form; we similarly introduce techniques for determining data access patterns for stencil-like array accesses and show how this can be used to elide storage and improve vectorization.
We discuss our prototype implementation of these ideas—named HFAV—and its use of a declarative, inference-based front-end to drive transformations, and we present results for some prominent codes in HPC.
Hardware evolves, and languages can evolve with it, but codes remain static unless rewritten. This article presents a method for automatically generating stencil-like codes—based on existing code and simple rules—that run well on modern hardware.
The longevity of many programming languages has been both a boon and a hindrance to computing. Fortran, famously the first high-level programming language^{1}^{1}todo: 1need a reference?, has been in continuous use since 1957. C and C++ first appeared in 1972 and 1983, respectively^{2}^{2}todo: 2need a reference?. Billions of lines of code have been written in these languages, providing a huge library of code for posterity, and their longevity means that developers have had many years in which to become familiar with the languages.
These languages were developed on and for hardware long deemed obsolete, in computing environments that would be nearly unrecognizable to most programmers today. New ideas about programming have come (and some have gone) during their lifetimes. All of this is to say that the sort of programs that run best on modern hardware, in modern computing environments, can be difficult or impossible to express in old languages. To address this, older languages are revised and updated; Fortran has seen major revisions roughly every decade, as has C, and C++ seems to be on a 3- or 4-year cadence. ^{3}^{3}todo: 3need a reference?
The modernization of these languages can help address many problems, but it largely cannot retroactively, automatically modernize older codes. That mass of ‘legacy’ code has inertia and—if the analogy may be stretched a little more—perhaps even a gravitational field; new codes related to a long-lived project may be developed in an older variant of a language to maintain compatibility.
Of course, writing a code from scratch in a modern language doesn’t guarantee a straightforward, simple solution. The complexity of modern hardware is such that complex software solutions are sometimes necessary to achieve the highest level of performance. Compilers have become extremely powerful and remain essential tools for developing efficient codes; even so, they are not always capable of transforming user code into the most efficient implementation for a given machine. This is where intermediate tools—optimization frameworks, libraries, domain-specific languages (DSLs) and source-to-source techniques like the one we present here—can be most useful.
We present a method for generating efficient computations on regular grids based on user-specified code and rules. At its core, our technique aggressively fuses user-provided loops—potentially nested—to maximize locality and re-use of data; it is also able to reduce or eliminate storage of intermediate values, and it is capable of further transformations to enable higher-performance vectorization.
We have developed a prototype tool named High-performance Fusion And Vectorization (HFAV) that implements the techniques described in this article that accepts declarative input in the form of initial ‘axioms’, terminal ‘goals’, and a collection of production rules that describe individual kernels in the code and their data dependencies. HFAV is then able to perform inference on this logical system to discover the dimensionality and dataflow of the program, to which the above techniques are applied to produce generated code that invokes the user-specified kernels. This is amenable to integration with an outer level of thread-level or SPMD-level parallelism, and our prototype has complete backends in C and C++ with preliminary support for Fortran.
This prototype has been open-sourced and released; see Section 4 for details. ^{inline}^{inline}todo: inlineTease results
Transforming user code into a form that is highly-tuned for a specific hardware architecture is an active area of research. This is especially true in the domain of stencil codes, where operators and data access patterns are simple to define and characterize, but difficult to map to hardware such that they run efficiently (i.e. accounting for node/thread/vector parallelism and all levels of the memory hierarchy). As a result, the developers of stencil codes have a variety of actively developed DSLs and tools to choose from—such as STELLA [8], Pochoir [15], PATUS [4], OPS [12], PLuTo [3], and YASK [16]—which are capable of automating parallelization, cache blocking, vectorization and changes in data layout. Our technique employs optimizations that are orthogonal to those implemented in existing stencil frameworks, and assumes that something else (i.e. the developer or another tool) is responsible for parallelization and data layout; we believe that the optimizations presented in this article are compatible with existing frameworks, but exploration of this is beyond the scope of this work.
The loop fusion and storage/bandwidth reduction optimizations from our technique are similar to some that can be found in other research, but they differ in their implementation. Compiler passes that perform these optimizations automatically have been implemented [9, 7], but their usage is limited to cases in which the compiler is able to recognize that the optimization is both valid and beneficial. Both the OPS framework [13] and the loop chain directives developed by Bertolacci et al. [2] sidestep this problem by fusing loops based on user-provided data dependency information, but neither specifically targets the optimization or removal of intermediate storage locations. The “storage folding” optimization in Halide [11] is most similar to that discussed here, with the key difference being our technique’s specific consideration for efficient vectorization.
To the best of our knowledge, the use of an inference system to discover and satisfy data dependencies that is found in our prototype implementation is novel. However, the use of inference systems in general is not novel in this space. One example is the inference system used by PetaBricks [1] to construct algorithms from device- and context-specific building blocks. HFAV’s inference system is currently much simpler that PetaBricks’—allowing for only one function to produce a given output—and would likely need to be extended if our implementation were to support a wider range of architectures.
Some terminology we use in this article is informed by the declarative, rule-based front-end found in our prototype implementation. While this front-end is useful, the fusion and transformation techniques at the core of this method do not depend on it; we describe our prototype and its front-end, as well as other front-ends in progress, in Section 4.
To help convey the key concepts of our technique, we employ various example codes and demonstrate their transformation process through the code generator. The 5-point Laplace stencil is one such example; this is a common finite-difference discretization of the Laplace operator in two spatial dimensions. Listing 1 shows the Laplace stencil used in a successive-over-relaxation (SOR) method.
We also use a code that performs one-dimensional flux differences on a system of equations that needs to normalize each computed flux vector. This example is greatly simplified from real-world hydrodynamics codes, but shows important data flow concepts.
Our technique operates on codes that apply homogeneous computations over iteration spaces—multidimensional outer products of series of equally-spaced integers. In imperative terms, we operate on codes that call kernels in successive nested ‘simple’^{1}^{1}1
‘Simple’ here means that each loop advance a single integral counter by a fixed integral stride. The number of trips must not depend on other enclosing or enclosed loops or kernel computations.
loops. We reference the dimensions of this iteration space with familiar loop labels such as , , and .While the computations found in these nested loops need not be independent of one another—either within a loop nest or across them—our transformations require that the code be agnostic to the order in which a given kernel is applied; this freedom to re-order computations is critical in transforming the code. This implies that we do not allow kernels to have ‘side-effects’ (i.e. effects on global program state).
Kernels are described in terms of their data accesses/outputs against a canonical frame of reference. So, while the Laplace code shown above accesses a 2D data grid of size , it is convenient to talk about it in a translation-free sense as an operator using relative offsets. This is in fact very much the way the invocation to the laplace5 appears in Listing 1.
When performing fusion analysis, we impose a user-selected global loop ordering on the code: while kernels can be expressed in an order-free listing of iteration space depth (e.g., or ), ultimately they are arranged to match the global order to enable fusion.
Our transformation technique follows the following steps:
Form dataflow graph
Build iteration nests
Build iteration nest graph
Fuse iteration nest graph
Analyze variables
(in-outs, reuse, contraction, vectorization)
Emit code
These steps are described in detail below.
We begin with a dataflow graph provided by the front-end; this is a directed acyclic graph (DAG) with kernel callsites as vertices and intermediate variables forming the edges. Such a graph for the Laplace example is shown in Figure 2. Note that we have added pseudo-kernels such as ‘load’ and ‘store’ to handle terminal references.
A significant amount of information may be gleaned from this graph; in addition to informing the eventual code generation (through a topological traversal of this graph), we can determine the iteration space for each kernel callsite by taking the union of all iteration spaces found on incident variables.
To enable fusion and other transformations, we will use this graph to build a data structure more suited to the task. First, we introduce a building block.
An iteration space describes all trips taken by a set of tightly nested loops, such as the loops found in Listing 1 where there is no code between the two for statements. This is a useful construct, but it cannot describe some iteration patterns that are useful to manipulate.
A common pattern in performing iteration is to perform some work before a loop (the prologue), the main loop body itself (the steady-state), and to perform some work after the loop (the epilogue). We refer to these steps as the phases of the iteration nest. A single iteration nest has an associated identifier, range, and stride that is used to to generate code and to determine compatibility during fusion.
The prologue can be omitted if the kernel invocations and iteration nests therein are the same as those in the steady-state; the same applies to the epilogue; and when all phases are identical, the ‘perfect’ iteration nest directly corresponds to an iteration space. Note that it is possible to permute the nesting of perfect iteration nest, which is equivalent to permuting the order of loops; this is useful for imposing a global loop nesting order on kernel callsites.
The general form of an iteration nest with distinct phases and recursive structure (i.e. other iteration nests in one or more of those bodies) takes the form of a -ary tree. It is sometimes useful to refer to the maximum depth of a given tree when performing fusion.
The dataflow graph’s structure can generate the iteration space for each kernel callsite (see Section 3.2); this can be used to initialize a perfect iteration nest that we will manipulate in later stages. We form the initial iteration nest DAG by first aggregating certain kernel callsites (see ‘Grouping’, below) by merging vertices, and then creating the aforementioned perfect iteration nests from those groups with callsites of the innermost nest steady-states. See Figure 4 for the initial iteration nest DAG from the normalization example.
While it is possible to construct the iteration nest DAG from the dataflow DAG directly (without merging any vertices) it makes sense to combine related kernel callsites. Specifically, we examine the kernel callsites and group them based on: (1) matching kernel names, and (2) matching parameter lists. A parameter list is said to match if the parameters are identical in value and position except for the spatial displacements. So {load(cell[i][j])
, load(cell[i+1][j])
, load(cell[i-1][j])
, load(cell[i][j-1])
}, load(cell[i][j+1])
}—differing only in the displacements in i and j—are grouped together, while {load(cell[i][j])
, load(aux[i+1][j])
} would not be.
Once the iteration nest DAG is constructed, we can begin to fuse the iteration nests located at vertices. Fusion has two processes: (1) an outer level where certain pairs of vertices of the iteration nest DAG are examined to determine if they may be fused; and (2) an inner level where, if fusion is to occur, the two iteration nests at said vertices are fused.
The outer level of this process is a traversal of the iteration nest DAG that maintains a ‘fusing’ vertex (which may be the fused aggregate of many of the vertices the original inest DAG); fusion is attempted across each incoming edge, which eventually results in the exhaustion of incoming edges or an unfusable split (see below). Pseudocode for this procedure is given in Listing 5.
The traversal must occur in a topological order to ensure that we do not introduce cycles when fusing vertices. When an edge is identified as unfusable, we identify all vertices reachable from the candidate vertex and cut the iteration nest DAG along the separating edges between that subgraph and its complement.
The fuse_inest() procedure in Figure 5 handles the details of fusing iteration nests.
First, recall (see Section 3.1) that we have established a global ranking of loop identifiers (e.g., a triple-nested loop might have identifiers ordered as : is rank 2 and outermost, is rank 1, and is rank 0 and the innermost). By construction, the identifiers within iteration nests obey this order, and each iteration nest has an equal to the rank of its outermost identifier—e.g., following the previous example, a double-nested iteration nest over and would have , and a double-nested iteration nest over and would have .
Second, recall that each iteration nest contains—in its leaves—nodes from the dataflow DAG. For any two iteration nests, or phases within iteration nests, we can examine the relative ordering of the dataflow DAG subgraphs induced by these iteration nests. Given two such subgraphs and of the dataflow DAG , we can ask if , which asks if each node of can be topographically ordered before each node of in . For two such subgraphs, there are four possiblities:
= true and = true: and are indepdendent of one another in and maybe be relatively ordered in any fashion.
= true and = false: must be ordered before .
= false and = true: must be ordered before .
= false and = false: and have a cycle between them and cannot be ordered; even though is acyclic by definition, subgraphs of it need not maintain that property.
Fusion operates on pairs of iteration nests (we shall say and ) using these concepts of rank ordering and dataflow ordering, and is applied recursively—a natural consequence of the recursive nature of iteration nests. First, we compare the rank order: and ; this determines how kernels in the phases of and may be combined. When the ranks match (i.e., the outer-most identifiers are the same), we recursively fuse each phase of and directly (i.e. we fuse the prologue of to the prologue of , we fuse the steady-state of to the steady-state of , and we fuse the epilogue of to the epilogue of ), subject to the ability to find a compatible dataflow order, as described above.
When the ranks of and differ, we fuse the lower-ranked iteration nest’s phases into the higher-ranked iteration nests’s prologue or epilogue, depending on the dataflow ordering. In any case, if we are unable to establish an order between iteration nests, we have identified the need to split. Pseudocode for iteration nest fusion is given in Listing 7.
Most real codes are relatively simple to fuse: for example, a 7-point Laplace stencil might consume three-dimensional data from different offsets and produce three-dimensional data. Flux differencing codes consume adjacent pairs of three-dimensional data to compute fluxes, then those fluxes are used in pairs or triples for limiting and integration into a three-dimensional output. When the inputs and outputs to kernels are all of the same dimension, their loops don’t require special handling.
However, there are real use cases where a kernel might consume one-dimensional data to produce three-dimensional data, or vice versa. These require special consideration. We refer to kernels that transform lower-dimensional data to higher-dimensional data as broadcasts and to those that fransform higher-dimensional data to lower-dimensional data as reductions.
Broadcasts can be handled by fusing the producer of the lower-dimensional data into the prologue of one of the consumers’ iteration nests.
Reductions require more consideration because they involve many writes to the same data; because of our iteration-order-agnostic assumptions, valid reduction kernels must implement associative operators. They are typically found in a three-step pattern spread across three kernels: first, an initialization kernel sets the accumulation data to the identity of the associative operator, then the associative kernel accumulates over that data, and last a finalization kernel is run to finish the result. For example, in computation of the mean across a one-dimensional array, you would first set an accumulator to zero, then iterate over the array, adding entries to the accumulator and counting steps, and finally you would divide the accumulator by the number of steps taken. This generalizes to averages across columns or rows of two-dimensional data and so on.
These triples fit nicely into the phase scheme outlined earlier: initialization forms the prologue of the fused iteration nest, the associative kernel forms the steady-state, and the finalization forms the epilogue.
Naturally, it is desirable to fuse iteration nests containing broadcasts and reductions as much as possible to promote locality of reference and reuse. For most cases, iteration nests containing reductions and broadcasts can be handled as any other; the exception is when a broadcast consumes the result of a reduction. Such regions have dataflow from higher dimensions to a lower one (the reduction), and then back up to a higher dimension (the broadcast). We refer to this as concave dataflow and the need to handle the lower-dimensional data property prevents the two higher-dimensional iteration nests (which need not be of the same depth) from being fused. The result is a split.
The identification of a split not only prevents the two iteration nests that gave rise to the split from being fused, but requires that we cut the dataflow graph to separate the two subgraphs—one upstream, one downstream—induced by the identified split. Those kernels downstream of the split depend on the results therein, and therefore we must separate the graph completely.
The variable references from the dataflow DAG can be used to determine the storage necessary to execute each kernel correctly. The size of each iteration nest, combined with the total offsets used to reference each variable, can be used to determine the size of each dimension of access.^{2}^{2}2This is similar to the Minkowski Sum found in geometry.
One advantage of fusion is that it allows for storage to be reduced because of improved locality of reference; we can compute the precise number of times a value is used and schedule loops to minimize intermediate storage.
In most cases, variables end up having limited scope due to their producers and consumers being fused into the same iteration nest. However, when fusion is not possible (because of splits, see above) data must be kept live across distinct iteration nests.
To effect this, we analyze dataflow in the fused iteration nest to find the narrowest region of liveness to use for each variable; we refer to this as the enclosing region for the variable. Most such regions are trivial, being internal to a single iteration nest, but others span multiple iteration nests.
The first step in reducing storage of intermediate values is to identify reuse patterns. This can be done by examining all variable references used as inputs and aggregating those going to the same identifier but with distinct references. For example, a 5-point Laplace stencil access 5 offsets in a single array: .
With references aggregated, we are then able to use the iteration nest’s iterator order to find reuse patterns for each reference. In the Laplace example, if we proceed with iterating in , order in linear fashion, we know that when we access , it is the first time we have visited that value as input. The next time we access that variable will be at , and it will be accessed next by , and finally before it is not needed again.
Figure 8 shows this reuse pattern. This pattern can be produced procedurally in graph form for each identifier by a three-step process: (1) creating vertices for each reference to ; (2) adding an edge from to where appears before in the order induced by the iteration ordering; and finally, (3) computing the longest path in the graph, which will produce a Hamiltonian path covering all nodes in the order that they will be reused.
With reuse patterns computed, it is possible to contract the storage needed for each input variable by determining, for each identifier in an iteration nest, the distance in iterations between the first and last reference for each variable.
For a 1D, 3-point Laplace example that references the array at , grouping analysis shows us that the first time we see a value in the steady-state is at . It is reused by and then , and then isn’t needed again.
Naïve analysis would suggest that we need a buffer for that covers the entire iteration space for the Laplace operator, plus 2 for the and . Distance computations from the contraction analysis show that storage for only 3 values is necessary, and that rotation of the values or use of a circular buffer can manage the updates to storage. See Figure 8(a) for a diagram of this pattern.
The same principle can be applied to multidimensional references. For the 5-point Laplace stencil in 2D, the reuse distance for the inputs to the stencil on a variable given iteration order span to —essentially 2 times the storage necessary for a single -row. We can produce a circular buffer or rotation scheme to cover this, although it is generally most practical to simply allocate 3 times the storage needed for a single row and to quickly rotate those rows through pointer manipulation. Figure 8(b) shows this ‘outer’ rotation scheme.
In most cases, the dataflow graph contains terminal data references that refer to storage external to the analysis and transformation process described here; these are effectively the inputs and outputs to the transformed code. It is possible for inputs to alias outputs (for example, in an in-place stencil update)—the user must report this as part of the transformation input, since our default assumption is that variables do not alias one another.
To preserve correct behavior in the presence of aliasing, we use the dataflow graph to chain from each terminal input across kernel inputs and outputs to find which terminal outputs depend on which terminal inputs. Then, if there is aliasing in interdependent terminals, we determine which input offsets must be copied to temporaries before output can be written back safely.
Vectorization is a powerful tool for improving performance on modern processors, and the assumptions that we impose upon inputs to our transformation technique (order-independent execution, no aliasing, no side effects) generally are very compatible with vectorization.
However, one challenge is that while vectorization is possible for these codes, it is often not helpful when performance is bound by memory bandwidth; this is typical of codes with disparate loops with multiple streams and long reuse distances—precisely the kind of code that our technique seeks to transform.
Fusing loops, reducing storage footprint, and reducing reuse distance can improve the arithmetic intensity of code to the point where vectorization is beneficial, and generally doesn’t impede the process of vectorization.
One hiccup here is that contracting storage (see above) into circular buffers results in data access patterns incompatible with vectorization; in this case, the circular buffers are expanded by a target vector length and vector code to perform an in-place rotation of the buffer can be emitted. See Figure 8(c) for a visual depiction of how circular buffers expand.
With the fused iteration nest DAG and full variable analysis completed, code generation can begin.
We visit the iteration nest DAG in topological order, emitting enclosing regions and their variable definitions as we encounter them. We emit a loop body for each iteration nest phase and recursively traverse children in an in-order fashion. Any kernel invocations in a phase are emitted according to the topological order found in the original dataflow graph, with appropriate variable references substituted.
When leaving iteration nests, the loops are closed and any necessary variable rotations are emitted.
Our prototype implementation of the methods described in this paper—named HFAV—is a research-quality tool that nonetheless is able to produce compelling and even surprising results. HFAV is written in Python for quick development and access to external libraries, but could easily be imagined in a lower-level, more performant language in a production environment.
We use a custom YAML format for handling user input. The front-end of our tool may appear to accept C-like input, and that may be a convenience for users coming from C, but it is foremost simply a way of declaring types, functions, and dependencies, and the mechanics of the way code is handled in the front-end are purely based on substitution. HFAV only needs to know the positions of arguments and the function name to emit compilable code.
The code emitted by HFAV can be included directly into programs or treated as stand-alone source files. The backend languages supported by our tool are C++ and C99, with some work on a Fortran 90 backend begun.
HFAV analysis begins with a dataflow graph we refer to as the ‘inference DAG’, or IDAG. This is a directed graph with individual data accesses—concrete terms—as vertices and function calls and rule applications or RAPs—as edges. Input terms form the roots of the IDAG, and output terms form the leaves.
It is possible to define a dual of the IDAG that has RAPs as vertices and the terms interchanged between them as edges; we refer to this as the RAP dual. This is in fact the same graph referred to as the ‘dataflow DAG’ in Figure 2.
The availability of auto-vectorizing compilers and directives (like the omp simd
support found in OpenMP 4.0 [10]) means that our transformation can emit scalar loops that the compiler is able to vectorize. For the cases where some special rotations are requires (see Secs.3.5 and 3.5), we provide snippets of the appropriate language with intrinsic sequences for various instruction sets.
One advantage of a Python implementation is that it is possible to easily emit various graphical outputs showing reuse patterns and dataflow; HFAV is capable of displaying these graphs at the users’ request and is the basis for many of the figures in this article.
We have released this prototype as an open-source project under the Apache License 2.0: https://github.com/intel/HFAV.
To demonstrate the utility of HFAV in improving performance for bandwidth-bound codes, we present performance results and analysis for three different applications: the simple normalization example discussed throughout this article; a diffusion stencil; and a shock hydrodynamics benchmark.
SKX | KNL | |
---|---|---|
Ref. Clock (GHz) | 2.1 | 1.4 |
L1 / L2 /L3 Cache (KB) | 32 / 1024 / 32768 | 32 / 512 / - |
Peak DP GFLOP/s | 3225 | 3046 |
STREAM Bandwidth (GB/s) | 200 |
90 (DDR)
490 (MCDRAM) |
The system configuration used for all of our experiments is given in Table 1.
We use the Intel^{®} Compiler version 18.0, using the -O3 -xHost flags to request optimizations specific to the host on which the code is compiled. We use all cores available on SKX and 64 cores on KNL (leaving 4 cores free for the operating system), launching two threads per core on both systems. Thread affinities are set with KMP_AFFINITY=compact,granularity=fine (to control thread placement) and KMP_HW_SUBSET=${cores},2t (which limits thread placement to specific subsets of logical cores).
The KNL is configured in ‘Quad/Flat’ mode: its DDR and MCDRAM memories are exposed to software as two separate NUMA nodes, and the user must explicitly request allocations backed by MCDRAM (e.g. through use of libnuma and its numactl utility). We present results using MCDRAM only, thus highlighting that HFAV’s data locality and bandwidth usage optimizations remain relevant even in the presence of the high-bandwidth memories available on some architectures.
After fusion (see Figure 6), the normalization example requires two loop nests: one containing the flux computation, norm accumulation and norm root; and another containing the final divisions and normalization operation. The split between these two nests (required because of the reduction) prevents HFAV from performing array contraction optimizations—the data consumed by the second nest is produced by the first.
The performance improvement arising from HFAV in this instance is therefore a direct result of loop fusion, which reduces the number of times that the entire space is visited from five to two. As shown by the graph in Figure 12, this reduction in bandwidth requirements is reflected by the performance improvement that we see for large problems that do not fit in cache.
In [8], Gysi et al. use a two-dimensional fourth-order diffusion stencil—a proxy for operations arising in the dynamical core of the COSMO atmospheric model [6]—to demonstrate the functionality of the STELLA embedded DSL. The stencil is applied over three-dimensional data (with no dependencies in one of the dimensions) and consists of four kernels performing very few floating-point operations per cell.
The four kernels are as follows:
ulapstage: 5-point Laplace operator, computed using the value for a cell and its neighbors in and
flux_x: flux computation in , using the values and the results of ulapstage for two neighboring cells in
flux_y: flux computation in (as above, but in )
ustage: integration, using the value for a cell and the four neighboring fluxes in and
The graph in Figure 11 shows performance results for several variants of the COSMO micro-kernels executing on KNL. An optimized ‘STELLA’ version provided by Gysi et al. fuses the final three kernels, with the fluxes computed redundantly for each cell. The ‘HFAV’ version merges all four kernels, using rolling buffers of sizes 2 and 3 for the fluxes and Laplacians respectively—in addition to reducing bandwidth requirements, memory footprint is reduced from to . Introducing an additional rolling buffer for the input values would permit the entire operation of this benchmark to be performed in-place, further reducing the memory footprint by ; however, we omit this optimization from our study since we do not know whether it can be safely applied to COSMO.
As before, the performance impact of HFAV is largest for larger problem sizes, where intermediate results do not fit in cache. In order to remain representative of COSMO, all problem sizes considered here are quite small—relative to the degree of parallelism available in modern machines—and therefore overheads introduced by core-to-core communication (i.e. synchronization) and the software pipeline are exposed, limiting performance. Our prototype assumes that the amount of work per loop iteration and the number of pipeline stages are both sufficiently large to amortize the costs of priming the software pipeline and accessing intermediate buffers. This is not true for the COSMO micro-kernels when run on small problems, and we were able to improve upon our results by manually tuning the generated code (’HFAV + Tuning’)—specifically, by using intrinsics to force alignment and instruction masking to fold the prologue/epilogue into the steady-state. The manual nature of these optimizations reflects the immaturity of our prototype, and its current reliance on the compiler to auto-vectorize generated loops efficiently.
Hydro2D is a two-dimensional shock-hydrodynamics benchmark developed by CEA [5]. The rolling optimizations performed by HFAV were first manually developed and tested in the context of Hydro2D, and we refer the reader to our previous work [14] for more details of that effort.
Hydro2D implements nine kernels:
make_boundary: Set the boundary conditions.
constoprim: Convert unknowns from ‘conservation form’ to ‘primitive form’.
equation_of_state: Complete the system of primitive equations.
slope: Approximate derivatives of the averaged solution.
trace: Limit slopes appropriately.
qleftright: Split components according to cell boundaries.
riemann: Solve Riemann problem at each boundary.
cmpflx: Compute conservative fluxes based on Riemann solution.
update_cons_vars: Integrate conservative variables in time.
The operator is dimensionally split (i.e. it is applied in one dimension at a time), and each of its kernels therefore have dependencies in only one dimension. Since the dimension in which a dependency occurs is different for each pass, HFAV effectively requires the user to specify the dependency information twice.
HFAV is able to fuse all nine of Hydro2D’s kernels into a single loop nest, removing a significant number of intermediate buffers via array contraction in the process. Storage for all but the four primitive variables (, , and ) can be replaced by rolling buffers with a maximum of 5 stages: memory footprint is thereby reduced from to in both the and passes.
The graphs in Figure 13 show that HFAV is able to achieve performance similar to those of our manual efforts (from [14]); for problems where a significant number of kernels can be fused, the instruction overheads that effected the COSMO micro-kernels can be amortized. Not reflected in the graph is the shift in architectural bottleneck after fusion; fusing all of the kernels and contracting the vast majority of the arrays moves Hydro2D from being bandwidth-bound to compute-bound—indeed, our generated code has almost identical performance on KNL whether using DDR or MCDRAM memory (results not shown).
We have presented a technique for transforming input code consisting of kernels and loops into fused, vectorized kernels that can have more compact footprints and better data reuse properties. Our technique integrates with outer-level parallelization schemes such as OpenMP or MPI, and we have shown sizable speedups for important proxy codes on state-of-the art hardware.
We hope our technique lays the groundwork for expanded and more powerful code generation/transformation tools.
Some of the assumptions underlying our technique can be restrictive, and we would like to see future efforts address these restrictions. In particular, we would like to see direct support for automatic domain decomposition and threading, treatment of intermixed loop orderings, and the ability to fuse codes on unstructured grids.
We believe that the data dependency information required by HFAV could be expressed via directives and integrated into existing codes—similar to the approach in [2]—and are interested in exploring this approach.
Intel, the Intel logo, Intel Xeon, Intel Xeon Phi and Intel VTune are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Intel does not control or audit third-party benchmark data or the other papers referenced in this document. You should visit the referenced documents and confirm whether referenced data are accurate.
The authors would like to thank the members of the HEAT team at Intel for their support of this project, as well as Guillaume Colin de Verdière of CEA and Grzegorz Kwasniewski, Tobias Gysi, and Torsten Hoefler of ETH Zurich for their help with application codes.
Comments
There are no comments yet.