In scientific and high-performance computing, there are two types of people who interact with code: the Domain Scientist, who knows what to compute; and the Performance Engineer, who knows how to compute it efficiently. The importance of efficiency is amplified in large-scale computations, which involve scheduling tasks to thousands of computers, taking days or weeks to complete. Because computing time and energy can be expensive, supercomputing centers employ teams of such performance engineers who improve many different applications.
The two main ingredients of a scientific application are computations and memory accesses. With the emergence of new processing technologies, arithmetic operations and energy efficiency improve, whereas data movement does not match the same rate (Unat et al., 2017). Consequently, application runtime is dominated by the overall data movement, which must be optimized for data-locality, regardless of the target architecture (CPU, GPU, or FPGA).
General-purpose compilers solve a significant portion of the problem with peephole optimizations, or with automated analysis of specific application classes, a notable example being polyhedral model-based optimization (Grosser et al., 2012). However, because compilers have to be conservative with assumptions on application semantics and data management, such as pointer aliasing, it is hard to automatically parallelize and optimize code. This problem is exacerbated when using accelerators and distributed environments, as memory and code are fragmented between different library calls (MPI, OpenMP), languages (CUDA, FORTRAN), and programming paradigms (FPGAs). As a result, multiple implementations of the scientific application co-exist within the same code-base, each of which is tuned for a different system.
In this paper, we propose an intermediate representation that decouples scientific applications from their mapping to a given system. The representation (https://www.github.com/spcl/dace), named the Stateful DataFlow multiGraph (SDFG), is targeted toward performance engineers and is capable of conveying various optimizations related to data movement, all without modifying the original application computations.
SDFGs (Fig. 1) are based on parametric graphs that combine data dependencies with stateful execution. They enable data-centric optimization by ❶ tracking memory and fine-grained data movement; ❷ hiding computational code and allowing domain scientists to modify it separately; ❸ visibly encapsulating parametric parallelism; and ❹ enabling execution of sequential decision and data-dependent processes. As a result, it is possible to use SDFGs to express programs with regular access patterns, but also with nonlinear, data-dependent, and irregular accesses. Compared to other dataflow-based representations (Isard et al., 2007; Zhou and Demsky, 2010; Thies et al., 2002), SDFGs are a mid-point between high-level representation and expressiveness, enabling complex multi-state applications that are analyzable with respect to data movement optimizations.
To support optimizations, we define a pipeline for SDFG compilation. Specifically, the SDFG is modified successively, gradually optimizing data movement for a given architecture using graph transformations. The process of transformation uses subgraph matching and replacement, allowing performance engineers to change the SDFG with or without affecting application validity.
SDFGs are designed to be generated from Domain Specific Languages (DSLs) and high-level paradigms, generating high-performance code for parallel and reconfigurable architectures. We demonstrate how Python, a subset of MATLAB, and TensorFlow can be lowered into SDFGs, which in turn are compiled to CPUs, GPUs, and FPGAs.
The contributions of this paper are as follows:
We design the SDFG, a data-centric IR that enables separating code definition from its optimization, and provide operational semantics.
We demonstrate graph transformations and DIODE: an IDE for guided performance optimization of SDFGs.
We demonstrate SDFGs on fundamental kernels, polyhedral applications, and graph algorithms. Results are competitive with expert-tuned libraries from Intel and NVIDIA, approaching peak hardware performance, and up to 5 orders of magnitude faster than naïve FPGA code written with High-Level Synthesis, all from the same representation.
2. Data-Centric Application Development
Executing parallel scientific applications efficiently requires the utilization of two major resources: computational power and memory. While memory capacity steadily increases, memory latency, throughput, and energy efficiency improve at a slower rate (Unat et al., 2017). Data movement has thus become the primary bottleneck, consuming orders of magnitude more energy and taking the lion’s share of the runtime.
The resulting trend in high-performance computing optimizations, regardless of the underlying architecture (CPU, GPU, FPGA), is to increase data locality, i.e., keep information as close as possible to the processing elements and promote memory reuse. Examples of such optimizations include utilization of vector registers, memory access coalescing and “burst” accesses, application-specific caching using scratch-pad memory, and streaming data between processing elements. It is thus apparent that none of these optimizations involve the computations themselves.
Such optimizations, which require modifications to an application’s core (e.g., data layout change), cannot always be performed automatically. Even a simple application, such as matrix multiplication, requires multiple stages of transformations, including data packing and register-aware caching (Goto and Geijn, 2008). This problem especially resonates with general-purpose compilers, which have to be conservative with assumptions on data management and application semantics.
Because optimizations do not modify computations, and due to the fact that they differ for each architecture, maintaining portability of scientific applications requires separating computational semantics from data movement logic.
2.1. Decoupling Scientific Code and Optimizations
In order to improve productivity and performance portability, we propose to separate application development into two stages, as shown in Fig. 2. First, the problem is formulated as a high-level program, which includes the computational aspects and definition of data dependencies for each computation. Second, the code is transformed into a human-readable Intermediate Representation (IR), which can be modified for each system, without changing the original code.
The modifications performed on the IR (by the performance engineer in the figure) are made as a result of informed decisions based on the program structure, hardware information, and intermediate performance results. To support this, an interactive interface should be at the performance engineer’s disposal, enabling modification of the IR in a verifiable manner (i.e., without breaking semantics) and periodically measuring the results of such modifications.
In this development scheme, the domain scientist writes an entire application once for all architectures, and can freely update the underlying calculations. For the domain scientist, it is our goal that modifying high-performance code should be as easy as changing a mathematical equation.
2.2. Scientific Programming Interfaces
Scientific applications typically employ different programming models and Domain-Specific Languages (DSLs) to solve problems. To cater to the versatile needs of the domain scientists, SDFGs should be easily generated from various languages. We thus design the SDFG to inherently support high-level languages, and provide a low-level API to easily map other DSLs to SDFGs.
In particular, generating SDFGs is either possible directly, using a graph (node/edge) API, or from existing programming languages, enabling the use of existing IDEs and debuggers. Fig. 3 highlights the differences between interfaces using the same problem: computing for a matrix .
Python Programming Interfaces
In Figures 2(b) and 2(e), we use a high-level Python (Foundation, 2018) interface to define data-centric programs within existing code. SDFGs exist as decorated, strongly-typed functions in the application ecosystem, and can be interfaced with using plain data-types and numpy (Developers, 2018) multi-dimensional arrays. Within a function, it is possible to work with explicit or implicit dataflow. In the former case (Fig. 2(b)), the interface enables defining maps, tasklets, and reductions using decorated functions, within which fine-grained data dependencies must be explicitly defined using shift (<<, >>) operators on arrays. Within maps/tasklets, use of dynamic objects (e.g., dictionaries, lists) and direct access to arrays and streams is disallowed (§3).
In implicit dataflow (Fig. 2(e)), arrays can be used with operators (i.e., C = A @ B for matrix multiplication) and methods from the numpy interface (e.g., transpose), in order to generate sub-graphs. To support parametric sizes (e.g., of arrays and maps), we utilize symbolic math evaluation in Python by extending the SymPy (Team, 2018) library. The code can thus define symbolic sizes and use complex index expressions, which will be analyzed and simplified during SDFG compilation.
Once such a function is called (or explicitly compiled), we employ Python’s runtime AST reflection, bottom-up data dependency extraction, and the low-level SDFG API (Fig. 2(c)) to create the state machine and labeled dataflow multigraphs.
TensorFlow (TF) is a Machine Learning framework, built over a dataflow representation. In TF, the user defines the model to learn using high-level operations on tensors in Python or C++, each of which returns a new tensor (DFG node, implicitly allocating new memory). To execute a graph, the user specifies tensors they wish to compute, and TF traverses back through the graph to figure out which computations to run. We implement a runtime graph converter from TF to SDFGs (using the low-level API), which parses the TF node types into explicit computation. To use this frontend (Fig. 2(d)), a user only has to swap tf.Session with sdfg_tf.Session, while the rest of the code (i.e., data and model construction) remains unchanged.
Apart from a numpy interface, we lower a subset of the MATLAB syntax to SDFGs (Fig. 2(f)). The high-level math interface facilitates writing complex linear algebra (e.g., solvers) and conversion of existing scientific code. However, both MATLAB and Python are challenging, since they are dynamically typed and allow overriding existing names. For this purpose, we implement a MATLAB parser, and convert the resulting Python and MATLAB ASTs to Static Single Assignment (SSA) representation. We then run type inference on the inputs, propagating through the statements and annotate undetected types using special comments and type hints. The result is that the code in Figures 2(e) and 2(f) would both be parsed to ; , as opposed to direct computation in the explicit case. However, it is possible to obtain the same SDFG as Fig. 2(a) using graph transformations.
|Primitive|Description|
|---|---|
|State|State machine element.|
|Data|N-dimensional array container.|
|Tasklet|Fine-grained computational block.|
|Memlet|Data movement descriptor.|
|Dataflow and Concurrency||
|Stream|Streaming data container.|
|Map|Parametric graph abstraction for parallelism.|
|Conflict Resolution|Defines behavior during conflicting writes.|
|Nesting and Subgraph Aliases||
|Invoke|Call a nested SDFG.|
|Reduce|Reduction of one or more memlet axes.|
|Consume|Dynamic mapping of computations on stream elements.|
3. Stateful Dataflow Multigraphs
In this section we describe the Stateful DataFlow multiGraph (SDFG) intermediate representation. We define the SDFG syntax, discuss its expressiveness, and show how the representation can be mapped to high-performance applications.
An SDFG is defined as a directed graph of directed acyclic multigraphs, whose components are summarized in Table 1. Briefly, the SDFG is composed of acyclic dataflow multigraphs (i.e., where two nodes can be connected by multiple edges), in which nodes represent computation and storage, and edges represent data movement. To support cyclic data dependencies and control-flow constructs, such as sequential iteration (Fig. 1) or data-dependent decisions (Fig. 4), the multigraphs are encased in State nodes at the top-level graph. Following complete execution of the dataflow in a state node, state transitions (edges) specify a transition condition and a list of assignments, forming a state machine. For complete operational semantics of SDFGs, we refer to the supplementary material (Appendix A).
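The execution model described above can be illustrated with a small interpreter. This is a minimal sketch, not the actual SDFG runtime: the function `run_state_machine`, the tuple layout of states and transitions, and the example loop (guarded by a condition of the form `iter < N`) are all hypothetical names chosen for illustration. Each state's dataflow is represented by an ordinary Python callable standing in for an acyclic multigraph.

```python
def run_state_machine(states, start, memory, symbols):
    """Run states until no outgoing transition condition holds.

    states:  name -> (dataflow, transitions)
    dataflow: callable(memory, symbols), stands in for the acyclic multigraph
    transitions: list of (destination, condition, assignments)
    """
    current = start
    while current is not None:
        dataflow, transitions = states[current]
        dataflow(memory, symbols)          # the state's dataflow runs to completion first
        current = None
        for dst, condition, assignments in transitions:
            if condition(symbols):         # transition condition
                symbols.update(assignments(symbols))  # list of assignments
                current = dst
                break
    return memory


def noop(memory, symbols):
    pass


def double_all(memory, symbols):
    memory["A"] = [2 * x for x in memory["A"]]


# A sequential loop that runs the "body" state N times (hypothetical example).
states = {
    "init": (noop, [("body", lambda s: s["N"] > 0, lambda s: {"iter": 0})]),
    "body": (double_all, [("body",
                           lambda s: s["iter"] + 1 < s["N"],
                           lambda s: {"iter": s["iter"] + 1})]),
}

result = run_state_machine(states, "init", {"A": [1, 2, 3]}, {"N": 3})
```

With `N = 3`, the body state executes three times, so every element of `A` is multiplied by eight. Cyclic behavior lives only in the outer state graph, while each state's dataflow stays acyclic.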
Expressing an algorithm as an SDFG has several advantages over inherently sequential representations (e.g., C++ and other imperative languages), where statements are executed in order. It fosters the expression of parallelism, since sequential operation either implicitly occurs as a result of data dependencies, or is explicitly programmed using multiple states. In the rest of this section, we define the state (multigraph) nodes and their behavior.
Tasklet nodes represent arbitrary computational functions in any source language, which can only access and manipulate memory given as explicit data dependencies via Memlet edges. As seen in Section 2.2, the SDFG is designed for fine-grained tasklets, so as to allow analysis and modification in the optimization process.
A Data node represents a location in memory that contains an N-dimensional array, whose dimensions are annotated within the node. There are two types of data: input/output, accessible from the outside; and transient, which is only allocated for the duration of SDFG execution. This allows the performance engineer to distinguish between buffers that possibly interact with external systems, and ones that can be manipulated (e.g., data layout change) or eliminated entirely (e.g., via fusion). Fig. 4 shows an example that adds two numbers , doubles or halves the transient result depending on its value, and stores the end result back to .
Although states, tasklets, memlets, and data nodes are already expressive (specifically, Turing-complete), a major contributing factor to performance is parallelism (§2). Expressing parallelism is inherent in SDFGs by design. However, since programs often use parametric sizes for data, parametric parallelism must be built into the representation. For that purpose, we define Map nodes to create parametrically-sized subgraph scopes. As shown in Fig. 5, these nodes define new variables within an enclosed subgraph (dominated by the map entry node and post-dominated by the map exit node), in which execution may occur concurrently without violating pre- and post-condition semantics (see Appendix A). This enables variable-size parallelism, and considerably reduces the number of nodes in the constant case.
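One way to picture a map scope is as a parametric range whose points may be visited in any order. The sketch below is an illustration under that assumption only: `run_map` and `scale` are hypothetical names, and the shuffled sequential loop merely emulates the freedom a real scheduler has to run iterations concurrently.

```python
import itertools
import random


def run_map(ranges, tasklet, seed=0):
    """Invoke `tasklet` once per point of a parametric range, in shuffled order.

    ranges: variable name -> (begin, end), end exclusive
    """
    names = list(ranges.keys())
    points = list(itertools.product(*(range(b, e) for b, e in ranges.values())))
    random.Random(seed).shuffle(points)    # any visiting order is a legal schedule
    for point in points:
        tasklet(**dict(zip(names, point)))


M, N = 3, 4                                # sizes could equally be symbolic parameters
A = [[i * N + j for j in range(N)] for i in range(M)]
B = [[0] * N for _ in range(M)]


def scale(i, j):
    # The tasklet touches only the elements it is explicitly given: A[i,j] and B[i,j].
    B[i][j] = 2 * A[i][j]


run_map({"i": (0, M), "j": (0, N)}, scale)
```

Because each invocation writes a distinct element, the result does not depend on the seed. When several invocations write to the same element, the order matters and conflicting writes have to be resolved explicitly.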
In order to handle concurrent memory writes to the same address from within a map, we define an optional Conflict Resolution function attribute on the memlet. As shown in Fig. 2(a), such memlets are visually highlighted for the performance engineer using dashed lines. Implementation-wise, such memlets can be implemented as atomic operations, critical sections, or accumulator modules, depending on the target architecture and the function.
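The following sketch shows one possible realization of a conflict-resolved write on a CPU, using a lock as a stand-in for a critical section. The class `ConflictResolvedArray` and the `histogram` example are hypothetical and are not part of the SDFG implementation; they only illustrate how a resolution function combines the stored value with each incoming write.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConflictResolvedArray:
    """Array whose writes are combined with a resolution function (old, new) -> value."""

    def __init__(self, size, wcr, identity):
        self._data = [identity] * size
        self._wcr = wcr
        self._lock = threading.Lock()      # stands in for atomics or a critical section

    def write(self, index, value):
        with self._lock:
            self._data[index] = self._wcr(self._data[index], value)

    def tolist(self):
        return list(self._data)


def histogram(values, bins):
    out = ConflictResolvedArray(bins, wcr=lambda old, new: old + new, identity=0)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Many workers write to the same few addresses; the sum resolves the conflicts.
        list(pool.map(lambda v: out.write(v % bins, 1), values))
    return out.tolist()


counts = histogram(range(100), 4)
```

A single global lock is the simplest correct choice for this sketch. A target-specific implementation would pick a cheaper mechanism where the function allows it, for example an atomic add for summation.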
Stream nodes are defined as locations in memory, accessed using multiple-producer multiple-consumer concurrent queue semantics. Streams can be multi-dimensional, in order to represent (possibly parametric) arrays of streams, a construct that can be used to build systolic arrays, commonly used in efficient circuit design. An example of a stream can be seen in Fig. 5(a), in which elements are conditionally pushed through a stream (depending on a query) into an array concurrently.
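The conditional-push pattern described above can be emulated with a thread-safe queue. This is a minimal sketch assuming Python's `queue.Queue` as the multiple-producer multiple-consumer container; `filter_through_stream` and its parameters are hypothetical names.

```python
import queue
import threading


def filter_through_stream(data, predicate, producers=4):
    """Push elements that satisfy `predicate` through a stream, then drain into an array."""
    stream = queue.Queue()                 # concurrent FIFO shared by all producers

    def produce(chunk):
        for x in chunk:
            if predicate(x):
                stream.put(x)              # conditional push

    threads = [threading.Thread(target=produce, args=(data[p::producers],))
               for p in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    out = []
    while not stream.empty():              # all producers have finished at this point
        out.append(stream.get())
    return out


selected = filter_through_stream(list(range(20)), lambda x: x % 3 == 0)
```

The order of elements in the output depends on thread interleaving, so only the set of selected elements is deterministic. The number of pushed elements is not known in advance, which is why such accesses cannot be given a fixed count.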
As programs occasionally require stateful execution within dataflow, or a parametric number of state machines, our model allows the definition of nested SDFGs via the Invoke node. As demonstrated in the Mandelbrot example (Fig. 5(b)), each element requires a different number of iterations to converge. In this case, an invoke node calls an SDFG within the SDFG state to manage the convergence loop. The semantics of invoke are equivalent to a tasklet, thereby disallowing access to external memory without memlets. Recursive calls to the same SDFG are disallowed, as the nested state machine may be inlined or transformed by the performance engineer. As a result, SDFGs cannot natively express parametric-depth and data-dependent recursion (e.g., quick-sort). However, recursion is a CPU-oriented construct that does not map well to other architectures, and it is hard to analyze its data movement. Nevertheless, SDFGs can represent a stack to emulate the required behavior, though it is discouraged.
3.1. Memlet Definition
Formally, a memlet is defined by the tuple (src, dst, subset, reindex, accesses, wcr). The subset function selects which subset of elements visible at the source will flow to the destination; the reindex function specifies the indices at which data will be visible at the destination; accesses is an indicator of the number of accesses on the subset, or the number of pop/push operations on a source/destination stream (denoted ? if unknown); and wcr is a write-conflict resolution function for a data type . We express subset and reindex as functions on integer sets, and implement them as lists of exclusive ranges, both to promote analyzability and to provide human-readable labels, e.g., A[0:N:2,5:M-3].
The reasoning behind encoding the number of accesses is threefold: (1) it is an estimator of data movement volume, used for analysis by the performance engineer; (2) it helps to programmatically identify patterns such as indirect memory access; and (3) the number matches production and consumption counts on streams, which can be used to verify correctness of the streaming behavior in a circuit.
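As a rough illustration of the tuple above, a memlet can be modeled as a small record whose subset is a list of ranges. The field types, the end-exclusive `(begin, end, step)` convention, the use of `None` for an unknown access count, and the helper `subset_volume` are assumptions made for this sketch and may differ from the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Range = Tuple[int, int, int]               # (begin, end, step), end exclusive


@dataclass(frozen=True)
class Memlet:
    src: str
    dst: str
    subset: Tuple[Range, ...]              # elements selected at the source
    reindex: Tuple[Range, ...]             # indices at which they appear at the destination
    accesses: Optional[int]                # None plays the role of "?"
    wcr: Optional[Callable] = None         # write-conflict resolution function, if any

    def subset_volume(self) -> int:
        """Number of elements selected by the subset."""
        volume = 1
        for begin, end, step in self.subset:
            volume *= len(range(begin, end, step))
        return volume


N, M = 8, 12
# The label A[0:N:2,5:M-3] from the text, with concrete sizes substituted.
memlet = Memlet(src="A", dst="tasklet",
                subset=((0, N, 2), (5, M - 3, 1)),
                reindex=((0, 4, 1), (0, 4, 1)),
                accesses=16)
```

For `N = 8` and `M = 12`, the subset selects four rows and four columns, so the volume is 16 elements. Comparing such a volume with the declared access count is one way to estimate data movement.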
Indirect memory access, i.e., A[b[i]], is an important characteristic of sparse data structure and pointer jumping algorithms. Indirection cannot be directly represented by a single memlet. As shown in Fig. 7 (an excerpt of Sparse Matrix-Vector Multiplication), such statements are converted to a subgraph, where the internal access is given exactly (A_col[j] in the figure), and the indirect memory is given by a memlet with one access (x(1)[:]). The memory is then accessed dynamically and copied to a node using a tasklet.
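The indirection pattern can be made concrete with a plain sparse matrix-vector product in compressed sparse row form. This sketch is not the generated code: the names `A_row`, `A_val`, and `b` are assumed for illustration, while `A_col` and `x` follow the text.

```python
def spmv_csr(A_row, A_col, A_val, x):
    """Compute b = A @ x for a matrix stored in compressed sparse row format."""
    b = [0.0] * (len(A_row) - 1)
    for i in range(len(A_row) - 1):
        for j in range(A_row[i], A_row[i + 1]):
            col = A_col[j]                 # exact access: the index is known as A_col[j]
            x_elem = x[col]                # indirect access: one element out of all of x
            b[i] += A_val[j] * x_elem
    return b


# The 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
A_row = [0, 2, 3, 5]
A_col = [0, 2, 1, 0, 2]
A_val = [1.0, 2.0, 3.0, 4.0, 5.0]
b = spmv_csr(A_row, A_col, A_val, [1.0, 2.0, 3.0])
```

The read of `x[col]` is the step that a single memlet cannot describe statically: the whole range `x[:]` must be made available, although exactly one element is accessed.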
3.2. Subgraph Aliases
The last two nodes in Table 1 are alias nodes: semantically-equivalent “shorthands” of certain subgraphs. These nodes are nonetheless defined in order to provide optimized per-architecture implementations of their functionality.
The Reduce alias node is used to implement fast, target-dependent reduction procedures. It is defined by axes to reduce and a wcr function, similar to a conflict resolution memlet. It is semantically equivalent to a map whose range spans the incoming and outgoing memlet data, containing an identity tasklet (output=input) and an output memlet annotated by the given wcr for the reduced axes. The example in Fig. 7(a) uses this node to implement declarative matrix multiplication by way of tensor contraction in two dimensions. In the example, all pairs are multiplied into a 3-dimensional transient tensor tmp. This tensor is then reduced with summation on the last () axis to produce the 2-dimensional matrix C. As we shall show, a data movement transformation can directly convert this representation to a classical matrix multiplication, as in Fig. 2(a).
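The equivalence between the declarative form and classical matrix multiplication can be checked with a short pure-Python sketch. The function names are hypothetical, and nested lists stand in for the data nodes.

```python
def matmul_via_reduce(A, B):
    """Declarative form: a map fills a 3-D transient, then the last axis is summed."""
    M, K, N = len(A), len(B), len(B[0])
    # Map: every product goes to its own element of the transient tensor tmp.
    tmp = [[[A[i][k] * B[k][j] for k in range(K)]
            for j in range(N)] for i in range(M)]
    # Reduce: summation over the last axis produces the 2-D result.
    return [[sum(tmp[i][j]) for j in range(N)] for i in range(M)]


def matmul_fused(A, B):
    """Fused form: no transient, products are accumulated with a sum resolution."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]   # conflicting writes resolved by summation
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_reduce = matmul_via_reduce(A, B)
C_fused = matmul_fused(A, B)
```

Both functions compute the same matrix. The first allocates a transient holding one element per multiplication, which is the data movement cost that the fused form avoids.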
Consume entry/exit nodes complement maps in the same manner that streams complement data nodes. Such scopes enable producer/consumer relationships via dynamic parallel processing of streams. Consume nodes are defined by the number of processing elements, an input stream to consume from, and a quiescence condition that, when evaluated to true, stops processing. Semantically, a Consume scope is equivalent to a map using the same range as the processing elements, with an internal nested SDFG containing the scope. The nested SDFG contains a single state, containing the scope and connecting to itself with the negated quiescence condition. A trivial example is shown in Fig. 7(b), which computes the Fibonacci recurrence relation without memoization.
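A sequential emulation of such a scope, for the Fibonacci recurrence without memoization, might look as follows. This sketch assumes a single processing element, a quiescence condition of "the stream is empty", and a summation as conflict resolution on the output; the function name is hypothetical.

```python
from collections import deque


def fibonacci_consume(n):
    """Compute the n-th Fibonacci number by consuming work items from a stream."""
    stream = deque([n])                    # the input stream, seeded with the query
    out = 0
    while stream:                          # quiescence condition: stream is empty
        value = stream.popleft()           # pop one element to process
        if value < 2:
            out += value                   # base case, accumulated with a sum
        else:
            stream.append(value - 1)       # push the two sub-problems back
            stream.append(value - 2)
    return out


fib10 = fibonacci_consume(10)
```

The amount of work grows exponentially in `n`, since no results are reused. The point of the example is that the number of elements processed is determined dynamically by the data, which a map with a fixed range cannot express.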
4. Performance Engineer Workflow
The stateful dataflow multigraph is designed to expose application data movement motifs, regardless of the underlying computations. These properties may be characterized by memory access patterns or by an optimizable combination of control- and data-flow. As such, the optimization process of an SDFG consists of finding and leveraging such motifs, in order to mutate and tune the dataflow of the program.
Below, we describe the two principal tools we provide the performance engineer to guide the optimization process.
4.1. Graph Transformations
Informally, we define a graph transformation on an SDFG as a “find and replace” operation, either within one state or between several, which can be performed if all of the conditions match. For general optimization strategies (e.g., tiling), we provide a standard library of such transformations, which is meant to be used as a baseline for performance engineers (see Appendix B for transformations used in Section 6).
Transformations consist of a pattern subgraph and a replacement subgraph. A transformation also includes a matching function, used to identify instances of the pattern in an SDFG, and programmatically verify that all other requirements are met. To find matching subgraphs in SDFGs, we use the VF2 algorithm (Cordella et al., 2004) to find isomorphic subgraphs. Using the transformation infrastructure, performance engineers can implement and share optimizations across applications.
Two examples of SDFG transformations can be found in Figures 8(a) and 8(b). In Fig. 8(a), the transformation detects a pattern where Reduce is invoked immediately following a Map, reusing the computed values. In this case, the temporary array $A can be removed and computations are fused with a conflict resolution, resulting in the replacement . Note that this transformation can only be applied if the array is not used by any other state or connected to other nodes. In the second example (Fig. 8(b)), a local array is added between two map nodes, and the relative indices are changed in all subsequent memlets.
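The "find and replace" structure of the first example can be sketched on a toy graph. This is a hand-written pattern check on a single state, not the VF2-based matching described above, and the node kinds, function names, and graph encoding are all assumptions made for illustration.

```python
def find_map_reduce_matches(nodes, edges):
    """Find (map_exit, transient, reduce) chains where the transient has no other users.

    nodes: name -> kind; edges: list of (src, dst) pairs.
    """
    matches = []
    for array, kind in nodes.items():
        if kind != "transient":
            continue
        preds = [s for s, d in edges if d == array]
        succs = [d for s, d in edges if s == array]
        # Applicability condition: exactly one producer and one consumer.
        if (len(preds) == 1 and len(succs) == 1
                and nodes[preds[0]] == "map_exit"
                and nodes[succs[0]] == "reduce"):
            matches.append((preds[0], array, succs[0]))
    return matches


def apply_map_reduce_fusion(nodes, edges, match):
    """Remove the transient and the reduce node, connecting the map exit to the output."""
    map_exit, array, reduce_node = match
    outputs = [d for s, d in edges if s == reduce_node]
    removed = (array, reduce_node)
    new_nodes = {n: k for n, k in nodes.items() if n not in removed}
    new_edges = [(s, d) for s, d in edges if s not in removed and d not in removed]
    # In a full implementation these edges would carry the conflict resolution function.
    new_edges += [(map_exit, out) for out in outputs]
    return new_nodes, new_edges


nodes = {"A": "data", "entry": "map_entry", "mult": "tasklet", "exit": "map_exit",
         "tmp": "transient", "sum": "reduce", "C": "data"}
edges = [("A", "entry"), ("entry", "mult"), ("mult", "exit"),
         ("exit", "tmp"), ("tmp", "sum"), ("sum", "C")]

matches = find_map_reduce_matches(nodes, edges)
fused_nodes, fused_edges = apply_map_reduce_fusion(nodes, edges, matches[0])
```

The sketch checks usage only within one graph. As noted above, a complete matching function would also have to verify that the array is not used by any other state.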
In Fig. 10, we see an example of a transformation workflow on a 4-point stencil. Only by applying tiling and local storage are we able to introduce computational redundancy within tiles, and therefore fuse two time iterations without violating program semantics. In this case, the four transformations lead to a 3.6× speedup and a considerable reduction in overall data movement.
4.2. DIODE
SDFGs are intended to be malleable and interactive, which we realize with an Integrated Development Environment (IDE). The Data-centric Integrated Optimization Development Environment, or DIODE, enables programmers to interactively modify SDFGs and create transformations.
In its main view (Fig. 11) — the optimizer window — it is possible to compile Python and MATLAB programs to SDFGs; modify the resulting graphs, either manually or using transformations; and run the generated code locally or remotely. All changes to the SDFG are immediately reflected in the generated code, which leads to substantially increased productivity. For instance, in GPUs an engineer can move arrays from global to shared memory with the click of a button, or do the same to schedule maps as separate units or fused, where the code generation would add the appropriate memory copies and barriers. By clicking the run button, the program executes and its runtime is appended to a bar graph (Fig. 11, bottom). This enables progress tracking and historical inspection of performance results for making informed optimization decisions. In addition to applying transformations, users may create their own using a transformation editor. For scaling development to large programs, the view of the SDFG is hierarchical, namely, scopes and states can be collapsed. This allows the performance engineer to focus on a specific program region and ease graph rendering load.
We demonstrate the process of interactively optimizing a matrix multiplication example (Fig. 7(a)) to the version in Section 6 in a video (https://www.vimeo.com/301317247). As we shall show, using the IDE and the SDFG representation yields nearly the same performance as Intel’s optimized Math Kernel Library (MKL) (Intel, 2018).
5. Hardware Mapping
Below we discuss mapping SDFG nodes to architecture-specific programming paradigms, followed by the compilation pipeline of the IR.
Mapping an SDFG to different architectures is supported by annotating nodes with a schedule executor (Map, Consume, Reduce, and Invoke nodes) and their storage location (Data, Stream). For example, a map node can be scheduled to GPU_Device, which would in turn generate a kernel call for the scope subgraph of the map. Due to the construction of the SDFG, one important benefit is that the same graph that runs on a GPU or an FPGA can also run correctly on a CPU and vice versa (albeit not always in a profitable way), facilitating porting applications between architectures.
5.1. Mapping to Load/Store Architectures
Tasklets and Data
Throughout the process of compiling and running SDFGs, the tasklet code remains immutable. On all platforms, tasklet nodes are directly converted to C++ code. To support high-level languages, we implement a Python-to-C++ converter, using features from C++14. We include a thin C++ header-only runtime, which contains data movement routines and abstractions. Data nodes are directly mapped to memory, allocated statically or dynamically according to the node storage type. Streams are implemented with concurrent queues on CPUs, or use concepts such as warp-aggregated atomics (Adinetz, 2014) for GPUs.
The map construct is versatile enough to encapsulate parallel iteration on CPUs (implemented using OpenMP directive annotation), as well as GPU execution (as mentioned above). Additionally, different connected components within a state can also run in parallel, and are thus mapped to parallel sections in OpenMP or different streams in GPUs. These two concepts are cumbersome to program manually for both platforms, where synchronization mistakes, ordering of calls to the library, or even not knowing about features (e.g., nowait, non-blocking CUDA streams) may drastically impact performance or produce wrong results. We also implement more advanced schedulers such as dynamic GPU thread-block scheduling, which enables collaborative processing by re-dividing the map range (which may be parametric, e.g., in sparse structures) among warps at runtime.
Consume nodes use the fact that processing does not have to occur in order, and are implemented using batch stream dequeue and atomic operations to asynchronously pop and process elements. The potential to encompass complex parallel patterns like work stealing schedulers using high-performance implementations of this node dramatically reduces code complexity and could yield higher performance.
Conflict resolution nodes also map to load/store architectures using traditional concepts, such as atomic operations and critical sections, depending on the operator. Specifically, we use symbolic and AST analysis to detect all 12 reductions defined by the MPI and OpenMP standards.
Memory and Accelerator Interaction
Storage locations in SDFGs map to the memory hierarchy on the target architecture (e.g., GPU shared memory, CPU pinned memory). If different arrays are connected by memlets, the appropriate memory copy calls are generated automatically. To assist performance engineers and increase productivity, SDFGs provide dataflow validation passes to ensure that memory accesses are indeed possible, prior to compilation of the generated code. Examples of invalid accesses include GPU tasklets accessing virtual (non-pinned) host memory via DMA, or host code accessing GPU shared memory directly.
The SDFG representation inherently enables automatic porting of CPU programs to GPUs, using data-centric transformations. We provide such a transformation, which (1) copies data read/write ranges to/from the GPU before/after they are used; (2) converts all maps to GPU maps and scope-less tasklets to trivial (1-thread) GPU kernels; (3) allocates transient memory on the GPU; and (4) applies state and transient fusion to eliminate unnecessary copies. As we shall show, even without manual tuning, this transformation yields competitive results on various applications.
5.2. Mapping to Pipelined Hardware Architectures
Parallelism in reconfigurable hardware is primarily achieved by exploiting temporal locality through pipelining (de Fine Licht et al., 2018), and secondarily by exploiting spatial locality through vectorization (and similar bandwidth-limited replication). The SDFG directly exposes opportunities for both: as tasklets always exist in a dataflow context, they can be automatically pipelined. In maps, we can unroll the scope to achieve vectorization of the data path, or unroll them into independent data paths for data-dependent access.
In addition to the internal dataflow of operations within a tasklet, additional temporal parallelism can be exploited by reducing coarse-grained dependencies between components to fine-grained dependencies that can be streamed between them. For components with identical iteration spaces this can be done by fusing them. This is the same optimization performed on the SDFG as when targeting software, with the additional benefit of fully parallelizing the merged sections by pipelining them. When the iteration spaces differ, but still allow for exchanging fine-grained dependencies, this is implemented using stream nodes, produced and consumed from independent components in the SDFG, generating concurrent modules in hardware.
Explicit control flow plays a number of roles in mapping to hardware, ranging from the outer to the inner level of the SDFG. At the outermost level, states can be used to express a sequential section to offload to hardware. Within a hardware section, states allow enforcing sequentiality for cases where coarse-grained synchronization is unavoidable, resulting in state machines that change the behavior of hardware modules depending on the stage of the program. Finally, within dataflow sections, control flow can be embedded as nested SDFGs to implement predicated behavior, achieving fine-grained control flow without breaking the pipeline.
Unlike instruction-based architectures, where all computational logic is at equal distance to the off-chip memory interface, hardware allows distributing logic across a much larger area, and does not require every computational element to communicate directly with the interface to off-chip memory. In SDFGs, all accesses to off-chip memory are explicit on the graph, quantifying the number of components that must be connected to memory, the memory bandwidth, and latency utilization of a given hardware section. This allows the performance engineer to immediately identify expensive memory accesses, and mitigate the creation of extraneous physical buses to memory interfaces.
To scale up logic utilization, parametric generation of hardware can be represented in SDFGs with unrolling (Unroll schedule type). In addition to spatially unrolling computations (e.g., when vectorizing, using a “horizontal” map), we can use maps to generate “vertical” pipeline parallelism, by letting each element in the expanded map read from distinct memory. In the example shown in Fig. 12, each processing element (nested SDFG) accesses a distinct stream from a stream array, forming a chain of parallel components.
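A chain of processing elements connected by a stream array can be simulated cycle by cycle as follows. This is a minimal sketch, assuming a toy computation (each element adds its own index) and a hypothetical helper `run_chain`; it is not the generated hardware, but it shows how element `p` reads from stream `p` and writes to stream `p + 1`.

```python
from collections import deque


def run_chain(inputs, num_pes):
    """Simulate num_pes processing elements linked by num_pes + 1 streams."""
    streams = [deque() for _ in range(num_pes + 1)]
    streams[0].extend(inputs)
    cycles = 0
    while any(streams[:-1]):             # some element is still in flight
        cycles += 1
        # Visit elements back to front so a value advances one stage per cycle.
        for pe in reversed(range(num_pes)):
            if streams[pe]:
                value = streams[pe].popleft()
                streams[pe + 1].append(value + pe)   # toy per-element work
    return list(streams[num_pes]), cycles


outputs, cycles = run_chain([10, 20, 30], num_pes=3)
```

With three inputs and three elements, the chain drains in five cycles (inputs plus stages minus one), which is the usual behavior of a filled pipeline.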
We implement generation of hardware kernels for PCIe-attached FPGA accelerators. Each hardware section in the SDFG (maps with FPGA schedules) produces an OpenCL kernel, written to the FPGA when starting the program. As with GPUs, data connections between FPGA and host memory emit copies over PCIe, and fast memory can be mapped to the FPGA fabric using an FPGA local storage type. Each component of the dataflow graph in a section produces a separately pipelined hardware module, synchronizing with others through fast memory, most commonly streams, which are instantiated as FIFO interfaces in hardware.
5.3. Compilation Pipeline
Compiling an SDFG for a specific target architecture comprises three steps: ❶ data dependency inference, ❷ code generation, and ❸ compiler invocation.
Before step ❶, the SDFG memlets are labeled with local subset/reindex sets (§3 and Appendix A). These sets need to be converted to absolute address sets in the data that is pointed to by the nodes. This information can then be used to generate exact memory copy calls (e.g., host-to-GPU) and addressing in the tasklet input connectors. We perform this by propagating subsets from data nodes, repeatedly applying the subset and reindex functions while storing the addresses.
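For one-dimensional strided ranges, propagating local subsets to absolute addresses amounts to repeated range composition. The sketch below is an illustration under assumptions: it uses Python `range` objects with positive steps in place of the subset and reindex functions of §3, performs no bounds checking, and the names `compose` and `propagate` are hypothetical.

```python
def compose(outer, inner):
    """Map `inner`, given relative to `outer`, to absolute indices.

    Both arguments are range objects with positive steps. Relative index j
    corresponds to absolute index outer.start + j * outer.step.
    """
    start = outer.start + inner.start * outer.step
    stop = outer.start + inner.stop * outer.step
    return range(start, stop, outer.step * inner.step)


def propagate(subsets):
    """Fold a path of nested subsets, outermost first, into one range."""
    absolute = subsets[0]
    for local in subsets[1:]:
        absolute = compose(absolute, local)
    return absolute


# Elements 64..127 of an array, every second one of the first 32 of those,
# and finally positions 4..7 of that selection.
addresses = propagate([range(64, 128), range(0, 32, 2), range(4, 8)])
```

The resulting range can be used directly to emit a copy of exactly the accessed elements, here indices 72, 74, 76, and 78.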
The code generation process of an SDFG (step ❷) is hierarchical, so as to support different hardware architectures. The code generator begins by emitting C++ code for each state in the top-level SDFG, as well as the external interface code for interoperability. Within a state, every node is traversed in topological order, and a platform-specific dispatcher is assigned to generate the respective code, depending on the node’s storage or schedule type. The process continues recursively via map/consume scopes and nested SDFGs, thereby generating heterogeneous codes using more than one dispatcher. Between states, transitions are generated either by emitting for-loops and branches when detected, or by using conditional goto statements as a fallback. We see in practice that LLVM and gcc detect and optimize goto-based code during compilation, and performance is seldom impaired.
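The traversal and the goto fallback can be sketched with a small string-emitting generator. Everything here is illustrative: the dispatcher table, the emitted call syntax, and the function names are assumptions for this example and do not reflect the actual code generator; for-loop detection is omitted, so every transition uses the fallback.

```python
from graphlib import TopologicalSorter

# Hypothetical dispatchers: each turns a node name into a line of target code.
DISPATCHERS = {
    "cpu": lambda name: f"{name}();",
    "gpu": lambda name: f"{name}<<<grid, block>>>();",
}


def emit_state_body(nodes, edges):
    """nodes: {name: schedule}; edges: [(src, dst)] dataflow dependencies."""
    order = TopologicalSorter({name: set() for name in nodes})
    for src, dst in edges:
        order.add(dst, src)                  # dst depends on src
    return [DISPATCHERS[nodes[name]](name) for name in order.static_order()]


def emit_program(states, transitions, start):
    """states: {name: (nodes, edges)}; transitions: [(src, dst, cond or None)]."""
    lines = [f"goto {start};"]
    for name, (nodes, edges) in states.items():
        lines.append(f"{name}: {{")
        lines += [f"  {stmt}" for stmt in emit_state_body(nodes, edges)]
        lines.append("}")
        outgoing = [(dst, cond) for src, dst, cond in transitions if src == name]
        # Conditional transitions first, then the unconditional one (if any).
        lines += [f"if ({cond}) goto {dst};" for dst, cond in outgoing if cond]
        fallthrough = [dst for dst, cond in outgoing if cond is None]
        lines.append(f"goto {fallthrough[0]};" if fallthrough else "goto done;")
    lines.append("done: ;")
    return "\n".join(lines)


states = {
    "init": ({"load": "cpu"}, []),
    "loop": ({"a": "gpu", "b": "cpu"}, [("a", "b")]),
}
transitions = [("init", "loop", None), ("loop", "loop", "i < N")]
code = emit_program(states, transitions, start="init")
```

The `loop` state mixes two dispatchers within one state, and its back edge is emitted as a conditional goto.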
In step ❸, we invoke compiler(s) for the dispatched code according to the used dispatchers, such as nvcc for CUDA and xocc for the Xilinx SDAccel toolchain. To maintain portability across Linux, Windows, and macOS, we use CMake to manage the compilation process. It is possible to compile a shared object (.so/.dll file) once and cache the results. The compiled file can then be used directly by way of inclusion in existing applications, or through Python, using numpy arrays as parameters.
6. Performance Evaluation
We evaluate the performance of SDFGs on a set of fundamental kernels, followed by two case studies: a well-known benchmark suite, and a representative problem in parallel graph algorithms.
We run all of our experiments on a server that contains an Intel 12-core Xeon E5-2650 v4 CPU (clocked at 2.20GHz, HyperThreading disabled, 64 GB DDR4 RAM) and a Tesla P100 GPU (16 GB HBM2 RAM) connected over PCI Express. For FPGA results, we use a Xilinx VCU1525 board, hosting an XCVU9P FPGA and 4 DDR4 banks at 2400 MT/s. We run the experiments 30 times and report the median result, where the error-bars indicate the 25th and 75th percentiles of all runs of a specific experiment. For Polybench running times, we use the provided measurement tool, which reports the average time of 5 runs. All reported results were executed in hardware, including the FPGA.
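The reported statistics can be computed as in the following sketch. The interpolation method for percentiles is not stated in the text, so the `inclusive` method used here is an assumption, as is the helper name `summarize`.

```python
import statistics


def summarize(runtimes):
    """Return the median and the 25th/75th percentiles of a list of runtimes."""
    p25, median, p75 = statistics.quantiles(runtimes, n=4, method="inclusive")
    return {"median": median, "p25": p25, "p75": p75}


# Five made-up runtimes (in seconds) for illustration only.
summary = summarize([5, 1, 4, 2, 3])
```

The median is robust to occasional outlier runs, which is why it is preferred over the mean for such measurements.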
Generated code from SDFGs is compiled using gcc 8 for CPU results, CUDA 9.2 for GPU, and Xilinx SDAccel 2018.2 for FPGAs. Flags used for compilation: -std=c++14 -O3 -march=native -ffast-math for CPU, -std=c++14 -arch sm_60 -O3 for GPU, and -std=c++11 -O3 for FPGA. Fundamental kernels use single-precision floating point types, Polybench uses the default experiment data-types (mostly double-precision), and graphs use integers.
6.1. Fundamental Computational Kernels
We begin by evaluating 5 core scientific computing kernels, implemented over SDFGs:
Matrix Multiplication (MM): Multiplies two 2,048-by-2,048 matrices.
Jacobi Stencil: A 5-point stencil repeatedly computed on a 2,048 square 2D domain for 1,024 iterations, with constant (zero) boundary conditions.
Histogram: Outputs the number of times each value (evenly binned) occurs in an 8,192 square 2D image.
Query: Filters a column of 67,108,864 elements according to a given condition (filters roughly 50%).
Sparse Matrix-Vector Multiplication (SpMV): Multiplies a CSR matrix (8,192 square, 33,554,432 non-zeros) with a dense vector.
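For reference, the semantics of two of these kernels can be written in a few lines of plain Python. These are unoptimized sketches, not the evaluated implementations; in particular, the uniform stencil weight of 0.2 is an assumption, since the coefficients are not specified above.

```python
def jacobi_step(grid):
    """One 5-point stencil sweep; boundary cells stay at zero."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out


def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]   # indirect access to x
    return y


# A = [[1, 0, 2], [0, 3, 0]] in CSR form, multiplied by a vector of ones.
y = spmv_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The indirect access `x[col_idx[k]]` is what makes SpMV irregular, in contrast to the fixed neighborhood of the stencil.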
Our results, shown in Fig. 13, are compared with Intel MKL (Intel, 2018) and HPX (Kaiser et al., 2014) for CPU; NVIDIA CUBLAS (NVIDIA, 2018a), CUSPARSE (NVIDIA, 2018b), and CUB (CUB, 2018), as well as Hybrid Hexagonal Tiling over PPCG (Verdoolaege et al., 2013) for GPU; Halide (Ragan-Kelley et al., 2013) (best of auto-tuned and manually tuned) for CPU and GPU; and Xilinx Vivado HLS/SDAccel (Xilinx, 2018b, a) and Spatial (Koeplinger et al., 2018) for FPGA.
On all applications, our SDFG results only employ data-centric transformations, keeping tasklets intact (§4.1). We highlight key results for all platforms below.
In MM, a highly tuned kernel, SDFGs achieve 98.6% of the performance of MKL, 70% of CUBLAS, and 90% of CUTLASS, which is the upper bound of CUDA-based implementations of MM. On FPGA, SDFGs yield a result 4,992× faster than naïve HLS over SDAccel. We also run the FPGA kernel for matrices and compare to the runtime of reported for Spatial (Koeplinger et al., 2018) on the same VU9P chip. We measure , yielding a speedup of .
We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.
For Histogram, our representation also enables vectorizing the program, achieving 8× the performance of gcc, where the others probably fail due to the kernel’s data-dependent accesses. We implement a two-stage kernel for FPGA, where the first stage reads 16 element vectors and scatters them to 16 parallel processing elements generated with map unrolling (similar to Fig. 12), each computing a separate histogram. In the second stage, the histograms are merged on the FPGA before copying back the final result. This yields a speedup in hardware.
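The two-stage scheme can be modeled in software as follows. This is a sketch of the idea only, assuming that the input already contains bin indices and using a hypothetical function name; in hardware the per-element histograms are updated in parallel, whereas the loop below is sequential.

```python
def histogram_two_stage(pixels, num_bins, num_pes=16):
    """Histogram via per-element partial histograms and a final merge."""
    partial = [[0] * num_bins for _ in range(num_pes)]
    # Stage 1: read vectors of num_pes values; lane p goes to element p,
    # so no two elements ever update the same partial histogram.
    for base in range(0, len(pixels), num_pes):
        vector = pixels[base:base + num_pes]
        for pe, value in enumerate(vector):
            partial[pe][value] += 1
    # Stage 2: merge the partial histograms into the final result.
    return [sum(hist[b] for hist in partial) for b in range(num_bins)]


# 40 values cycling through 4 bins: every bin is hit 10 times.
counts = histogram_two_stage([i % 4 for i in range(40)], num_bins=4)
```

Keeping one histogram per processing element removes the write conflicts that the data-dependent accesses would otherwise cause.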
In Query, SDFGs are able to use streaming data access to parallelize the operation automatically, achieving significantly better results than both HPX and STL. On FPGA we read wide vectors, then use a deep pipeline to pack the sparse filtered vectors into dense vectors. This scheme yields a speedup, similar to Histogram.
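The packing step on FPGA can be illustrated with the following software model. It is a minimal sketch, assuming a vector width of 4 and a hypothetical function name; the hardware version performs the packing in a deep pipeline rather than in a sequential loop.

```python
def filter_and_pack(column, predicate, width=4):
    """Read fixed-width vectors, keep matching values, emit dense vectors."""
    packed, current = [], []
    for base in range(0, len(column), width):
        vector = column[base:base + width]       # one wide read
        for value in vector:
            if predicate(value):
                current.append(value)
                if len(current) == width:        # a dense vector is full
                    packed.append(current)
                    current = []
    if current:
        packed.append(current)                   # flush the partial vector
    return packed


# Keep the even values of 0..9 (roughly 50% selectivity).
dense = filter_and_pack(list(range(10)), lambda v: v % 2 == 0)
```

Each filtered input vector is sparse; accumulating survivors until a full vector is available keeps the output writes as wide as the input reads.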
Jacobi on CPU uses a custom transformation (Diamond-Tiling). We see that it outperforms standard implementations by two orders of magnitude (90× faster than Polly), and marginally outperforms Pluto, which uses a similar tiling technique. In Halide, when all timesteps are hard-coded in the pipeline, its auto-tuner yields the best result, which is 20% faster than SDFGs. For FPGAs, Jacobi is mapped to a systolic array of processing elements, allowing it to scale up with FPGA logic to . Overall, the results indicate that data-centric transformations can yield competitive and even state-of-the-art performance across both architectures and memory access patterns.
6.2. Case Study I: Polyhedral Applications
We run the Polybench (Pouchet, 2016) benchmark, without any optimizing transformations, over SDFGs. We show that the representation itself exposes enough parallelism to compete with state-of-the-art polyhedral compilers, outperform them on GPUs, and provide the first complete set of placed-and-routed Polybench kernels on an FPGA.
To demonstrate the wide coverage provided by SDFGs, we apply the FPGATransform automatic transformation to offload each Polybench application to the FPGA during runtime, use our simulation flow to verify correctness, and finally perform the full placement and routing process. The same applies for GPUTransform (§5). We execute all kernels on the accelerators, including potentially unprofitable ones (e.g., tasklets without parallelism).
The results are shown in Fig. 14, comparing SDFGs both with general-purpose compilers (green bars in the figure), and with pattern-matching and polyhedral compilers (blue bars; for a full list of tested flags, we refer to Appendix C). On the CPU, we see that for most kernels, the performance of unoptimized SDFGs is closer to that of the polyhedral compilers than to the general-purpose compilers. The cases where SDFGs are on the order of standard compilers are solvers (e.g., cholesky, lu) and simple linear algebra (e.g., gemm). In both cases, data-centric transformations are necessary to optimize the computations, which we exclude from this test in favor of demonstrating SDFG expressibility.
On the GPU, in most cases, SDFGs generate code that outperforms PPCG, a tool specifically designed to transform polyhedral applications to GPUs. In particular, the bicg GPU SDFG is 11.8× faster than PPCG. We attribute these speedups to the inherent parallel construction of the data-centric representation, as well as avoiding unnecessary array copies due to explicit data dependencies.
6.3. Case Study II: Graph Algorithms
For our second case study, we implement an irregular computation problem on multi-core CPUs: Breadth-First Search (BFS). BFS traverses a graph from a given source node, outputting the number of edges traversed from the source node to each destination node. We use the data-driven push-based algorithm, and test 5 graphs with different sizes and characteristics (e.g., diameter), as shown in Table 2.
| Name | Nodes | Edges | Avg. Degree | Max Degree | Size (GB) |
|---|---|---|---|---|---|
| USA (dim, 2010) | 24M | 58M | 2.41 | 9 | 0.62 |
| OSM-eur-k (kar, 2014) | 174M | 348M | 2.00 | 15 | 3.90 |
| soc-LiveJournal1 (Davis and Hu, 2011) | 5M | 69M | 14.23 | 20,293 | 0.56 |
| twitter (Cha et al., 2010) | 51M | 1,963M | 38.37 | 779,958 | 16.00 |
| kron21.sym (Bader et al., 2013) | 2M | 182M | 86.82 | 213,904 | 1.40 |
Due to the irregular nature of the algorithm, BFS is not a trivial problem to optimize. However, SDFGs inherently support constructing the algorithm using streams and data-dependent map ranges (see Appendix A). The primary state of the optimized SDFG is shown in Fig. 15, which contains only 14 nodes (excluding input/output data).
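For readers unfamiliar with the algorithm, a sequential sketch of data-driven push-based BFS over a CSR graph is given below. It is a plain reference version with hypothetical names, not the SDFG of Fig. 15: the frontier list plays the role of the stream, and the inner loop bounds correspond to a data-dependent map range.

```python
def bfs_push(row_ptr, col_idx, source):
    """Return the number of edges from `source` to every node (-1 if unreachable)."""
    depth = [-1] * (len(row_ptr) - 1)
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:                      # data-driven: only active nodes work
        level += 1
        next_frontier = []
        for u in frontier:
            # Data-dependent range: the neighbors of u in the CSR arrays.
            for k in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[k]
                if depth[v] == -1:       # push: u updates its unvisited neighbor
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


# Edges 0->1, 0->2, 1->3, 2->3; node 4 is isolated.
depths = bfs_push([0, 2, 3, 4, 4, 4], [1, 2, 3, 3], source=0)
```

A parallel version must additionally resolve concurrent updates to `depth[v]` and concurrent pushes to the next frontier, which is where streams become useful.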
We compare our results with two state-of-the-art CPU graph processing frameworks: Galois (Nguyen et al., 2013) and Gluon (Dathathri et al., 2018). We use the default settings (bfs_push for Gluon, SyncTile for Galois) and use 12 threads (1 thread per core).
In Fig. 16, we see that SDFGs perform on-par with the frameworks on all graphs, where Galois is marginally faster on social networks (1.53× on twitter) and the Kronecker graph. However, in road maps (usa, osm-eur) SDFGs are up to 2× faster than Galois. This result could stem from our fine-grained scheduling imposed by the map scopes.
7. Related Work
With respect to the current hardware architecture landscape, there have been several recent works that locally address some of the issues posed in this paper. We discuss those papers below, and summarize them in Fig. 17.
Separation of Concerns
Multiple frameworks explicitly separate the computational algorithm from subsequent optimization schemes. In CHiLL (Chen et al., 2008), a user may write high-level transformation scripts for existing code, describing sequences of loop transformations. These scripts are then combined with C/C++, FORTRAN or CUDA (Rudy et al., 2011) programs to produce optimized code using the polyhedral model. Image processing pipelines written in the Halide (Ragan-Kelley et al., 2013) embedded DSL are defined as operations, whose schedule is separately generated in the code by invoking commands such as tile, vectorize, and parallel. Tiramisu (Baghdadi et al., 2018) operates in a similar manner, enabling loop and data manipulation. In SPIRAL (Franchetti et al., 2018), high-level specifications of computational kernels are written in a DSL, followed by using breakdown and rewrite rules to lower them to optimized algorithmic implementations. SDFGs, along with DIODE and the data-centric transformation workflow, streamline such approaches and promote an interactive solution that enables knowledge transfer of optimization techniques across applications via custom, visual transformations.
The LLVM IR (Lattner and Adve, 2004) is a control-flow graph composed of basic blocks of statements. Each block is given in Static Single Assignment form and can be transformed into a dataflow DAG. HPVM (Kotsifakou et al., 2018) extends the LLVM IR by introducing hierarchical dataflow graphs for mapping to accelerators. These IRs operate on a lower level representation than the SDFG, which keeps tasklets opaque and abstracts parallelism via parametric graphs. Other parametric graph representations include Dryad (Isard et al., 2007), which is intended for coarse-grained distributed data-parallel applications. However, Dryad dataflow is acyclic, and thus it was improved with Naiad (Murray et al., 2013), which enables the definition of loops in a nested context. As SDFGs provide general-purpose state machines in the outer graph, all Dryad and Naiad programs can be fully represented within our model, which also encapsulates fine-grained data dependencies.
Several representations provide a fixed set of data-centric transformations. Halide’s schedules are by definition data-centric, and the same applies to polyhedral loop transformations in CHiLL. HPVM also applies a number of optimization passes on a higher level than LLVM, such as tiling, node fusion, and mapping of data to GPU constant memory. Lift (Steuwer et al., 2015) programs are written in a high-level functional language with predefined primitives (e.g., map, reduce, split), while a set of rewrite rules is used to optimize the expressions and map them to OpenCL constructs. Loop transformations and the other aforementioned coarse-grained optimizations are all contained within our class of data-centric graph-based transformations, which can express arbitrary data movement patterns.
Decoupling Data Access and Computation
In the Sequoia (Fatahalian et al., 2006) programming model, all communication is made explicit by having tasks exchange data only through argument passing and calling subtasks. Chapel (Chamberlain et al., 2007) has built-in features for controlling and reasoning about locality. MAPS (Ben-Nun et al., 2015) separates data accesses from computation by coupling data with their memory access patterns. The Legion (Bauer et al., 2012) programming model and runtime system organizes information about locality and dependencies of data in logical regions. This category also includes all frameworks that implement the polyhedral model, including CHiLL, PENCIL (Baghdadi et al., 2015), Pluto (Bondhugula et al., 2008), Polly (Grosser et al., 2012) and Tiramisu. Furthermore, this concept can be found in models for graph analytics (Nguyen et al., 2013; Dathathri et al., 2018), stream processing (Becker et al., 2016; Thies et al., 2002), machine learning (Abadi et al., 2016), as well as many libraries (Edwards et al., 2014; Group, 2018b; An et al., 2003). Such models enable automatic optimization by reasoning about accessed regions. However, in this traditional approach, it is assumed that the middleware will carry most of the burden of optimization, and thus frameworks are tuned for existing memory hierarchies and architectures. This does not suffice for fine-tuning kernels, nor is it portable to new architectures. In these cases, a performance engineer typically resorts to a full re-implementation of the algorithm, as opposed to the workflow proposed here, where SDFG transformations can be customized or extended.
Multi-Target Intermediate Representations
The concept of a single representation for different architectures has been the driving motivation for compiler IRs such as LLVM, and parallel programming libraries like OpenACC, OpenCL (Group, 2018a) and OpenMP (Dagum and Menon, 1998). OpenCL and frameworks such as CHiLL, Lift, Halide, HPVM, SPIRAL, and Tiramisu, support imperative and massively parallel architectures (CPUs, GPUs). For hardware architectures, the trend is to extend existing programming paradigms after the fact, such as OpenACC (Lee et al., 2016) and OpenCL (Czajkowski et al., 2012), to generate FPGA code. The Halide and Tiramisu frameworks have also been extended by the FROST (Sozzo et al., 2018) back-end to target FPGA kernels. In contrast, SDFGs are designed to natively support multiple targets, spanning both load/store architectures and reconfigurable hardware.
In essence, SDFGs provide the expressiveness of a general-purpose programming language, while enabling productive high-performance optimization capabilities that do not interfere with the original scientific code. Unlike previous works, the SDFG is not limited to specific application classes or hardware architectures, and the extensible data-centric transformations generalize existing code optimization approaches.
8. Conclusion
In this paper, we present a novel data-centric intermediate representation for producing high-performance computing applications. Leveraging dataflow tracking and graph rewriting, we enable the role of a performance engineer: a developer who is well-versed in program optimization, but does not need to comprehend the underlying domain-specific mathematics. We show that by performing transformations on an SDFG alone, i.e., without modifying the given computational tasklets, it is possible to achieve performance comparable to the state-of-the-art on three fundamentally different platforms.
The IR proposed in this paper can be extended in several directions. Given a collection of transformations, research may be conducted into their systematic application, enabling automatic optimization with reduced human intervention. Another direction is to study the application of SDFGs to distributed systems, such as clusters, in which data movement minimization is significantly more important.
Acknowledgements. We thank Hussein Harake, Colin McMurtrie, and the whole CSCS team for granting access to the Greina and Daint machines, and for their excellent technical support. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 programme (grant agreement DAPP, No. 678880). T.B.N. is supported by the ETH Zurich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.
References
- dim (2010) 2010. 9th DIMACS Implementation Challenge. http://www.dis.uniroma1.it/challenge9/download.shtml
- kar (2014) 2014. Karlsruhe Institute of Technology, OSM Europe Graph. http://i11www.iti.uni-karlsruhe.de/resources/roadgraphs.php
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, Berkeley, CA, USA, 265–283.
- Adinetz (2014) Andrew Adinetz. 2014. Optimized Filtering with Warp-Aggregated Atomics. (2014). http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics
- An et al. (2003) Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato, and Lawrence Rauchwerger. 2003. STAPL: An Adaptive, Generic Parallel C++ Library. Springer Berlin Heidelberg, Berlin, Heidelberg, 193–208.
- Bader et al. (2013) David A. Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner (Eds.). 2013. Graph Partitioning and Graph Clustering, 10th DIMACS Implementation Challenge Workshop, Georgia Institute of Technology, Atlanta, GA, USA, February 13-14, 2012. Proceedings. Contemporary Mathematics, Vol. 588. American Mathematical Society.
- Baghdadi et al. (2015) Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alastair F. Donaldson, Jeroen Ketema, Javed Absar, Sven van Haastregt, Alexey Kravets, Anton Lokhmotov, Robert David, and Elnar Hajiyev. 2015. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (PACT ’15). IEEE Computer Society, Washington, DC, USA, 138–149. https://doi.org/10.1109/PACT.2015.17
- Baghdadi et al. (2018) Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Patricia Suriana, Shoaib Kamil, and Saman P. Amarasinghe. 2018. Tiramisu: A Code Optimization Framework for High Performance Systems. CoRR abs/1804.10694 (2018). arXiv:1804.10694 http://arxiv.org/abs/1804.10694
- Bauer et al. (2012) Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 66, 11 pages.
- Becker et al. (2016) Tobias Becker, Oskar Mencer, and Georgi Gaydadjiev. 2016. Spatial Programming with OpenSPL. Springer International Publishing, Cham, 81–95.
- Ben-Nun et al. (2015) Tal Ben-Nun, Ely Levy, Amnon Barak, and Eri Rubin. 2015. Memory Access Patterns: The Missing Piece of the Multi-GPU Puzzle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15). ACM, Article 19, 12 pages.
- Bondhugula et al. (2008) Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Program Optimization System. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
- Bruni and Montanari (2017) Roberto Bruni and Ugo Montanari. 2017. Operational Semantics of IMP. Springer International Publishing, Cham, 53–76. https://doi.org/10.1007/978-3-319-42900-7_3
- Cha et al. (2010) Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and P Krishna Gummadi. 2010. Measuring User Influence in Twitter: The Million Follower Fallacy. ICWSM 10, 10-17 (2010), 30.
- Chamberlain et al. (2007) B.L. Chamberlain, D. Callahan, and H.P. Zima. 2007. Parallel Programmability and the Chapel Language. The International Journal of High Performance Computing Applications 21, 3 (2007), 291–312. https://doi.org/10.1177/1094342007078442
- Chen et al. (2008) Chun Chen, Jacqueline Chame, and Mary Hall. 2008. CHiLL: A framework for composing high-level loop transformations. Technical Report. University of Southern California.
- Cordella et al. (2004) Luigi P. Cordella, Pasquale Foggia, Carlo Sansone, and Mario Vento. 2004. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (Oct 2004), 1367–1372. https://doi.org/10.1109/TPAMI.2004.75
- CUB (2018) CUB 2018. CUB Library Documentation. http://nvlabs.github.io/cub/.
- Czajkowski et al. (2012) T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh. 2012. From OpenCL to high-performance hardware on FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL). 531–534. https://doi.org/10.1109/FPL.2012.6339272
- Dagum and Menon (1998) Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng. 5, 1 (Jan. 1998), 46–55. https://doi.org/10.1109/99.660313
- Dathathri et al. (2018) Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. 2018. Gluon: A Communication-optimizing Substrate for Distributed Heterogeneous Graph Analytics. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 752–768. https://doi.org/10.1145/3192366.3192404
- Davis and Hu (2011) Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1, Article 1 (2011), 25 pages.
- de Fine Licht et al. (2018) Johannes de Fine Licht, Simon Meierhans, and Torsten Hoefler. 2018. Transformations of High-Level Synthesis Codes for High-Performance Computing. arXiv preprint arXiv:1805.08288 (2018).
- Developers (2018) NumPy Developers. 2018. NumPy Scientific Computing Package. http://www.numpy.org
- Edwards et al. (2014) H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J. Parallel and Distrib. Comput. 74, 12 (2014), 3202 – 3216. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
- Ehrig et al. (2006) H. Ehrig, K. Ehrig, U. Prange, and G. Taentzer. 2006. Fundamentals of Algebraic Graph Transformation (Monographs in Theoretical Computer Science. An EATCS Series). Springer-Verlag, Berlin, Heidelberg.
- Fatahalian et al. (2006) Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC ’06). ACM, New York, NY, USA, Article 83. https://doi.org/10.1145/1188455.1188543
- Foundation (2018) Python Software Foundation. 2018. The Python Programming Language. https://www.python.org
- Franchetti et al. (2018) Franz Franchetti, Tze-Meng Low, Thom Popovici, Richard Veras, Daniele G. Spampinato, Jeremy Johnson, Markus Püschel, James C. Hoe, and José M. F. Moura. 2018. SPIRAL: Extreme Performance Portability. Proceedings of the IEEE, special issue on “From High Level Specification to High Performance Code” 106, 11 (2018).
- Goto and Geijn (2008) Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-performance Matrix Multiplication. ACM Trans. Math. Softw. 34, 3, Article 12 (May 2008), 25 pages. https://doi.org/10.1145/1356052.1356053
- Grosser et al. (2012) Tobias Grosser, Armin Groesslinger, and Christian Lengauer. 2012. Polly — Performing Polyhedral Optimizations on a Low-Level Intermediate Representation. Parallel Processing Letters 22, 04 (2012), 1250010. https://doi.org/10.1142/S0129626412500107
- Group (2018a) Khronos Group. 2018a. OpenCL. https://www.khronos.org/opencl
- Group (2018b) Khronos Group. 2018b. OpenVX. https://www.khronos.org/openvx
- Intel (2018) Intel. 2018. Math Kernel Library (MKL). https://software.intel.com/en-us/mkl
- Isard et al. (2007) Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys ’07). ACM, New York, NY, USA, 59–72. https://doi.org/10.1145/1272996.1273005
- Kaiser et al. (2014) Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. 2014. HPX: A Task Based Programming Model in a Global Address Space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS ’14). ACM, New York, NY, USA, Article 6, 11 pages. https://doi.org/10.1145/2676870.2676883
- Koeplinger et al. (2018) David Koeplinger, Matthew Feldman, Raghu Prabhakar, Yaqi Zhang, Stefan Hadjis, Ruben Fiszel, Tian Zhao, Luigi Nardi, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2018. Spatial: A Language and Compiler for Application Accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, NY, USA, 296–311. https://doi.org/10.1145/3192366.3192379
- Kotsifakou et al. (2018) Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous Parallel Virtual Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18). ACM, New York, NY, USA, 68–80. https://doi.org/10.1145/3178487.3178493
- Lattner and Adve (2004) Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In CGO. San Jose, CA, USA, 75–88.
- Lee et al. (2016) S. Lee, J. Kim, and J. S. Vetter. 2016. OpenACC to FPGA: A Framework for Directive-Based High-Performance Reconfigurable Computing. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 544–554. https://doi.org/10.1109/IPDPS.2016.28
- Löwe (1993) Michael Löwe. 1993. Algebraic Approach to Single-pushout Graph Transformation. Theor. Comput. Sci. 109, 1-2 (March 1993), 181–224. https://doi.org/10.1016/0304-3975(93)90068-5
- Murray et al. (2013) Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA, 439–455. https://doi.org/10.1145/2517349.2522738
- Nguyen et al. (2013) Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A Lightweight Infrastructure for Graph Analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA, 456–471. https://doi.org/10.1145/2517349.2522739
- NVIDIA (2018a) NVIDIA. 2018a. CUBLAS Library Documentation. http://docs.nvidia.com/cuda/cublas.
- NVIDIA (2018b) NVIDIA. 2018b. CUSPARSE Library Documentation. http://docs.nvidia.com/cuda/cusparse.
- Pouchet (2016) L. N. Pouchet. 2016. PolyBench: The Polyhedral Benchmark suite. https://sourceforge.net/projects/polybench
- Ragan-Kelley et al. (2013) Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). ACM, New York, NY, USA, 519–530. https://doi.org/10.1145/2491956.2462176
- Rudy et al. (2011) Gabe Rudy, Malik Murtaza Khan, Mary Hall, Chun Chen, and Jacqueline Chame. 2011. A Programming Language Interface to Describe Transformations and Code Generation. In Languages and Compilers for Parallel Computing, Keith Cooper, John Mellor-Crummey, and Vivek Sarkar (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 136–150.
- Sozzo et al. (2018) Emanuele Del Sozzo, Riyadh Baghdadi, Saman P. Amarasinghe, and Marco D. Santambrogio. 2018. A Unified Backend for Targeting FPGAs from DSLs. In 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 1–8.
- Steuwer et al. (2015) Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. 2015. Generating Performance Portable Code Using Rewrite Rules: From High-level Functional Expressions to High-performance OpenCL Code. SIGPLAN Not. 50, 9 (Aug. 2015), 205–217. https://doi.org/10.1145/2858949.2784754
- Team (2018) SymPy Development Team. 2018. SymPy Symbolic Math Library. http://www.sympy.org
- Thies et al. (2002) William Thies, Michal Karczmarek, and Saman P. Amarasinghe. 2002. StreamIt: A Language for Streaming Applications. In Proceedings of the 11th International Conference on Compiler Construction (CC ’02). Springer-Verlag, London, UK, 179–196.
- Unat et al. (2017) D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericás. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems 28, 10 (Oct 2017), 3007–3020. https://doi.org/10.1109/TPDS.2017.2703149
- Verdoolaege et al. (2013) Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23. https://doi.org/10.1145/2400682.2400713
- Xilinx (2018a) Xilinx. 2018a. SDAccel. https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
- Xilinx (2018b) Xilinx. 2018b. Vivado HLS. https://www.xilinx.com/products/design-tools/vivado
- Zhou and Demsky (2010) Jin Zhou and Brian Demsky. 2010. Bamboo: A Data-centric, Object-oriented Approach to Many-core Software. SIGPLAN Not. 45, 6 (June 2010), 388–399. https://doi.org/10.1145/1809028.1806640
Appendix A SDFG Operational Semantics
We begin by defining the different elements of our IR, which is a graph of graphs. We follow by defining how a function expressed in our IR must be called and then give semantic rules of how an SDFG is evaluated. We precede each operational semantic rule by a less formal description of the intent of the rule.
A.1. Elements of an SDFG
An SDFG is a directed multigraph whose vertices represent states and whose edges represent transitions between them. It has a name and one start state. It is a multigraph because there can be multiple transitions between a pair of states.
An SDFG state is a named, directed acyclic multigraph. Each node is of one of the following types, as shown in Table 1 of the paper: data, tasklet, mapentry, mapexit, stream, reduce, consume-entry, consume-exit, invoke. Each node type defines connectors, which are attachment points for edges that define the nature of the connection, and the edges indicate dependencies that constrain the execution order. Each edge carries a memlet, an annotation that specifies dataflow, as well as the connectors on its source and destination nodes. Below, we describe the node types in SDFG states and the exact definition of memlets.
As opposed to inherently sequential representations (cf. (Bruni and Montanari, 2017, Rule 3.15)), in SDFGs the execution order is mainly constrained by explicit dependencies.
To parameterize some aspects of the graph, there exists a global namespace for symbols. Symbols can be used in all symbolic expressions mentioned below, and at runtime they evaluate to scalar values (§ A.2).
A data node represents a typed location in memory. The memory itself can be allocated outside of the SDFG and passed as a pointer upon invocation, or allocated at runtime if transient. A data node is a tuple (name, basetype, dimensions, transient). The name is an identifier, the basetype a basic type (e.g., int, float, double), and dimensions a list of symbolic integer expressions. Memory pointed to by differently named data nodes must not alias. A data node has an implicit connector for each adjacent edge.
A tasklet node represents computation. It is defined by the tuple (inputs, outputs, code), where inputs and outputs are sets of elements of the form (name, basetype, dimensions) that define connectors with the same name, representing the only external memory the code can read or write. Code is a string, which can be written in Python or languages supported by the target architecture, such as C++.
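As a rough illustration of the tasklet definition, the following Python sketch runs a code string that can only see its connectors. The `Tasklet` class and the `run_tasklet` helper are hypothetical names chosen for this sketch and are not taken from the implementation; connector types and dimensions are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Tasklet:
    """Hypothetical stand-in for a tasklet node: (inputs, outputs, code)."""
    inputs: set    # names of input connectors
    outputs: set   # names of output connectors
    code: str      # Python source that reads inputs and assigns outputs


def run_tasklet(tasklet, visible):
    """Execute the tasklet code with only its connectors in scope.

    `visible` maps input connector names to the data visible at them.
    Returns a dict mapping each output connector name to its new value.
    """
    # One local variable per input connector.
    local_vars = {name: visible[name] for name in tasklet.inputs}
    # Run the code without builtins, so connectors are the only names it sees.
    exec(tasklet.code, {"__builtins__": {}}, local_vars)
    # Collect the variables named after the output connectors.
    return {name: local_vars[name] for name in tasklet.outputs}


mult = Tasklet(inputs={"a", "b"}, outputs={"c"}, code="c = a * b")
result = run_tasklet(mult, {"a": 3, "b": 4})
```

Restricting the namespace in this way mirrors the requirement that connectors represent the only external memory the code can read or write.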
A stream node represents an array of typed locations in memory, accessed using queue semantics, i.e., using push and pop operations. A stream has the input connectors data and push, as well as the output connectors data and pop. The data connectors allow initialization/copying of the data stored in the queues from/to an array, whereas the push and pop connectors enqueue and dequeue elements, respectively.
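The connectors of a stream node can be pictured with a small Python model of a single queue. This is only a sketch: the `Stream` class and its method names are hypothetical, first-in-first-out ordering is assumed, and a stream node as defined above may hold an array of such queues.

```python
from collections import deque


class Stream:
    """Hypothetical model of a single stream (one queue)."""

    def __init__(self, initial=()):
        # `data` input connector: initialize the queue from an array.
        self._queue = deque(initial)

    def push(self, element):
        """`push` connector: enqueue one element."""
        self._queue.append(element)

    def can_pop(self):
        """Assumption: a pop is only valid while the queue is non-empty."""
        return len(self._queue) > 0

    def pop(self):
        """`pop` connector: dequeue the oldest element."""
        if not self.can_pop():
            raise RuntimeError("pop on an empty stream")
        return self._queue.popleft()

    def data(self):
        """`data` output connector: copy the queue contents to a list."""
        return list(self._queue)


s = Stream([1, 2])
s.push(3)
first = s.pop()
```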
The memlet annotation represents dataflow, defined by the tuple (src_conn, dst_conn, subset, reindex, accesses, wcr). The subset function selects which subset of elements visible at the source connector will flow to the destination connector. The reindex function specifies at which indices the data will be visible at the destination node. We express subset and reindex as functions on integer sets, and implement them as lists of exclusive ranges, where each range refers to one data dimension and is defined by start:end:stride:tilesize. tilesize is used to propagate multiple elements at a time, e.g., vector types. Fig. 18 extends Fig. 5b from the paper by showing the same memlets in subset/reindex notation rather than array index labels.
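To make the range notation concrete, the sketch below expands one start:end:stride:tilesize range into index tiles. The helper `expand_range` is hypothetical. It assumes that the end is exclusive, that omitted stride and tilesize fields default to 1, and that each stride position contributes a tile of tilesize consecutive indices clipped at the end. The last two points are assumptions of this sketch rather than statements from the text.

```python
def expand_range(spec):
    """Expand 'start:end[:stride[:tilesize]]' (end exclusive) into tiles.

    Assumption for this sketch: each stride position contributes a tile of
    `tilesize` consecutive indices, clipped at `end`.
    """
    parts = [int(p) for p in spec.split(":")]
    if len(parts) < 2:
        raise ValueError("a range needs at least start and end")
    # Assumed defaults: stride 1 and tile size 1 when fields are omitted.
    start, end, stride, tilesize = parts + [1] * (4 - len(parts))
    tiles = []
    for base in range(start, end, stride):
        tiles.append(list(range(base, min(base + tilesize, end))))
    return tiles


vector_tiles = expand_range("0:8:4:4")   # two tiles of four elements each
strided = expand_range("0:6:2")          # single elements at 0, 2, 4
```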
The optional wcr attribute (write-conflict resolution) is a function that combines two values of the same data type into one. The function receives the old value currently present at the destination and the new value, and produces the value that will be written. This attribute is normally defined when data flows concurrently from multiple sources, e.g., in a parallel dot product.
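The dot-product case can be sketched in a few lines of Python. The helper `write_with_wcr` is a hypothetical name used only here. It applies the conflict-resolution function to the old and new values, so the result does not depend on the order in which the products arrive, provided the function is associative and commutative.

```python
def write_with_wcr(memory, index, new_value, wcr=None):
    """Write `new_value` to memory[index], resolving conflicts with `wcr`."""
    if wcr is None:
        memory[index] = new_value                      # plain overwrite
    else:
        memory[index] = wcr(memory[index], new_value)  # (old, new) -> written


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
result = [0.0]
# The iterations are independent; reversed order stands in for concurrency.
for i in reversed(range(len(x))):
    write_with_wcr(result, 0, x[i] * y[i], wcr=lambda old, new: old + new)
```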
Map entry/exit nodes are defined by a tuple (range, inputs, outputs), creating a scope. The subgraph within the scope of a map is defined by the nodes that are dominated by the map-entry and post-dominated by the map-exit. This subgraph is expanded at runtime and executed concurrently according to the symbolic range of the mapentry, which takes the form identifier=begin:end:step. Input and output connectors of map nodes are either defined as IN_*/OUT_* (* is any string) to map memlets outside the scope to memlets within it; or other identifiers, which can be used to create data-dependent map ranges. The range identifier becomes a scope-wide symbol, and can be used to define memlet attributes (subset, reindex).
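The runtime expansion of a map scope can be sketched as follows. The helpers `parse_map_range` and `expand_map` are hypothetical, the scope is modeled as a Python callable that receives the scope-wide symbols, and the end of the range is assumed to be exclusive. Concurrency is modeled only by running the replicas in an arbitrary order.

```python
def parse_map_range(spec):
    """Parse 'identifier=begin:end:step' into (identifier, range)."""
    identifier, bounds = spec.split("=")
    begin, end, step = (int(b) for b in bounds.split(":"))
    # Assumption: `end` is exclusive.
    return identifier.strip(), range(begin, end, step)


def expand_map(spec, scope):
    """Replicate `scope` once per index, binding the identifier as a symbol."""
    identifier, indices = parse_map_range(spec)
    return [lambda s=scope, v=value: s({identifier: v}) for value in indices]


A = [1, 2, 3, 4]
B = [0] * 4


def scope(symbols):
    i = symbols["i"]      # the range identifier is a scope-wide symbol
    B[i] = 2 * A[i]       # memlet subsets A[i] and B[i] depend on it


replicas = expand_map("i=0:4:1", scope)
for replica in reversed(replicas):   # any execution order is valid
    replica()
```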
The invoke node calls an SDFG from within an SDFG state. Its semantics are equivalent to a tasklet that has an input connector for each data node and undefined symbol used in the invoked SDFG, and an output connector for each data node.
The reduce alias node is used to implement fast, target-dependent reduction procedures. It is defined by axes and a wcr function, consisting of an input and an output connector (of the same type). It is semantically equivalent to a map whose range spans the incoming and outgoing memlet data, containing an identity tasklet (output=input) and an output memlet annotated by the given wcr for the reduced axes.
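The stated equivalence can be illustrated for a two-dimensional input reduced along its second axis. The function `reduce_as_map` is a hypothetical sketch, and the `identity` argument used to initialize the output is an assumption, since the text does not say how the output is initialized.

```python
def reduce_as_map(data, wcr, identity):
    """Reduce axis 1 of a 2D list using a map with a wcr output memlet."""
    rows, cols = len(data), len(data[0])
    out = [identity] * rows            # assumed initial output values
    for i in range(rows):              # map range spans the whole input
        for j in range(cols):
            value = data[i][j]         # identity tasklet: output = input
            out[i] = wcr(out[i], value)  # wcr resolves writes along axis 1
    return out


matrix = [[1, 2, 3], [4, 5, 6]]
row_sums = reduce_as_map(matrix, lambda old, new: old + new, 0)
row_max = reduce_as_map(matrix, max, float("-inf"))
```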
Consume entry/exit nodes complement maps, enabling producer/consumer relationships via dynamic parallel processing of streams. The two nodes form a scope, similarly to a map, but also accept a stream_in connector (connected to a stream), and a quiescence condition after which processing stops. Thus, a consume entry node is a tuple (range, cond, inputs, outputs). Semantically, a consume scope is equivalent to a map using the same range, with an internal invoked SDFG whose single state contains the scope. This state connects back to itself without any assignments, with a transition condition defined by cond.
A state transition is defined as a tuple (source, destination, condition, assignments). The condition can depend on symbols and data from data nodes in the source state, whereas assignments take the form symbol = expression. Once a state finishes execution, all outgoing state transitions of that state are evaluated in an arbitrary order, and the destination of the first transition whose condition is true is the next state to be executed. If no transition evaluates to true, the program terminates. Before starting the execution of the next state, all assignments are performed, and the left-hand sides of the assignments become symbols.
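A minimal interpreter for the state machine might look as follows. This is a sketch under several assumptions: `run_state_machine` is a hypothetical name, states are modeled as Python callables, transitions are tried in list order (one valid choice of the arbitrary order), and all right-hand sides are evaluated before any symbol is updated.

```python
def run_state_machine(states, transitions, start, symbols):
    """Execute states until no outgoing transition condition holds.

    states:      dict name -> callable(symbols), the dataflow of the state
    transitions: list of (source, destination, condition, assignments);
                 condition is callable(symbols) -> bool and assignments
                 maps symbol names to callable(symbols)
    """
    current = start
    trace = []
    while current is not None:
        states[current](symbols)
        trace.append(current)
        next_state = None
        for source, destination, condition, assignments in transitions:
            if source == current and condition(symbols):
                # Assumption: evaluate every right-hand side, then assign.
                new_values = {k: rhs(symbols) for k, rhs in assignments.items()}
                symbols.update(new_values)
                next_state = destination
                break
        current = next_state   # None means no condition was true: terminate
    return trace


data = {"sum": 0}
states = {
    "init": lambda sym: None,
    "guard": lambda sym: None,
    "body": lambda sym: data.update(sum=data["sum"] + sym["i"]),
    "exit": lambda sym: None,
}
transitions = [
    ("init", "guard", lambda sym: True, {"i": lambda sym: 0}),
    ("guard", "body", lambda sym: sym["i"] < 3, {}),
    ("guard", "exit", lambda sym: sym["i"] >= 3, {}),
    ("body", "guard", lambda sym: True, {"i": lambda sym: sym["i"] + 1}),
]
symbols = {}
trace = run_state_machine(states, transitions, "init", symbols)
```

The example encodes a loop over i = 0, 1, 2 with a guard state, which shows how assignments on transitions introduce and update symbols.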
A.2. Operational Semantics
We denote collections (sets/lists) as capital letters and their members with the corresponding lowercase letter and a subscript, i.e., in an SDFG the set of states as , with . Without loss of generality we assume to be the start state. We denote the value stored at memory location as , and assume all basic types are size-one elements to simplify address calculations.
The state of execution is denoted by . Within the state we carry several sets: , which maps names of data nodes and transients to memory addresses; , which maps symbol names (identifiers) to their current value; and , which maps connectors to the data visible at that connector in the current state of execution.
We define a helper function , which returns the product of all dimensions of the data node or element given as argument (using to resolve symbolic values). Furthermore, returns the name property of a data or transient node, and offs() the offset of a data element relative to the start of the memory region it is stored in. The function creates a copy of the object given as argument, i.e., when we modify the copy, the original object remains the same.
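The helper that returns the product of all dimensions can be sketched as below. The names `DataNode` and `size` are chosen for this sketch only. Dimensions are restricted to integers or single symbol names, whereas the text allows general symbolic integer expressions.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class DataNode:
    """Hypothetical data node: (name, basetype, dimensions, transient)."""
    name: str
    basetype: str
    dimensions: list        # each entry is an int or a symbol name
    transient: bool = False


def size(node, symbols):
    """Product of all dimensions, resolving symbol names via `symbols`."""
    return prod(symbols[d] if isinstance(d, str) else d
                for d in node.dimensions)


A = DataNode("A", "double", ["N", 4])
num_elements = size(A, {"N": 8})
```

Because all basic types are assumed to be size-one elements, this element count is also the extent of the memory region occupied by the node.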
When an SDFG is called with the data arguments ( is an identifier, is an address/pointer) and symbol arguments ( is an identifier, an integer) we initialize the configuration :
For all symbols in : .
For all data and stream nodes without incoming edges s.t. : .
Set to a copy of the start state of , .
Set to .
Set to zero for all stream nodes .
This can be expressed as the following rule:
A.2.2. Propagating Data in a State
Execution of a state entails propagating data along edges, governed by the rules defined below.
Data is propagated within a state in an arbitrary order, constrained only by dependencies expressed by edges. We begin by considering all nodes in the current state, which we add to current. We gradually remove nodes as they are processed, until current is empty and we can proceed to the next state or termination.
To keep the description concise, we omit details of address translation and indexing. We also use the subset and reindex properties as functions that resolve symbols and return data elements directly.
In each step, we take one element (either a memlet or a node) of current for which all input connectors have visible data, then:
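If the element is a memlet (src, dst, subset, reindex, wcr), update the data visible at dst: select the elements given by subset from the data visible at src, place them at the indices given by reindex, and, if wcr is defined, combine each new value with the old value at the destination using wcr.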
If the element is a data node, update its referenced memory with the data visible at each of its input connectors.
If the element is a tasklet, generate a prologue that allocates local variables for all input connectors of the tasklet, initialized to the data visible at each connector, as well as for its output connectors. Generate an epilogue that updates the visible data for each output connector of the tasklet with the contents of the appropriate variable (declared in the prologue). Execute the concatenation of the prologue, the tasklet code, and the epilogue.
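If the element is a mapentry node with a range of the form identifier=begin:end:step and a scope: Remove the scope from current. Remove the mapentry node and the corresponding map exit node from the state. For each value in the range, replicate the scope, resolve any occurrence of the identifier to that value, and connect the replica to the edges that entered the map entry and left the map exit.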
If the element is a consume-entry node, defined by (range, cond, cin, cout), replace it with a mapentry and do the same for the corresponding consume exit node. Then we create a new SDFG, which contains the contents of the original consume scope. The new SDFG consists of one state and a single state transition to the same state, with a condition defined by cond. Finally, we replace the scope in current with an invoke node for the new SDFG and reconnect the appropriate edges between the entry and exit nodes.
If the element is a reduce node defined by the tuple (cin, cout, range), we create a mapentry node with the same range, a mapexit node, and a tasklet o = i. We add these nodes to the node set of current, nd(current). We connect them by adding edges to the edge set of current.
If the element is a stream, we add the data visible at the push connector to the appropriate memory location, indicated by the stream's element counter, and increment that counter.
A stream node with an edge connected to its pop connector can only be evaluated if the stream holds at least one element. When propagating through streams, there are three cases for one step:
❶ Data is both pushed into and popped from the stream:
❷ Data is only pushed to the stream node:
❸ Data is popped from the stream but not pushed into it:
Following the element processing step, we remove the element from current, repeating the above step until current is empty.
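The propagation loop described above can be sketched as a worklist that repeatedly picks any element whose dependencies have been processed. The function `propagate` and its arguments are hypothetical. Readiness is approximated here by a set of predecessor elements rather than by visible data on connectors.

```python
def propagate(elements, dependencies, process):
    """Process every element once, respecting dependencies.

    elements:     iterable of element names (nodes or memlets)
    dependencies: dict element -> set of elements that must finish first
    process:      callable invoked for each element once it is ready
    """
    current = set(elements)
    done = set()
    order = []
    while current:
        ready = [e for e in current if dependencies.get(e, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependencies: a state must be acyclic")
        element = sorted(ready)[0]   # any ready element is a valid choice
        process(element)
        current.remove(element)      # processed elements leave `current`
        done.add(element)
        order.append(element)
    return order


# Data node A -> memlet -> tasklet t -> memlet -> data node B
deps = {"A->t": {"A"}, "t": {"A->t"}, "t->B": {"t"}, "B": {"t->B"}}
log = []
order = propagate(["B", "t", "A", "t->B", "A->t"], deps, log.append)
```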
A.2.3. Evaluating State Transitions
Once current is empty, we evaluate all outgoing state transitions of the current state. For each transition, we resolve all symbols in its condition and in the right-hand sides of its assignments using the current symbol values, then evaluate arithmetic and boolean expressions using standard semantic rules, which we omit here. If no condition evaluates to true, we signal the completion of the SDFG to the caller and stop its evaluation.
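Otherwise, we choose an arbitrary transition whose condition evaluates to true and update the configuration: set the current state to the destination of the transition, and set current to a copy of that state. For each left-hand side of an assignment, update the symbol mapping with the value of the corresponding right-hand side. Data propagation then follows Section A.2.2.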
Appendix B Data-Centric Graph Transformations
The implementation of graph transformations is based on algebraic graph rewriting (Ehrig et al., 2006). Each transformation is implemented by a rule $p: L \xrightarrow{r} R$, which consists of the two sub-graphs $L$ and $R$, while $r$ is a relation that maps the vertices and edges of $L$ to elements of the same kind in $R$. Moreover, a specific matching of $L$ in the SDFG $G$ is represented by a relation $m: L \rightarrow G$. Applying the optimization on $G$ produces the transformed graph $H$, which can be constructed as part of the pushout of $p$, $m$, and $G$. A visualization of the above method, also known as the single-pushout approach (Löwe, 1993), is shown in Fig. 19.
|Name|Description|
|MapCollapse|Collapses two nested maps into one. The new map has the union of the dimensions of the original maps.|
|MapExpansion|Expands a multi-dimensional map to two nested ones. The dimensions are split to two disjoint subsets, one for each new map.|
|MapFusion|Fuses two consecutive maps that have the same dimensions and range.|
|MapInterchange|Interchanges the position of two nested maps.|
|MapReduceFusion|Fuses a map and a reduction node with the same dimensions, using conflict resolution.|
|MapTiling|Applies orthogonal tiling to a map.|
|DoubleBuffering|Pipelines writing to and processing from a transient using two buffers.|
|LocalStorage|Introduces a transient for caching data.|
|Vectorization|Alters the data accesses to use vectors.|
|MapToForLoop|Converts a map to a for-loop.|
|StateFusion|Fuses two states into one.|
|Hardware mapping transformations|
|FPGATransform|Converts a CPU SDFG to be fully invoked on an FPGA, copying memory to the device.|
|GPUTransform|Converts a CPU SDFG to run on a GPU, copying memory to it and executing kernels.|
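As an informal illustration of one entry in the table, the sketch below shows the idea behind MapTiling on a one-dimensional map: the tiled version visits the same iteration points, grouped into tiles. It operates on plain Python ranges and is not the graph-rewriting implementation. The function names and the clipping of the last tile are assumptions of this sketch.

```python
def original_map(n):
    """Iteration points of the untiled map i=0:n:1 (end exclusive)."""
    return list(range(n))


def tiled_map(n, tile):
    """Iteration points after tiling: an outer map over tile origins and
    an inner map over the points of each tile (clipped at n)."""
    points = []
    for t in range(0, n, tile):                # outer map: tile origins
        for i in range(t, min(t + tile, n)):   # inner map: points in tile
            points.append(i)
    return points


same_points = sorted(tiled_map(10, 4)) == original_map(10)
```

A transformation of this kind changes only the schedule of the iteration space, which is why it can be applied without modifying the computation inside the map scope.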
Appendix C Polybench Compiler Flags
We compile Polybench code for all compilers using the following base flags: -O3 -march=native -mtune=native. Each compiler was also tested with different variants of flags, where we report the best-performing result in the paper. The flags are (variants separated by semicolons):
gcc 8.2.0: -O2; -O3
clang 6.0: Base
icc 18.0.3: Base; -mkl=parallel -parallel
Polly (over clang 6.0): -mllvm -polly; -mllvm -polly -mllvm -polly-parallel -lgomp
Pluto 0.11.4: -ftree-vectorize -fopenmp, and the best among --tile; --tile --parallel; --tile --parallel --partlbtile; --tile --parallel --lbtile --multipar
PPCG 0.8: Base
All variants were also tested with compile-time size specialization.