AdaptMemBench: Application-Specific Memory Subsystem Benchmarking

12/19/2018
by   Mahesh Lakshminarasimhan, et al.
Boise State University

Optimizing scientific applications to take full advantage of modern memory subsystems is a continual challenge for application and compiler developers. Factors beyond working set size affect performance. A benchmark framework that explores the performance in an application-specific manner is essential to characterize memory performance and at the same time inform memory-efficient coding practices. We present AdaptMemBench, a configurable benchmark framework that measures achieved memory performance by emulating application-specific access patterns with a set of kernel-independent driver templates. This framework can explore the performance characteristics of a wide range of access patterns and can be used as a testbed for potential optimizations due to the flexibility of polyhedral code generation. We demonstrate the effectiveness of AdaptMemBench with case studies on commonly used computational kernels such as triad and multidimensional stencil patterns.

I Introduction

Scientific application performance is a function of memory bandwidth, instruction mix and order, memory footprint, and memory access patterns. The contribution of each is often unclear, and the variables are interdependent. This complexity, combined with the difficulty of instrumenting large applications, makes efficient optimization of these applications difficult. AdaptMemBench provides a framework for application developers and optimization experts to isolate portions of their application and measure execution characteristics. The framework provides a starting point to identify performance bottlenecks, identify potential optimizations, and explore the potential gains of those optimizations.

Application performance is often bottlenecked by interaction with the memory subsystem due to the memory wall [27]. Modern architectures combat this by using deep memory hierarchies and physically fragmented system memory. Reducing working set sizes is considered a good first step in optimization to take advantage of the caching capability of machines. However, optimizing is more complex than that, especially when dealing with shared memory parallelization. Memory access patterns, instruction mix, data sharing across caches, and vectorizability must all be considered in concert.

Selecting and applying optimizations remains a primary challenge during performance enhancements. Testing and understanding optimizations in situ when working with a large application can be cumbersome and error prone. Given the difficulty of manipulating access patterns in situ, fewer optimization strategies are attempted and potential performance improvements are overlooked. Additionally, performance tools such as hardware counters remain difficult to use in the context of a large application. The combination of these factors discourages effective optimizations.

A framework that allows extracted code to be isolated and measured will benefit the optimization process for specific projects, and will improve the reliability and reproducibility of performance experiments in the compiler optimization and programming construct research communities. During the exploration and experimentation phase, many different variants of the same code are produced. Tracking the differences between variants and maintaining correct execution becomes time consuming and challenging. A shared framework that supports experimentation and tracks code versions while outputting metadata with measurements will ease this challenge.

We propose a tool to explore the design landscape of the target architecture. The AdaptMemBench framework can be used to measure system performance and to guide application-specific optimization decisions. Expensive kernels extracted from larger applications can be manipulated in isolation to find the best optimization strategies. The framework reduces the amount of code that is transferred and provides mechanisms to experiment with data storage layout, execution order, and parallelization strategies.

AdaptMemBench provides several execution templates. The templates are combined with user provided code segments. The templates provide a common command line interface, handle all timing and hardware counter code, and output metadata and measurements in a common format. The code segments provided by the user can be expressed as C code or by using the polyhedral model. The latter provides a convenient mechanism for optimization experiments.

Several benchmarks [14, 10, 19, 18, 20, 2, 13] exist that measure machine performance, with the benchmarking results conveying essential information about the application performance on the memory hierarchy of the machine. Existing memory benchmarks [14, 19, 18] measure performance using a limited collection of streaming access patterns. However, benchmarking application-specific patterns that tend to be more complex remains a challenge. Current benchmarks [15, 10] are further constrained by the data sizes which can be executed, specifically in the higher levels of the memory subsystem.

AdaptMemBench differs from previous efforts by incorporating polyhedral code generation. This creates a configurable benchmarking framework that measures achieved memory bandwidth while mimicking application-specific memory access patterns. The Polyhedral model [25] simplifies writing the initial benchmark and provides a mechanism to automatically transform the code. Furthermore, our benchmark supports parallel applications and systems, and measures memory performance for data sizes across all levels of the memory hierarchy.

The primary contribution of this paper is a description and various demonstrations of the AdaptMemBench framework. Additionally, the framework was used to explore the performance of our university’s HPC cluster. The contributions of this paper include the following:

  • A configurable benchmarking framework for application-specific memory performance characterization.

  • A detailed performance study on common computational kernels found in scientific applications for the impact of implicit locks, shared data spaces and false sharing.

  • An interleaved optimization strategy and demonstrated effectiveness for the triad pattern.

  • An evaluation of the efficacy of spatial tiling strategies for multidimensional Jacobi patterns using AdaptMemBench.

Fig. 1: The proposed framework.

II AdaptMemBench Design

The AdaptMemBench framework separates the user interface, validation, and output of the benchmark from the code being measured and provides low-overhead access to PAPI. Figure 1 illustrates the building blocks of the framework. Each computational kernel of interest is coded in a pattern specification. If that pattern specification involves the polyhedral model, it is passed through a polyhedral compiler. The resulting (or original) C code is compiled together with one of several potential templates. The templates provide a uniform interface and handle code to vary the working set size to cover each portion of the memory hierarchy, along with timing, PAPI data collection, and output formatting. The use of the polyhedral model adds a great deal of flexibility in terms of exploring optimizations. The following subsection provides a brief overview of the polyhedral model. After the overview, the benchmark framework is described.

II-A Polyhedral Code Generation

Polyhedral code generation enables loop constructs to be expressed and manipulated mathematically. The iteration sets can be expressed without ordering unless a specific ordering is required. Figure 2 shows a loop nest for solving the heat equation. The associated iteration space is shown graphically as a two-dimensional space. Each node in the graph represents an iteration. The Presburger formula for this example is shown at the bottom of the figure.

Fig. 2: An example of polyhedral code generation with ISCC/ISL.

Code generation is performed on sets through polyhedral scanning; the result is control flow that produces the iterations in lexicographical order. Generating code from the set as expressed in Figure 2 reproduces the original code. Transformations on the code are realized through the application of relations (or functions).

Loop interchange is a loop transformation that switches the order of two loops. Figure 3 shows the relation used to apply loop interchange for the code in Figure 2. For the relation from {i,j} to {j,i}, we apply the transformation on the execution domain defined, using the intersection operator. More complex transformations such as tiling can be performed with ease using the polyhedral model.

Fig. 3: An illustration of loop interchange using ISCC.
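To make the effect of loop interchange concrete, the following C sketch executes the same iteration set in {i,j} order and in {j,i} order and checks that both schedules produce the same data. It is only an illustration: the statement macro `STM`, the extent `N`, and the function names are hypothetical and are not taken from Figures 2 or 3.

```c
#define N 8  /* hypothetical grid extent, chosen only for illustration */

/* Statement executed at each iteration point (i, j). */
#define STM(A, i, j) ((A)[(i)][(j)] = (double)((i) * N + (j)))

/* Original schedule: iterations visited in lexicographic {i, j} order. */
void fill_ij(double A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            STM(A, i, j);
}

/* Interchanged schedule {i, j} -> {j, i}: same iteration set, new order. */
void fill_ji(double A[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            STM(A, i, j);
}

/* Returns 1 when both schedules produce identical arrays, 0 otherwise. */
int schedules_agree(void) {
    double X[N][N], Y[N][N];
    fill_ij(X);
    fill_ji(Y);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (X[i][j] != Y[i][j]) return 0;
    return 1;
}
```

Because the statement has no dependence between iterations, the two orders are interchangeable; only the memory access order, and therefore the cache behavior, differs.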

The polyhedral model represents iteration spaces that are affine. A significant amount of work has been done to expand the iteration spaces and schedules that can be represented, including work that uses schedule trees for code generation within ISL [24]. The Omega+ code generation tool is also able to incorporate iteration bounds based on runtime information using uninterpreted functions [11]. Even with recent advances, the polyhedral model cannot express all C kernels, and is, therefore, an optional step in the benchmark specification.

In the proposed benchmark, to automatically generate schedules for the application kernel initialization, execution, and validation, the ISCC [25] polyhedral code generation tool is used, which offers an interface to the functionality provided by Integer Set Library (ISL) [24] and Barvinok library [23]. This tool enables the end user to manipulate sets and relations and generate source code reflecting their input.

...
//Execution
for(int k = 0; k < ntimes; k++) {
    #pragma omp parallel for CLAUSE
    #include "<kernel>_run.c"
}
...
Listing 1: The inner-most section of the Unified Data Spaces Template.
...
//Execution
#pragma omp parallel
{
  int t_id = omp_get_thread_num();
  for(int k = 0; k < ntimes; k++) {
      #include "<kernel>_run.c"
  }
}
...
Listing 2: The inner-most section of the Independent Data Spaces Template.

II-B Benchmark Implementation

The proposed framework uses a set of generic benchmark driver templates for all variations of the access patterns. These driver templates provide a standard command line interface and a standard machine parsable and human readable output. Currently, the framework supports the following three varieties of benchmark driver templates for shared memory applications:

  1. The Unified Data Spaces Template (Listing 1): The standard benchmarking template that utilizes unified data spaces shared among threads. It uses the work sharing and scheduling constructs offered by OpenMP to distribute resources among threads. The OpenMP clauses can be easily configured using the framework.

  2. The Independent Data Spaces Template (Listing 2): This is a modified version of the unified data spaces template. It supports distinct data spaces separated into different memory regions that are accessed without any overlap, avoiding false sharing. As indicated by the experimental results that follow, benchmarking in this paradigm yields the best performance in the higher cache levels.

  3. The PAPI Measurement Template: This template is built on top of the above two templates, using PAPI’s low level API. The user is given an option to choose between the above two benchmarking paradigms and input the PAPI events to be recorded.

Input pattern specifications consist of a header file and a set of ISCC input files. The initial step is to run the polyhedral code generator for the ISCC input files and transform them into corresponding C code files. The user-chosen driver template is then updated with the appropriate header and source files to create the customized .cpp benchmark driver code file. This benchmark driver code is compiled and executed with runtime arguments such as working set size, thread count and other parameters depending on the access pattern for which the benchmark is run.

Fig. 4: Implementation of the benchmark.

The purpose and functionality of each component in the pattern specifications shown in Figure 4 are described below:

  1. Header file (<kernel>.h):
    This file contains the definitions of the memory mappings, statement macros, and the allocation code.

    • Memory Mapping: Indicates how the statements should map into memory using iterators as input.

    • Statement Macros: The definition of the statement macros substituted in each of the C code files generated from the ISCC input files. Any data referred to within the statement should be referred to indirectly through the data mapping.

    • Allocation Code: Specifies memory allocation of the data spaces used in the given application kernel.

  2. Initialization steps (<kernel>_init.in):
    This ISCC input file specifies the schedule for which the data domains allocated in the header file are initialized. In the C code file generated with ISCC, the associated statement macro specifying initialization steps is substituted when the benchmark is executed.

  3. Execution Schedule (<kernel>_run.in):
    An ISCC input file that defines the iteration space in which the access pattern is executed. The application kernel defined as a macro in the header file is replaced in the .c file generated. This code file consists of the for loop constructs associated with the execution domain which will be substituted in the driver when executed.

  4. Validation condition (<kernel>_val.in):
    This ISCC input file describes the schedule under which the results are validated after the kernel executes. The corresponding generated C code file is then called in the header file to validate the results.
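The way these components fit together can be sketched in a single C file. This is a minimal illustration under stated assumptions, not the framework's actual templates: the kernel name `Scale`, the macros `Scale_init`, `Scale_run`, and `Scale_val`, and the function `run_scale_benchmark` are all hypothetical, and the loops marked as stand-ins take the place of the generated `<kernel>_init.c`, `<kernel>_run.c`, and validation files that a driver would include.

```c
#include <stdlib.h>

/* --- Contents of a hypothetical <kernel>.h --- */
/* Memory mapping: iterator -> storage location. */
#define X_map(i) X[(i)]
#define Y_map(i) Y[(i)]
/* Statement macros substituted into the init, run, and validation schedules. */
#define Scale_init(i) do { X_map(i) = 0.0; Y_map(i) = (double)(i); } while (0)
#define Scale_run(i)  do { X_map(i) = 2.0 * Y_map(i); } while (0)
#define Scale_val(i)  (X_map(i) == 2.0 * (double)(i))

/* --- What a driver template does around the included schedules --- */
/* Returns 1 if validation passes, 0 if it fails, -1 on allocation failure. */
int run_scale_benchmark(int n, int ntimes) {
    /* Allocation code */
    double *X = (double *) malloc(sizeof(double) * (size_t) n);
    double *Y = (double *) malloc(sizeof(double) * (size_t) n);
    if (!X || !Y) { free(X); free(Y); return -1; }

    /* Stand-in for the generated initialization schedule */
    for (int i = 0; i < n; i++) Scale_init(i);

    /* Stand-in for the generated execution schedule, repeated ntimes */
    for (int k = 0; k < ntimes; k++)
        for (int i = 0; i < n; i++) Scale_run(i);

    /* Stand-in for the generated validation schedule */
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (!Scale_val(i)) ok = 0;

    free(X);
    free(Y);
    return ok;
}
```

Since every data reference goes through a mapping macro, changing the storage layout only requires editing `X_map` and `Y_map`; the schedules themselves stay untouched.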

III Case Studies

The performance characteristics of a set of computational kernels commonly used in performance studies are presented in this section. The kernels are STREAM’s triad and Jacobi 1D, 2D, and 3D. The kernels were chosen for their simplicity and well understood performance behaviors. The use cases demonstrate the need to separate implementation concerns when studying the performance of even simple kernels. The structure provided by AdaptMemBench improves the breadth of data collected and makes experiment reliability and reproducibility more easily attained. For each kernel we explore the impact of implicit locks, shared data spaces, and false sharing in SMP systems.

Hardware: Experiments were run on one of the nodes in the R2 HPC cluster at Boise State University, which has a 2.40GHz dual Intel Xeon E5-2680 v4 CPU. This node consists of two NUMA domains each containing 14 cores. Each core has a dedicated 32K L1 data cache and 256K L2 cache. The 35 MB L3 cache is shared among all the cores in each NUMA domain. The size of each cache line in this architecture is 64 bytes.

Compilers: GNU’s gcc (version 6.3). When building C++ benchmark drivers, -fopenmp and -O3 optimization flags were used. The -lpapi flag was set for PAPI-enabled benchmark drivers.

Profiling Tool: The benchmark drivers are instrumented with the Performance API (PAPI) [16] library to access performance counters across the CPUs evaluated. PAPI is used to measure cache hits and the requests for exclusive access to cache lines.

Problem size: We executed the benchmarks with problem sizes across all levels of cache and those which exceeded the last-level cache and fit into the main memory. Each benchmark is executed for 1000 time iterations. The number of repetitions is configurable.
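As a rough guide to choosing problem sizes for a given cache level, the element count per array can be derived from the cache capacity and the number of data streams. The helper below is a hypothetical sketch (the name `max_elems_in_cache` is not part of the framework) and assumes 8-byte doubles and that the whole cache is available to the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element count n such that `streams` arrays of n doubles
   occupy at most cache_bytes in total. */
size_t max_elems_in_cache(size_t cache_bytes, size_t streams) {
    assert(streams > 0);  /* at least one data stream is required */
    return cache_bytes / (streams * sizeof(double));
}
```

For the triad kernel with three arrays, a 32 KB L1 data cache holds at most 1365 elements per array and a 256 KB L2 cache holds at most 10922, so sweeping n across these thresholds exercises each level in turn.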

III-A The Triad benchmark

We demonstrate the simplicity of AdaptMemBench by implementing the triad kernel from the STREAM benchmark, chosen for its brevity and well-known performance. Listings 3 and 4, along with the templates in Listings 1 and 2, illustrate the process of creating a custom benchmark from input C code files, bypassing the polyhedral code generator. Alternatively, the kernel could have been expressed as a set: { Triad_run[j] : 0 <= j < n }. The results are equivalent.

//Allocation Code
#define Triad_alloc double* A = (double *) malloc(sizeof(double) * n); \
                    double* B = (double *) malloc(sizeof(double) * n); \
                    double* C = (double *) malloc(sizeof(double) * n);
//Memory Mapping
#define A_map(i) A[i]
#define B_map(i) B[i]
#define C_map(i) C[i]
//Initialization
#define Triad_init(i) A_map(i) = 1.0; B_map(i) = 3.0; C_map(i) = 4.0;
//Statement Definition
#define Triad_run(i) A_map(i) = B_map(i) + scalar * C_map(i);
//OpenMP clause
#define CLAUSE schedule(static)
Listing 3: Header file <triad.h> for the triad benchmark.
for (int j = 0; j < n; j++){
       Triad_run(j);
}
Listing 4: The execution schedule of the benchmark driver generated by combining the input file <triad_run.c> and the template.
Fig. 5: The impact of OpenMP barriers on achieved memory bandwidth.

Cost of Barriers in OpenMP

We use the generated triad benchmark to evaluate the overhead associated with barriers in OpenMP by using the nowait clause. With the AdaptMemBench framework, all that is required is to modify the definition of the macro CLAUSE to be nowait. As the memory bandwidth results in Figure 5 indicate, the barrier causes significant overhead, and removing it with the nowait clause yields a reasonable speedup. Though this modification may not be possible for all computations, e.g., those that have loop-carried dependencies, our intention is just to demonstrate the performance degradation caused by compiler-induced locks using the simple triad kernel.
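One standard way to express barrier removal in OpenMP is to attach nowait to a work-sharing `for` inside an enclosing parallel region. The sketch below follows that form; it is an illustration, not the driver code generated by the framework, and the function name `triad_nowait` is hypothetical. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

```c
/* Repeated triad with the implicit barrier at the end of each
   work-sharing loop removed. */
void triad_nowait(double *A, const double *B, const double *C,
                  double scalar, int n, int ntimes) {
    #pragma omp parallel
    {
        for (int k = 0; k < ntimes; k++) {
            /* With schedule(static) and an unchanged trip count, each
               thread receives the same iterations in every repetition,
               so no thread touches another thread's elements and the
               barrier between repetitions can be skipped. */
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < n; i++)
                A[i] = B[i] + scalar * C[i];
        }
    }   /* the end of the parallel region still synchronizes all threads */
}
```

The same change would not be safe for a kernel in which one repetition reads elements written by a different thread in the previous repetition.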

for(int k = 0; k < ntimes; k++) {
    #pragma omp parallel for\
        schedule(static, n/t) nowait
    for (int i = 0; i < n; i++){
       A[i] = B[i] + scalar * C[i];
    }
}
Listing 5: Utilizing the OpenMP work sharing construct for data spaces of size n and t number of threads.
int N = n/t;
#pragma omp parallel
{
  int t_id = omp_get_thread_num();
  for(int k = 0; k < ntimes; k++) {
    for (int i = 0; i < N; i++){
      A[t_id][i] = B[t_id][i] + scalar * C[t_id][i];
    }
  }
}
Listing 6: The resultant triad benchmark using the independent data spaces driver template

Overhead of shared data spaces

The shape of the curve in the performance results for the triad benchmark is disconcerting. Specifically, bandwidth in L1 is lower than in L2. Shared-memory parallel execution incurs a significant amount of overhead. We explore the resultant performance bottleneck with two variants of the triad benchmark: unified data spaces and independent data spaces.

The first variant is implemented with unified data spaces using OpenMP’s work sharing constructs. Listing 5 is a part of the benchmark driver generated from the unified data spaces template with the macro CLAUSE in triad.h set to schedule(static, n/t) nowait.

The second benchmark uses the independent data spaces template, implemented with a distinct data space for each thread (Listing 6). The only change needed in the benchmark specification is to the data mapping in the header file. The listing shows the result after macro expansion.

Memory bandwidth results in Figure 6 clearly indicate the benefit of using distinct data spaces over the shared data spaces variant implemented using OpenMP work-sharing and scheduling constructs. Using independent data spaces separates data domains into separate memory regions, eliminating cross-thread communication. This in turn eliminates performance bottlenecks, for example, avoiding multiple threads accessing the same cache line. We observe an approximate two-fold performance boost in the L1 cache with this approach compared to unified data spaces using OpenMP work-sharing constructs, which is deemed to be efficient.
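The idea of independent data spaces can be sketched as follows. Unlike Listing 6, which indexes a shared two-dimensional array by thread id, this hypothetical variant lets each thread allocate its own private arrays inside the parallel region; that choice, the function name `triad_independent`, and the checksum are assumptions made for illustration. Without OpenMP support the pragma is ignored and a single thread runs.

```c
#include <stdlib.h>

/* Each thread runs triad on its own privately allocated arrays of N
   doubles. Returns the per-thread sum of A averaged over the threads
   that ran, or -1.0 if no thread could allocate its arrays. */
double triad_independent(int N, int ntimes, double scalar) {
    double checksum = 0.0;  /* sum of A over all threads that ran */
    int ran = 0;            /* number of threads whose allocation succeeded */

    #pragma omp parallel reduction(+:checksum, ran)
    {
        double *A = (double *) malloc(sizeof(double) * (size_t) N);
        double *B = (double *) malloc(sizeof(double) * (size_t) N);
        double *C = (double *) malloc(sizeof(double) * (size_t) N);

        if (A && B && C) {
            for (int i = 0; i < N; i++) { A[i] = 1.0; B[i] = 3.0; C[i] = 4.0; }

            /* No data is shared, so no synchronization is needed here. */
            for (int k = 0; k < ntimes; k++)
                for (int i = 0; i < N; i++)
                    A[i] = B[i] + scalar * C[i];

            for (int i = 0; i < N; i++) checksum += A[i];
            ran += 1;
        }
        free(A);
        free(B);
        free(C);
    }
    return ran > 0 ? checksum / ran : -1.0;
}
```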

Fig. 6: Illustrating the overhead associated with data shared among threads.
Fig. 7: An experiment to identify the number of data streams fetching simultaneously that gives optimal performance on parallel execution with 28 threads.
for (int i = 0; i < n/2; i++){
   A[i] = B[i] + scalar * C[i];
   A[i+n/2] = B[i+n/2] + scalar * C[i+n/2];
}
Listing 7: Customized benchmark driver with unified spaces illustrating interleaved optimization for triad
Fig. 8: Illustration of interleaved optimization with a single data space of size n.

Scheduling to Maximize Bandwidth

The triad pattern that comprises three data spaces is often considered to yield optimal performance in a given architecture. With the configurability offered by our benchmarking framework, we expand the number of data spaces evaluated from 3 (in triad) to 20 data streams that are simultaneously read in the body of the loop. This is achieved by modifying the statement definition and memory allocation specifications in the header file.

Figure 7 shows the results of running this experiment in parallel with 28 threads. The memory bandwidth values are inconsistent for working set sizes that sit in the L1 cache, since small data sets are shared among a large number of threads. Considering working sets in the L2 cache, where the performance is more consistent, we observe that the achieved memory bandwidth peaks at 11 data spaces and is considerably higher than for triad, which comprises 3 data streams. This experiment led us to reschedule the execution of triad.

Listing 7 describes the interleaved optimization implemented for triad. This splits each data space of size n into two independent blocks of size n/2 each. The two blocks are fused to execute in a single iteration, so elements in both blocks are accessed simultaneously. Instead of reading three data spaces at the same time, six data streams are accessed concurrently, better utilizing the available prefetching lines. Figure 8 illustrates how a single data space is split into two blocks that are fused and accessed simultaneously within a single iteration.
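A self-contained version of this transformation, written as a pair of C functions, makes it easy to check that the interleaved schedule computes the same result as the naïve one. The function names are hypothetical, and the sketch assumes an even n, as the loop in Listing 7 implicitly does.

```c
#include <assert.h>

/* Naive triad: one pass over three data streams. */
void triad_naive(double *A, const double *B, const double *C,
                 double scalar, int n) {
    for (int i = 0; i < n; i++)
        A[i] = B[i] + scalar * C[i];
}

/* Interleaved triad: each array is treated as two blocks of n/2
   elements, and one iteration touches the same offset in both blocks,
   giving six concurrent data streams instead of three. */
void triad_interleaved(double *A, const double *B, const double *C,
                       double scalar, int n) {
    assert(n % 2 == 0);  /* an odd n would leave the last element untouched */
    int h = n / 2;
    for (int i = 0; i < h; i++) {
        A[i]     = B[i]     + scalar * C[i];
        A[i + h] = B[i + h] + scalar * C[i + h];
    }
}
```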

Fig. 9: Interleaved optimization for triad is beneficial in L1 cache on parallel execution with 28 threads.

Performance results in Figure 9 illustrate the improvement in achieved bandwidth for triad in the L1 cache. A significant speedup is observed from the naïve triad operation implemented with independent data spaces. For working set sizes falling out of the L1 cache, this optimization is not effective due to poor prefetching. This further validates the experimental results from Figure 7, wherein we achieve higher performance with 6 data spaces (i.e., the naïve hexad operation) than 3, which is the case for triad. We attempted interleaving data spaces for triad with interleaving factors greater than two, but we obtain the highest performance when interleaved by 2, due to access to a single cache line exhibiting truly independent data spaces.

(a) Number of L1 data cache misses
(b) Number of requests to shared cache line
Fig. 10: Cache misses and cache line requests for 3-pt Jacobi 1D. Measurements for the unified data spaces are plotted along the secondary y-axis for better readability of results.
Fig. 11: Illustration of custom benchmark generation for 3-pt Jacobi 1D kernel with unified data spaces using the polyhedral model.

III-B Multidimensional Jacobi patterns

Iterative Jacobi stencils are at the core of a wide range of scientific applications and are represented in the Structured Grid motif [4]. These patterns involve nearest neighborhood computations in which each point in a multidimensional grid is iteratively updated by a subset of its neighbors. The polyhedral model is used to generate benchmark drivers for the Jacobi patterns, as it is helpful to test potential optimizations such as tiling, exercising the flexibility of AdaptMemBench.

#pragma omp parallel
{
  int t_id = omp_get_thread_num();
  for(int k = 0; k < ntimes; k++) {
     for (int i = 1; i < n - 1; i++){
        A[t_id * 8][i] = (B[t_id * 8][i - 1] + B[t_id * 8][i] + B[t_id * 8][i + 1]) * 0.33;
     }
  }
}
Listing 8: The resultant independent data spaces benchmark driver reflecting array padding for Jacobi 1D

Fig. 12: Demonstration of overhead associated with shared data spaces in SMP systems with Jacobi 1D.

3-pt Jacobi 1D benchmark

Figure 11 demonstrates the process of custom benchmark generation for this pattern using polyhedral code generation for the input pattern specifications using the unified data spaces benchmark template. Allocating independent spaces is advantageous for this pattern as well, as reflected by the memory bandwidth results in Figure 12. However, performance scaling in L1 is still an issue, due to false sharing.

Impact of false sharing

In symmetric multiprocessing systems, where each processor core has dedicated local cache(s), false sharing is a well-known performance issue. False sharing occurs when multiple threads modify independent variables that share the same cache line, requiring unnecessary cache flushes and subsequent loads. The potential source of false sharing is multiple threads accessing dynamically allocated or global shared data structures simultaneously.

The impact of false sharing is quantified by recording performance counters with PAPI. Figure 10(a) reports the L1 data cache misses and Figure 10(b) the requests for exclusive access to shared cache lines. The shared data spaces incur nearly 10 times more cache misses than the independent data spaces. Note that Figures 10(a) and 10(b) each have a primary and a secondary y-axis; the data plotted using green triangles are associated with the secondary axis (on the right). Independent data spaces record fewer cache misses, but the number of exclusive requests to clean cache lines for the three cases in Figure 10(b) varies much more in L1 for the case that suffers from false sharing.

Padding arrays is a common solution to overcome false sharing. In the architecture evaluated, each cache line is 64 bytes. As shown in Listing 8, the data spaces of type double are padded by a factor of 8 so that each element is allocated in a different cache line, avoiding false sharing. With AdaptMemBench, this can be achieved just by modifying the memory mapping. Eliminating false sharing leads to a drastic speedup in the L1 cache, as the results in Figure 12 reflect. The PAPI results were collected by running the same code configurations with a PAPI driver within the framework, and the memory bandwidth results are exclusive of the minimal overhead of accessing hardware counters.
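The padding factor follows directly from the cache line size: 64 bytes divided by 8 bytes per double gives 8 elements per line. The sketch below expresses padding as a memory mapping macro on a one-dimensional array of per-thread slots, which is a simplification of the row padding in Listing 8; the names `PAD`, `S_map`, and `slot_distance_bytes` are hypothetical, and a 64-byte line with 8-byte doubles is assumed.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64
/* Number of doubles per cache line; 8 when sizeof(double) is 8. */
#define PAD (CACHE_LINE_BYTES / sizeof(double))

/* Padded memory mapping: logical slot t -> physical index t * PAD,
   so consecutive logical slots are a full cache line apart. */
#define S_map(S, t) (S)[(size_t)(t) * PAD]

/* Byte distance between the storage of logical slots t and t + 1. */
size_t slot_distance_bytes(const double *S, int t) {
    return (size_t)((const char *) &S_map(S, t + 1) -
                    (const char *) &S_map(S, t));
}
```

With this mapping, a buffer for T logical slots needs T * PAD doubles, so the memory footprint grows by the padding factor in exchange for keeping each slot's line private to one thread.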

Fig. 13: Illustration of custom benchmark generation for 9-pt Jacobi 2D kernel with unified data spaces using the polyhedral model.
Fig. 14: Analyzing the performance bottleneck caused by shared data spaces in Jacobi 2D.
Fig. 15: Impact of performance with varying memory allocation in Jacobi 3D.

Higher dimensional Jacobi patterns

The process of creating a custom benchmark driver for 9-pt Jacobi 2D using unified data spaces is illustrated in Figure 13. A 7-point Jacobi 3D benchmark driver can be similarly created with an added dimension to the code generation script and corresponding modifications to the pattern specification.

From Figures 14 and 15, it can be noted that separating data spaces into different memory regions is beneficial for both Jacobi 2D and Jacobi 3D. However, false sharing does not affect performance here, and both patterns struggle to scale in the L1 cache.

Tiling Optimization for Jacobi transformations

Rectangular space tiling [8] is one of the traditional optimization strategies for stencil computations. Rectangular tiling breaks a large iteration space into a set of smaller iteration spaces, which improves spatial and temporal locality. When iterating over a large two-dimensional data space applying a multipoint stencil, it is highly probable that one of the neighbors accessed would have fallen out of the cache by the time the iteration comes around to the same point again. Tiling the iteration space eliminates such cache misses and improves data reuse. This optimization is explored, not to provide another data point on the impact of tiling, but to demonstrate the advantages of including polyhedral code representations in the framework.
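Rectangular tiling can be illustrated on a small two-dimensional stencil sweep. The sketch below is a hand-written example, not output of the polyhedral code generator: the grid extent `NX`, the tile sizes `TJ` and `TI`, the 5-point stencil, and all function names are hypothetical choices. It checks that the tiled traversal produces the same result as the untiled one.

```c
#define NX 40  /* hypothetical grid extent */
#define TJ 16  /* hypothetical tile size in j */
#define TI 8   /* hypothetical tile size in i */

static int imin(int a, int b) { return a < b ? a : b; }

/* One untiled sweep of a 5-point stencil over the grid interior. */
void sweep(double B[NX][NX], double A[NX][NX]) {
    for (int j = 1; j < NX - 1; j++)
        for (int i = 1; i < NX - 1; i++)
            A[j][i] = 0.2 * (B[j][i] + B[j - 1][i] + B[j + 1][i] +
                             B[j][i - 1] + B[j][i + 1]);
}

/* The same sweep with the interior split into TJ x TI rectangular tiles. */
void sweep_tiled(double B[NX][NX], double A[NX][NX]) {
    for (int jj = 1; jj < NX - 1; jj += TJ)
        for (int ii = 1; ii < NX - 1; ii += TI)
            for (int j = jj; j < imin(jj + TJ, NX - 1); j++)
                for (int i = ii; i < imin(ii + TI, NX - 1); i++)
                    A[j][i] = 0.2 * (B[j][i] + B[j - 1][i] + B[j + 1][i] +
                                     B[j][i - 1] + B[j][i + 1]);
}

/* Returns 1 when the tiled and untiled sweeps agree on every point. */
int tiled_matches_untiled(void) {
    static double B[NX][NX], A1[NX][NX], A2[NX][NX];
    for (int j = 0; j < NX; j++)
        for (int i = 0; i < NX; i++) {
            B[j][i] = (double)(j * NX + i);
            A1[j][i] = 0.0;
            A2[j][i] = 0.0;
        }
    sweep(B, A1);
    sweep_tiled(B, A2);
    for (int j = 0; j < NX; j++)
        for (int i = 0; i < NX; i++)
            if (A1[j][i] != A2[j][i]) return 0;
    return 1;
}
```

The sweep reads only B and writes only A, so every iteration order is legal and tiling changes nothing but the order in which memory is touched.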

Tiling Three-dimensional Jacobi

We implement this spatial tiling strategy on the 7-point Jacobi 3D transformation. The initial approach is to tile the 3D grid in all directions. Listing 9 shows the ISCC input script and the corresponding generated C code file. AdaptMemBench simplifies the testing of this optimization: this ISCC script is used as the execution schedule file, while the other pattern specifications remain the same as for the naïve Jacobi 3D benchmark.

Domain_run := [n] -> {
    STM_3DS_run[k,j,i] : i <= n and i >= 1 and j<=n and j >= 1 and k<=n and k >= 1;
};
Tiling := [n] -> {
    STM_3DS_run[k,j,i] -> STM_3DS_run[tk,tj,ti,k,j,i]:exists rk,rj,ri:
                    0<=rk<32 and k=tk*32+rk
                and 0<=rj<64 and j=tj*64+rj
                and 0<=ri<16 and i=ti*16+ri;
};
codegen (Tiling * Domain_run);

for (int c0 = 0; c0 <= floord(n, 32); c0 += 1)
  for (int c1 = 0; c1 <= n / 64; c1 += 1)
    for (int c2 = 0; c2 <= n / 16; c2 += 1)
      for (int c3 = max(1, 32 * c0); c3 <= min(n, 32 * c0 + 31); c3 += 1)
        for (int c4 = max(1, 64 * c1); c4 <= min(n, 64 * c1 + 63); c4 += 1)
          for (int c5 = max(1, 16 * c2); c5 <= min(n, 16 * c2 + 15); c5 += 1)
            STM_3DS_run(c3, c4, c5);
Listing 9: ISCC script Jacobi3D_xyz_tiled.in and the generated C code file Jacobi3D_xyz_tiled.c.

We initially block the iteration space in all three dimensions, with block sizes of 32, 64, and 16 in the k, j, and i dimensions, respectively, as in Listing 9. Our results agree with previous experimental evaluation showing no performance gain [10].

Fig. 16: Achieved memory bandwidth with 2D Cache blocking for Jacobi 3D with a tile sweep for sizes ranging from 16 to 64 in both the tiled directions.

We implement the partial blocking strategy [17], in which blocking is done in the two least significant dimensions alone. This results in a series of 2D slices that are stacked one over the other in the unblocked dimension. We tested the efficacy of this technique on grid sizes up to 256, with block sizes ranging from 16 to 64 in both directions. This approach does not offer any speedup either, if we compare the peak bandwidth from Figure 15 with the most performant block area in Figure 16. Large on-chip caches affect cache reuse and thus provide no performance gain with this blocking strategy. Increasing grid sizes would be impractical since many scientific applications, such as computational fluid dynamics, typically limit the box size [1].

These results confirm conclusions from previous studies [10, 5, 9] on these tiling strategies, which were performed for serial execution. We extend these studies to parallel applications and systems using the flexibility of the polyhedral model offered by AdaptMemBench. Several temporal tiling strategies [3, 12, 26, 7] have proved to be effective for higher dimensional stencil patterns; they are not evaluated in this paper, but the framework can accommodate them.

IV Related Work

Several categories of memory benchmarks have been developed over the years. Most relevant to our work are the streaming bandwidth benchmarks, which use a predefined set of access patterns to measure achieved memory bandwidth, and the stencil benchmarks. The following subsections present representatives from each benchmarking category.

Our benchmarking framework adds capabilities beyond these benchmarks by offering configurability to explore the performance of scientific applications. It emulates application-specific memory access patterns using the mechanism of polyhedral code generation. It is a flexible and consistent testbed for evaluating various code optimizations without needing to port or modify the entire application.

IV-A Streaming Bandwidth Benchmarks

STREAM [14] is a microbenchmark that measures sustainable memory bandwidth and the corresponding computation rates for the performance evaluation of high performance computing systems. STREAM measures the performance of four operations: COPY (a[i] = b[i], measures data transfer without arithmetic), SCALE (a[i] = q*b[i], with a simple arithmetic operation), SUM (a[i] = b[i] + c[i], tests multiple load and store operations) and TRIAD (a[i] = b[i] + q*c[i]). The STREAM benchmark does not measure memory bandwidth for small data sizes in the higher levels of the memory hierarchy, i.e., in level 1 cache and some portions of level 2 cache, depending on the target architecture. AdaptMemBench calculates the cumulative computation time for the overall execution of the kernel, enabling it to explore achieved performance in higher levels of cache.

MultiMAPS [19] is a benchmark probe designed to measure platform-specific bandwidths. Similar to STREAM, it accesses data arrays repeatedly. In MultiMAPS, the access pattern is varied in stride and array size, thereby varying spatial and temporal locality. It measures the achieved memory bandwidth of different memory levels, different working set sizes, and a small set of access patterns. This benchmark is most closely related to ours. The primary difference is that AdaptMemBench can include arbitrary memory access patterns and test optimization strategies.

Stanza Triad [10] is a microbenchmark derived from STREAM that measures the impact of prefetching on modern microprocessors. It works by comparing bandwidth measurements while varying the stanza length and stride of access for different data sizes, and it predicts performance. Being a serial benchmark, it cannot be scaled to parallel applications, and it cannot be configured for patterns other than triad.

IV-B Synthetic Memory Benchmarks

Apex-MAP [20] is a synthetic benchmark that characterizes application performance. It has been implemented both sequentially [21] and in parallel using MPI [22]. This benchmark approximates memory access performance based on concurrent address streams, considering regularity of access pattern, spatial locality, and temporal reuse. Using a set of characteristic performance factors, its execution profile is tuned such that these factors act as a proxy for the performance behavior of code with similar characteristics.

Stencil Probe [10] is a lightweight, flexible, stencil-specific benchmark that explores the behavior of grid-based computations. Stencil Probe mimics the kernels of applications that use stencils on regular grids by modifying the operations in the inner loop of the benchmark. Similar to Stanza Triad, this benchmark is serial and cannot be extended to large-scale parallel applications and systems. Furthermore, this probe is not well suited to testing code optimizations, since each transformation requires rewriting the entire benchmark code.

Bandwidth [18] is an artificial benchmark to measure memory bandwidth on x86 and x86_64 based architectures. This benchmark can be used to evaluate the performance of the memory subsystem, the bus architecture, the cache architecture and the processor. Memory bandwidth is measured by performing sequential and random reads and writes of varying sizes across the levels of the memory hierarchy. However, this benchmark is neither application-specific nor customizable. It measures performance based on a predefined set of memory access patterns and cannot be configured specifically to a target application. Moreover, this benchmark executes serially and cannot be scaled to parallel systems and applications.

IV-C Application Benchmarks

Application benchmarks are used as exemplars of application patterns. The NAS Parallel Benchmarks [2] comprise benchmarks developed to represent the major types of computations performed by highly parallel supercomputers and to mimic the computation and data movement characteristics of scientific applications. The suite consists of five parallel kernel benchmarks (EP - an embarrassingly parallel kernel, MG - a simplified multigrid kernel, CG - a conjugate gradient method, FT - fast Fourier transforms and IS - a large integer sort) and three simulated application benchmarks (LU - lower and upper triangular system solution, SP - scalar pentadiagonal solver and BT - set of block tridiagonal equations).

The HPC Challenge benchmark suite [13] provides a set of benchmarks that define the performance boundaries of future Petascale computing systems. This hybrid benchmark suite examines the performance of HPC architectures as a function of memory access characteristics using different access patterns. It is composed of well-known computational kernels such as STREAM, HPL [6], matrix multiply, parallel matrix transpose, FFT, RandomAccess and bandwidth/latency tests that span high and low spatial and temporal locality space.

V Conclusions

This paper presents a configurable benchmark framework that captures application-specific memory access patterns that can be expressed using the polyhedral model. The use of the polyhedral model and associated code generation tools allows for quick development and experimentation with optimization strategies. The AdaptMemBench framework was used to demonstrate the benefit of using distinct data spaces on threads and the overhead of OpenMP constructs and false sharing when targeting the L1 cache.

References

  • [1] M Adams, P O Schwartz, H Johansen, P Colella, T J Ligocki, D Martin, ND Keen, Dan Graves, D Modiano, Brian Van Straalen, et al. Chombo software package for amr applications-design document. Technical report, 2015.
  • [2] DH Bailey, E Barszcz, JT Barton, DS Browning, RL Carter, L Dagum, RA Fatoohi, Paul O Frederickson, Thomas A L, Rob S Schreiber, et al. The nas parallel benchmarks. The International Journal of Supercomputing Applications, 5(3):63–73, 1991.
  • [3] V Bandishti, I Pananilath, and U Bondhugula. Tiling stencil computations to maximize parallelism. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.
  • [4] P Colella. Defining software requirements for scientific computing. 2004.
  • [5] K Datta, S Kamil, S Williams, L Oliker, J Shalf, and K Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM review, 51(1):129–159, 2009.
  • [6] Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. The linpack benchmark: past, present and future. Concurrency and Computation: practice and experience, 15(9):803–820, 2003.
  • [7] Matteo Frigo and Volker Strumpen. Cache oblivious stencil computations. In Proceedings of the 19th annual international conference on Supercomputing, pages 361–366. ACM, 2005.
  • [8] François Irigoin and Remi Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 319–329. ACM, 1988.
  • [9] Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 workshop on Memory system performance and correctness, pages 51–60. ACM, 2006.
  • [10] Shoaib Kamil, Parry Husbands, Leonid Oliker, John Shalf, and Katherine Yelick. Impact of modern memory subsystems on cache optimizations for stencil computations. In Proceedings of the 2005 workshop on Memory system performance, pages 36–43. ACM, 2005.
  • [11] Wayne Kelly. Optimization within a unified transformation framework. Technical report, 1998.
  • [12] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, Jagannathan Ramanujam, Atanas Rountev, and Ponnuswamy Sadayappan. Effective automatic parallelization of stencil computations. In ACM sigplan notices, volume 42, pages 235–244. ACM, 2007.
  • [13] P Luszczek, J J Dongarra, D Koester, R Rabenseifner, B Lucas, J Kepner, J McCalpin, D Bailey, and D Takahashi. Introduction to the hpc challenge benchmark suite. Technical report, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US), 2005.
  • [14] John D. McCalpin. Stream: Sustainable memory bandwidth in high performance computers. 1991-2007. A continually updated technical report. http://www.cs.virginia.edu/stream/.
  • [15] John D McCalpin. A survey of memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, 19:25, 1995.
  • [16] P J Mucci, S Browne, C Deane, and G Ho. Papi: A portable interface to hardware performance counters. In Proceedings of the department of defense HPCMP users group conference, volume 710, 1999.
  • [17] Gabriel Rivera and Chau-Wen Tseng. Tiling optimizations for 3d scientific computations. In Proceedings of the 2000 ACM/IEEE conference on Supercomputing, page 32. IEEE Computer Society, 2000.
  • [18] Zack Smith. Bandwidth: a memory bandwidth benchmark, 2008.
  • [19] A Snavely, L Carrington, N Wolter, J Labarta, R Badia, and A Purkayastha. A framework for performance modeling and prediction. In Supercomputing 2002, pages 21–21. IEEE, 2002.
  • [20] E. Strohmaier and Hongzhang Shan. Apex-map: A global data access benchmark to analyze hpc systems and parallel programming paradigms. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, pages 49–49, Nov 2005.
  • [21] Erich Strohmaier and Hongzhang Shan. Architecture independent performance characterization and benchmarking for scientific applications. In Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004.(MASCOTS 2004). Proceedings. The IEEE Computer Society’s 12th Annual International Symposium on, pages 467–474. IEEE, 2004.
  • [22] Erich Strohmaier and Hongzhang Shan. Apex-map: A synthetic scalable benchmark probe to explore data access performance on highly parallel systems. In European Conference on Parallel Processing, pages 114–123. Springer, 2005.
  • [23] Sven Verdoolaege. barvinok: User guide (Version 0.23). Electronically available at http://www.kotnet.org/skimo/barvinok, 2007.
  • [24] Sven Verdoolaege. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software, pages 299–302. Springer, 2010.
  • [25] Sven Verdoolaege and Tobias Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT’12), Paris, France, pages 1–16, 2012.
  • [26] David Wonnacott. Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pages 171–180. IEEE, 2000.
  • [27] Wm A Wulf and Sally A McKee. Hitting the memory wall: implications of the obvious. ACM SIGARCH computer architecture news, 23(1):20–24, 1995.