The Granularity Gap Problem: A Hurdle for Applying Approximate Memory to Complex Data Layout

01/26/2021 ∙ by Soramichi Akiyama, et al. ∙ The University of Tokyo

The main memory access latency has not improved much for more than two decades, while the CPU performance had been increasing exponentially until recently. Approximate memory is a technique to reduce the DRAM access latency in return for reduced data integrity. It is beneficial for applications that are robust to noisy input and intermediate data, such as artificial intelligence, multimedia processing, and graph processing. To obtain reasonable outputs from applications on approximate memory, it is crucial to protect critical data while accelerating accesses to non-critical data. We refer to the minimum size of a continuous memory region to which the same error rate is applied in approximate memory as the approximation granularity. A fundamental limitation of approximate memory is that the approximation granularity is as large as a few kilobytes. However, applications may have critical and non-critical data interleaved at a smaller granularity. For example, a data structure for graph nodes can have pointers to neighboring nodes (critical) and its score (non-critical, depending on the use-case). This data structure cannot be directly mapped to approximate memory due to the gap between the approximation granularity and the granularity of data criticality. We refer to this issue as the granularity gap problem. In this paper, we first show that many applications potentially suffer from this problem. We then propose a framework to quantitatively evaluate the performance overhead of a possible method to avoid this problem using known techniques. The evaluation results show that the performance overhead is non-negligible compared to the expected benefit from approximate memory, suggesting that the granularity gap problem is a significant concern.


1. Introduction

The impact of main memory access latency on overall performance is much larger on today's computers than in the past. This is because the performance gap between the main memory and the CPU has kept widening. Figure 1 shows the single thread performance of server-class CPUs plotted over time (data provided in (Rupp, 2020) under CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). The figure shows an exponential growth of the single thread performance until recent years. In contrast, the access latency of DRAM, of which the main memory consists, has been almost the same for more than two decades. As shown in (Chang et al., 2016), the speedup of the major latency sources of DRAM over time is very marginal, especially when compared to the exponential growth of the CPU performance. Because DRAM access latency accounts for a substantial portion of a random memory access latency (for example, a random memory access latency on the machine shown in Table 2 is around 82 ns as measured by Intel MLC, while the sum of the three major latency sources shown in (Chang et al., 2016) for the DRAM module in this machine is around 45 ns (ASSOCIATION, 2013)), there is a strong need to reduce the DRAM access latency to catch up with the CPU performance.

Figure 1. Trend of single thread performance over time (normalized to SPEC CPU 2006 score 1000): the performance had been increasing exponentially until recent years.

Approximate memory is a technique to reduce the main memory latency by sacrificing its data integrity (Koppula et al., 2019; Tovletoglou et al., 2020; Raha et al., 2017; Nguyen et al., 2020). Prior works have proven that DRAM modules used for main memory can be operated much more aggressively than defined in the specifications. Concretely, the access latency can be reduced by violating the timing constraints of DRAM internal operations at the cost of an increased bit-error rate (Lee et al., 2017; Chang et al., 2016; Wang et al., 2018; Hassan et al., 2016; Zhang et al., 2016; Choi et al., 2015). These works try not to expose the increased bit-error rate to applications by operating DRAM only to the extent that the error rate is still small enough to cause zero bit-flips during applications' runtime. Approximate memory exploits the same idea more aggressively to reduce the memory access latency by leveraging the error robustness of applications themselves. It is expected to be beneficial for application domains such as deep learning, multimedia processing, graph processing, and big-data analytics because these applications are known to be robust to bit-flips to some extent (Nie et al., 2018; Chandramoorthy et al., 2019; Mahmoud et al., 2019, 2020; Chen et al., 2019).

To obtain reasonable outputs from applications on approximate memory, it is crucial to protect critical data while accelerating accesses to non-critical data. For example, suppose we want to accelerate a deep learning application on top of approximate memory. The matrices that express the weights of each layer are non-critical data because it is known that the accuracy of the trained model does not degrade much even when some bit-flips are injected into them (Chandramoorthy et al., 2019; Givaki et al., 2020; Koppula et al., 2019; Mahmoud et al., 2020). On the other hand, pointers from one layer of a network to another or the loop counter that counts the number of epochs are critical data because they must be protected from bit-flips. Therefore, we must control the error rate of memory regions depending on the criticality of the data stored in them.

A limitation of approximate memory is that the error rate can be controlled only at the granularity of a few kilobytes due to the internal structure of DRAM chips. We refer to the minimum size of continuous data to which the same error rate must be applied as the approximation granularity. The approximation granularity of a given DRAM module is determined by the row size of the module. A row is a sequence of data bits inside a DRAM module that are driven simultaneously with the same timing. Because the approximate memory we focus on is based on tweaking the timing of DRAM internal operations, the approximation granularity is equal to the row size. The row size of a DRAM module is in the range of 512 bytes to a few kilobytes. For example, the row size of a module from Micron (Micron, 2018) is 2 KB, meaning that the approximation granularity of this module is also 2 KB. This stems from the fundamental limitation of modern DRAM that many bits must be driven in parallel to catch up with requests coming from a fast CPU.

The large approximation granularity makes it difficult to gain benefit from approximate memory for applications that have critical and non-critical data interleaved at a smaller granularity (e.g., 8 bytes). We refer to this problem as the granularity gap problem. It can happen when an application manages its data as an array of a data structure that has critical members (e.g., pointers) and non-critical members (e.g., numbers whose small deviations do not affect the application's result). As a concrete example, consider an application that traverses an array of graph nodes, where each graph node has pointers to its neighboring nodes and a score that is robust to bit-flips. The non-critical data of this application cannot be stored in approximate memory due to the difference between the approximation granularity and the granularity at which critical and non-critical data are interleaved.

In this paper, we show that the granularity gap problem is a significant concern in using approximate memory. Concretely, the contributions of this paper are summarized as follows:

  1. A source code analysis of widely used benchmarks to prove that many applications potentially suffer from the granularity gap problem, extended from our previous work (Akiyama, 2020).

  2. A discussion on pros and cons of a memory layout conversion technique in the context of the granularity gap problem.

  3. A framework to quantitatively evaluate the negative performance impact of the memory layout conversion technique.

  4. Evaluation results of the negative performance impact on widely used benchmarks, which proves the significance of the granularity gap problem in using approximate memory.

The rest of the paper is structured as follows. Section 2 introduces the background knowledge of how DRAM and approximate memory work. Section 3 defines the granularity gap problem and the goal of this paper. Section 4 analyzes the source code of SPEC CPU 2006 and 2017 benchmarks. Section 5 explains a memory layout conversion technique to avoid the granularity gap problem, and points out why it is not sufficient. Section 6 describes our simulation framework and gives quantitative evidence that the granularity gap problem is a significant concern. Section 7 reviews related work and Section 8 concludes the paper.

2. Approximate Memory Architecture and Its Limitation

2.1. Overview of Approximate Memory

Approximate memory is a new technology to mitigate the performance gap between main memory and CPUs. The main idea is to reduce the latency of main memory accesses at the cost of data integrity by exploiting design margins that exist in many DRAM chips today. The CPU may read slightly different data from what was previously written to the main memory. A design margin refers to the difference between a design parameter defined in the specification of a device and the actual value with which the device can operate. In particular, we focus on the design margin in the timing of internal operations of DRAM. Even when some wait-time parameters are shortened below the specification, many DRAM chips can read stored data “almost” correctly, with a few bit-flips (errors) injected into the data (Chang et al., 2016; Kim et al., 2018). By controlling the timing of internal operations of DRAM, we can trade reduced main memory access latency for increased bit-error rate.

Approximate memory attracts much research interest due to the ever-increasing performance gap between main memory and CPUs. Chang et al. (Chang et al., 2016) measure the relationship between error rates and latency reduction for a large number of commercial DRAM chips. Tovletoglou et al. (Tovletoglou et al., 2020) propose a holistic approach to guarantee the service level agreement of virtual machines running on approximate memory. Koppula et al. (Koppula et al., 2019) re-train deep learning models on approximate memory so that the models can adapt to errors. Our previous work (Akiyama, 2019) estimates the effect of approximate memory on realistic applications without simulation by counting the number of DRAM internal operations that incur errors.

Approximate memory is especially beneficial for machine learning, multimedia, and graph processing applications, all of which incur many memory accesses and are tolerant to noisy data. For example, Stazi et al. (Stazi et al., 2018) show that allocating data in approximate memory for the x264 video encoder can yield acceptable results, and our previous work (Akiyama, 2019) shows that a graph-based search algorithm (mcf in SPEC CPU 2006) can yield the same result as an error-free execution even when some bit-flips are injected. Regarding the performance improvement, Koppula et al. (Koppula et al., 2019) show 8% speedup on average for training various DNN models on approximate memory, and Lee et al. (Lee et al., 2017) show that using Adaptive-Latency DRAM (Lee et al., 2015) for approximate memory gives 7% to 12% speedup on average for “32 benchmarks from Stream, SPEC CPU 2006, TPC and GUPS” (they do not show numbers for each benchmark, though). A performance improvement of a few to 10+ percent matters to these applications because they are typically executed in large-scale data centers, where even a few percent of relative efficiency improvement results in a huge absolute reduction of energy and/or runtime.

2.2. Design Margin: Timing Constraints

The design margin exploited to realize approximate memory that we focus on is the timing constraints of DRAM internal operations. Although there are other types of approximate memory such as approximate flash memory that leverages multiple levels of programming voltages (Guo et al., 2016) and approximate SRAM based on supply voltage scaling (Chandramoorthy et al., 2019; Yang and Murmann, 2017; Esmaeilzadeh et al., 2012), we focus on approximate DRAM in this paper. A DRAM module is operated by the memory controller that issues electric signals referred to as DRAM commands. A DRAM command triggers an internal operation of DRAM such as resetting the voltages of wires to the reference voltage. A timing constraint refers to the interval between two DRAM commands and we categorize them into two types:

  • Type 1 specifies the interval that must pass before the next DRAM command is issued. They are defined so that the internal operation of DRAM triggered by the previous command is guaranteed to finish before the next command.

  • Type 2 specifies the interval within which the same command must be issued again. This is defined so that the electric charges inside DRAM do not leak too much, by periodically refreshing them.

The actual values of the timing constraints for each type of DRAM module (e.g., DDR4-2400) are specified by JEDEC, an organization that publishes DRAM-related specifications.

Relaxing a timing constraint means either shortening or prolonging an interval defined by the specifications (i.e., “violating” the specifications). It reduces the average access latency of DRAM because commands are served faster (by shortening the Type 1 constraints) and it increases the number of useful commands executed (by prolonging the Type 2 constraints). However, it increases the possibility that bit-flips are injected into the data because there is no guarantee that a DRAM module works flawlessly when the timing constraints are violated.

Figure 2. DRAM command sequence of normal memory (top) and approximate memory (bottom): In this example, tRCD is shortened to 7.5 ns and tREF is prolonged to 128 ms, both of which reduce the average latency.

Figure 2 shows an example of a DRAM command sequence in normal memory and approximate memory. It shows four representative DRAM commands: refresh (REF), precharge (PRE), activation (ACT), and read (RD). In this example, a timing constraint called tRCD (Type 1) is shortened from 12.5 ns to 7.5 ns, and one called tREF (Type 2) is prolonged from 64 ms to 128 ms. tRCD is the interval that must pass between an ACT command and the following RD command, and it is around 11 – 13 ns depending on the DRAM module (e.g., 12.5 ns for DDR3-1600J (ASSOCIATION, 2010)). Chang et al. (Chang et al., 2016) found that only a small portion of the data bits experience errors even when tRCD is shortened below this value. We explain how an ACT command and tRCD work inside DRAM in more detail in Section 2.3. tREF is another timing constraint that specifies the longest allowed interval between two REF commands, which refresh DRAM cells to prevent them from losing stored data. Das et al. (Das et al., 2018) and Zhang et al. (Zhang et al., 2016) propose to prolong this interval because many DRAM cells can retain data for more than 64 ms in practice. Because prolonging tREF increases the amount of time during which more useful commands are served, it reduces the average DRAM access latency.

2.3. Closer Look: ACT Command Example

The left-most side of Figure 3 shows the reset state of DRAM. The circles show an array of memory cells, where each row is connected by a wordline (WL) and each column is connected by a bitline (BL). Although a DRAM chip consists of a hierarchy of many of these arrays operated in parallel, we focus on one array here without loss of generality. A black cell has electric charge in it and a white cell is empty. A cell with charge in it represents a value of 1, an empty cell represents a value of 0. In the reset state, the voltages of all the BLs are set to the reference value denoted as Vref in the figure.

An ACT command takes the target row number as its parameter (for example, the 2nd row from the top). The WL of the target row is enabled to connect the cells in the target row to the BLs. The voltages of the BLs connected to cells with charge start being pulled up, and the voltages of the BLs connected to empty cells start being pulled down. At the same time, the cells connected to the BLs enter an intermediate state (denoted by gray circles in the figure) because the capacitance of a BL is much larger than that of a cell. After tRCD (12.5 ns in the figure) has passed, the voltages of the BLs are guaranteed to be either Vref+ or Vref- as shown in the right-most side of the figure. Finally, the sense amplifiers sense the voltages of the BLs to fetch the values and buffer them.

Figure 3. ACT command copies the value of the selected row to the sense amplifiers. Left: The BLs are reset to Vref in the reset state. Right: After 12.5 ns, the BLs are guaranteed to be either Vref+ or Vref-. Middle: If tRCD is reduced to 7.5 ns, the sense amplifiers sense unstable voltages of BLs (Vx and Vy), resulting in a few bit-flips but a shorter latency.

Although the value of tRCD is strictly defined by JEDEC, real DRAM chips are known to have a large design margin in this timing parameter. Previous work (Chang et al., 2016; Kim et al., 2019) shows that many bits can be fetched correctly even when tRCD is shortened by a substantial amount. The middle of Figure 3 shows how the ACT command works when tRCD is shortened to 7.5 ns to reduce the memory access latency. Because 12.5 ns has not yet passed, the voltages of the BLs may not have reached Vref+ or Vref-; instead, they are unstable values denoted as Vx and Vy in the figure (Vref+ > Vx > Vref > Vy > Vref-). When the sense amplifiers sense the voltages of the BLs at this point, they may fetch wrong values because the differences of Vx and Vy from Vref are not large enough to sense the values reliably. The larger the differences of Vx and Vy from Vref are, the larger the probability of fetching correct values. This way, controlling the timing parameter serves as a knob for trading access latency against bit-error rate.

2.4. Limitation: Approximation Granularity

A limitation of approximate memory exploiting design margins in timing constraints is that the approximation granularity cannot be smaller than a few kilobytes. The approximation granularity refers to the minimum size of a continuous memory region to which the same error rate must be applied. This is because the same timing parameter is applied to an entire row as we describe in Section 2.3, and the size of a DRAM row (also known as a page (Jacob et al., 2007)) is as large as a few kilobytes in modern DRAM modules. This stems from a fundamental constraint that many DRAM cells must be driven in parallel so that slow DRAM can catch up with the high rate of requests coming from the CPU. Therefore, the same limitation applies to DRAM commands other than ACT and their timing constraints as well. For example, refreshing DRAM cells is also done row by row (i.e., an entire row is refreshed at once) and thus prolonging tREF also affects an entire row at once (Jacob et al., 2007).

We give two examples of the row size in real DRAM modules. A 32 Gb DRAM module from Micron (Micron, 2018) has 64 K rows, 16 banks, and 2 ranks (right-most column of Table 3 in (Micron, 2018)). The row size of this module is calculated as:

(1)   32 Gb / (64 K rows × 16 banks × 2 ranks) = 2^35 bits / 2^21 = 2^14 bits = 2 KB

Note that a column contains more than 1 bit (as explained on p. 416 of (Jacob et al., 2007)), thus multiplying the denominator of the left-hand side of equation (1) by the number of columns does not match the capacity of the module. Another 16 Gb DRAM module from SAMSUNG (Samsung Electronics Co., Ltd., 2017) also has a page size (row size) of 2 KB (right-most column of the “16 Gb Addressing Table” on page 9 in (Samsung Electronics Co., Ltd., 2017)). This can be confirmed by a calculation similar to equation (1) using the number of rows (128 K) and the number of banks (4 banks per bank group × 2 bank groups = 8 banks):

(2)   16 Gb / (128 K rows × 8 banks) = 2^34 bits / 2^20 = 2^14 bits = 2 KB
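As a sanity check, the calculations above can be written as a few lines of C. The parameters below are those of the two modules discussed above; the helper function is only an illustrative sketch, and the SAMSUNG calculation passes a rank factor of 1 because equation (2) does not use one.

#include <stdio.h>

/* Row size = module capacity / (rows x banks x ranks); illustrative only. */
static unsigned long long row_size_bytes(unsigned long long capacity_bits,
                                         unsigned long long rows,
                                         unsigned long long banks,
                                         unsigned long long ranks) {
    return (capacity_bits / 8) / (rows * banks * ranks);
}

int main(void) {
    /* 32 Gb Micron module: 64 K rows, 16 banks, 2 ranks -> prints 2048 (2 KB) */
    printf("%llu\n", row_size_bytes(32ULL << 30, 64 * 1024, 16, 2));
    /* 16 Gb SAMSUNG module: 128 K rows, 8 banks (rank factor not used) -> 2048 */
    printf("%llu\n", row_size_bytes(16ULL << 30, 128 * 1024, 8, 1));
    return 0;
}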

Although this paper focuses only on DRAM, the same limitation is applicable to other memory technologies because a fundamental performance gap exists between CPUs and any memory technology known today. This is also true for non-volatile memory technologies that are emerging to mitigate the energy and density issues of DRAM. For example, phase change memory (PCM) injects electric pulses into an entire row at once for writing (Nishi and Magyari-Kope, 2019). If we consider realizing approximate memory with PCM, for example by reducing the length of the pulses, the approximation granularity is still limited by its row size.

3. Critical Data Protection and Challenge

3.1. Critical Data Protection

Even applications that can tolerate noisy input and intermediate data have critical parts of data that must be protected from bit-flips. For example, deep learning is known to be robust to bit-flips (Chandramoorthy et al., 2019; Givaki et al., 2020; Koppula et al., 2019; Mahmoud et al., 2020), but not all parts of the data are robust to them. Pointers from one layer of a network to another or the loop counter that counts the number of epochs must be protected from bit-flips.

Protecting critical parts of data requires two steps:

  1. Detecting which parts of data are critical and which parts are non-critical

  2. Storing non-critical parts of data into approximate memory while storing the critical parts to normal memory

For step (1), there has been much effort (Akiyama, 2019; Ashraf et al., 2015; Wei et al., 2014; Nie et al., 2018; Mahmoud et al., 2019) and it is out of the scope of this work, so we assume that the discrimination between critical and non-critical data is given. For step (2), because the timing constraints are controlled per row, we must map the critical and non-critical parts of data into different DRAM rows operated with different timing parameters: the timing parameters defined in the specification and ones shortened for faster accesses.

Figure 4. Mapping critical and non-critical parts of data into different DRAM rows to protect the former while reducing access latency to the latter.

Figure 4 depicts an example of mapping critical and non-critical data into different DRAM rows. In the figure, suppose the variables N and i are critical because the former decides the size of allocated memory and the latter is a loop counter, and the memory region pointed to by A is non-critical. N and i are mapped to the first row in the figure, to which normal timing parameters are applied so that it yields no bit-flips. The data pointed to by A is mapped to the rows at the bottom, to which tweaked timing parameters are applied so that they can be accessed faster. By mapping data of different criticality to different DRAM rows as in the figure, we can protect critical data while improving the access latency to non-critical data.
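For illustration, the sketch below shows what such a mapping could look like from the programmer's side, assuming a hypothetical allocator approx_malloc() that returns memory backed by rows with relaxed timing parameters. This interface is not part of any existing system; the fallback to malloc() exists only so the sketch compiles.

#include <stdlib.h>

/* Hypothetical allocator that would return memory backed by DRAM rows
 * operated with relaxed (shortened) timing parameters. No such API exists
 * in standard systems; it falls back to malloc() here as a stand-in. */
static void *approx_malloc(size_t size) {
    return malloc(size);
}

void example(void) {
    /* Critical data (array size, loop counter): kept in normal memory,
     * i.e., rows driven with the timing parameters of the specification. */
    size_t N = 1000 * 1000;
    size_t i;

    /* Non-critical data: the large array pointed to by A is the only data
     * placed on rows with relaxed timing, so accesses to it are faster. */
    double *A = approx_malloc(N * sizeof(double));

    for (i = 0; i < N; i++)
        A[i] = 0.0;   /* a bit-flip here only perturbs a value slightly */

    free(A);
}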

3.2. The Granularity Gap Problem

struct node_t {
  int id;       // id of the node, critical
  struct node_t *r; // pointer to the right child, critical
  struct node_t *l; // pointer to the left child, critical
  double score; // score of this node, non-critical
};
int size = 1000 * sizeof(struct node_t);
struct node_t *nodes = malloc(size);
Figure 5. Critical and non-critical data interleaved in a single C struct: it is not possible to protect the critical parts while storing the non-critical parts on approximate memory due to a large approximation granularity (e.g., 2 KB).

A challenge in using approximate memory is the gap between the approximation granularity and the granularity at which critical and non-critical data are interleaved. We call this problem the granularity gap problem. We say critical and non-critical data are interleaved when they co-locate inside one instance of a C struct or a C++ class. Figure 5 shows an example of interleaved critical and non-critical data. The data structure struct node_t contains both critical and non-critical data, and a pointer named nodes points to an array of struct node_t. To gain benefit from approximate memory for this code, we must protect the critical parts (id, r, and l) while storing the non-critical part (score) in approximate memory. This is not possible because the approximation granularity is as large as a few kilobytes (say 2 KB), while we would need to enable or disable approximation at the granularity of individual members (a few bytes) to achieve it.
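The interleaving can be made concrete by printing the member offsets of struct node_t from Figure 5. The numbers in the comments assume a typical 64-bit ABI with 8-byte pointer alignment and are only illustrative.

#include <stdio.h>
#include <stddef.h>

struct node_t {
    int id;             /* critical */
    struct node_t *r;   /* critical */
    struct node_t *l;   /* critical */
    double score;       /* non-critical */
};

int main(void) {
    /* On a typical 64-bit ABI: id at 0, r at 8, l at 16, score at 24, and
     * sizeof(struct node_t) == 32. Critical and non-critical members thus
     * alternate every few bytes inside a single 2 KB DRAM row, far below
     * the approximation granularity. */
    printf("id: %zu, r: %zu, l: %zu, score: %zu, size: %zu\n",
           offsetof(struct node_t, id), offsetof(struct node_t, r),
           offsetof(struct node_t, l),  offsetof(struct node_t, score),
           sizeof(struct node_t));
    return 0;
}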

The granularity gap problem has been overlooked by the research community because it is not relevant to applications that have large chunks of non-critical data. For example, for deep learning applications, the non-critical data are matrices storing the weights of a network, whose sizes range from a few kilobytes to hundreds of megabytes. In this case, we can store entire matrices in approximate memory and the approximation granularity is not an issue.

The goal of this paper is to prove the significance of the granularity gap problem with quantitative evidence. First, we show that there are many applications that potentially suffer from this problem. Second, and more importantly, we show that avoiding this problem with a known technique has a negative performance impact that is large enough to almost cancel the benefit of approximate memory.

4. Source Code Analysis

To show that many real applications can potentially suffer from the granularity gap problem, we analyze the source code of widely used benchmarks in this section.

4.1. Analysis Methodology

For a given application, we determine whether the data structure that can benefit from approximate memory has critical and non-critical data interleaved at a granularity smaller than the approximation granularity. Because approximate memory is the most effective when an application's data that incur many cache misses are stored on it, we focus our analysis on the data structure that incurs the largest number of cache misses within an application. We refer to such a data structure as the most cache-unfriendly data structure. After finding such a data structure, we analyze it to estimate whether the application potentially suffers from the granularity gap problem.

To find the most cache-unfriendly data structure of an application, we first measure the number of cache misses per instruction using Precise Event Based Sampling (PEBS) on Intel CPUs. PEBS is an enhancement of normal performance counters that uses dedicated hardware for sampling to reduce the skid between the time an event (e.g., a cache miss) occurs and the time it is recorded (Bakhvalov, 2018; Weaver, 2016). The small skid enables pinpointing which instruction in an application binary causes many hardware events. We execute a benchmark with its sample dataset using linux perf, and the actual command line is ‘perf record -e r20D1:pp -- benchmark’. The parameter r20D1:pp specifies a performance event whose event number is 0xD1 and whose umask value is 0x20, which “counts retired load instructions with at least one uop that missed in the L3 cache” (described in Table 19.3 of (Intel, 2018)). Note that the “L3 cache” is the last level in the cache hierarchy of the CPU we use (described in Table 2). The parameter benchmark is replaced by an actual command line to execute each benchmark.

Figure 6. Sample output of perf report: It shows an instruction, the offset of the instruction from the head of the binary, and the percentage of cache misses that it incurs (if any), from right to left. The C code, if ( arc->ident > BASIC ), corresponds to the assembly code below it.

After measuring the number of per-instruction cache misses, we find the data structure accessed by this instruction, which is the most cache-unfriendly data structure of this application. Due to the lack of off-the-shelf tools to disassemble an arbitrary binary into C/C++ source code, we rely on human knowledge and labor to do this. Figure 6 shows an output of perf report, executed after a measurement by perf record. The measurement is done for a benchmark called mcf in SPEC CPU 2006 and the details of the benchmarks we analyze are described in Section 4.2. Each line shows, from right to left, an instruction, the offset of the instruction from the head of the binary, and the percentage of cache misses it incurs (if any) against the overall cache misses in the measurement. The C code, if( arc->ident > BASIC ), corresponds to the lines of assembly code below it. From the figure, we can see that the mov instruction at offset 0x3200 incurs 48.58 % of all cache misses of this application. We can confirm that this instruction incurs the largest number of cache misses by checking that no other instruction incurs more than this percentage.

To find the data structure that the mov instruction accesses given Figure 6, we analyze the assembly code with the help of the debug information and the source code. In Figure 6, we can see a typical pattern of assembly code where a jump instruction (jle) follows a compare instruction (test, which is commonly used to compare a register with 0). Therefore, we can guess that the mov instruction copies a value to be compared with 0. From the C code corresponding to this block of assembly, we can see that the value compared with 0 should be arc->ident. This is confirmed by the fact that BASIC is a compile-time constant whose value is 0, and that the offset of the ident member inside arc is 0x18. In conclusion, the mov instruction accesses the variable named arc, whose data type is struct arc. Note that the same methodology is applicable to a template function in C++ as well because there is an independent piece of assembly code for each instantiation of it (i.e., no type ambiguity exists in assembly).

4.2. Experimental Setup

SPEC CPU 2006
Name Domain Cache Miss Rate
milc quantum simulation 82.6%
sjeng game AI (chess) 74.5%
libquantum quantum computing 54.6%
lbm fluid dynamics 49.2%
omnetpp discrete event simulation 47.9%
soplex linear programming 41.2%
gobmk game AI (go) 38.4%
gcc c compiler 36.8%
mcf optimization 33.7%
dealII finite element analysis 33.6%
namd molecular dynamics 21.0%
SPEC CPU 2017
Name Domain Cache Miss Rate
deepsjeng_r game AI (chess) 77.5 %
nab_r molecular modeling 64.9 %
omnetpp_r discrete event simulation 56.1 %
namd_r molecular dynamics 50.4 %
lbm_r fluid dynamics 48.8 %
x264_r video encoding 47.3 %
mcf_r optimization 43.5 %
gcc_r c compiler 36.6 %
blender_r image processing 35.0 %
xz_r data compression 31.6 %
perlbench_r perl interpreter 21.4 %
Table 1. Analyzed Benchmarks
CPU Intel Xeon Silver 4108 (Skylake, 8 cores)
Memory DDR4-2666, 96 GB (8 GB × 12)
LLC 11 MB (shared across all the cores)
OS Debian GNU/Linux 10 (kernel: 4.19.0-6-amd64)
gcc/g++ 8.3.0 (Debian 8.3.0-6)
Table 2. Experiment Environment

Table 1 describes the benchmarks we analyze (although deepsjeng is named ‘deep’, it uses a classical tree search algorithm). Each line shows a benchmark's name, its domain, and the cache miss rate measured by the linux perf tool. From both SPEC CPU 2006 and 2017, we analyze benchmarks whose cache miss rates are more than 20 %. We exclude the others because approximate memory is not beneficial for CPU intensive benchmarks with low cache miss rates. We also exclude benchmarks written in Fortran because the memory layout conversion technique we discuss in Section 5 has mainly been researched for programs written in C. We include ones written in C++ because the differences between C++ and C (the existence of classes, templates, and some new syntax) do not affect the applicability of the memory layout conversion technique.

Table 2 shows the machine we use to execute the benchmarks. For the input data set, we use the largest ones among those provided; that is, we use the one named ref for SPEC CPU 2006 and the one named refrate for SPEC CPU 2017. The LLC miss rate is measured using the linux perf tool with the following command: perf stat -e cache-misses,cache-references -- benchmark, where benchmark is replaced by an actual command for each benchmark.

4.3. Results

SPEC CPU 2006
Benchmark Data Type S P F I
milc complex[]
sjeng QTType[]
libquantum quantum_reg_node_struct[]
lbm double[]
omnetpp cChannel
soplex Element[]
gobmk hashnode_t[]
gcc rtx_def
mcf arc[]
dealII double[]
namd CompAtom[]
SPEC CPU 2017
Benchmark Data Type S P F I
deepsjeng_r ttentry_t[]
nab_r INT_T[]
omnetpp_r sVector
namd_r CompAtom[]
lbm_r double[]
x264_r uint8_t[]
mcf_r arc[]
gcc_r -
blender_r VlakRen[]
xz_r uint8_t[], uint32_t[]
perlbench_r char[]
Table 3. Results of Source Code Analysis (S: is a C struct or a C++ class, P: has a pointer, F: has a floating point number, I: has an integer)

Table 3 shows the analysis results. Each row shows a benchmark, its most cache-unfriendly data structure, and flags that represent the kinds of members that the data structure contains:

  • S: the data is either a C struct or a C++ class.

  • P: the data structure contains a pointer.

  • F: the data structure contains a floating point number.

  • I: the data structure contains an integer.

The data type column is suffixed with [] if the data is managed as an array of that data type. We regard any type compatible with an integer (e.g., char, long) as an integer. If a class inherits from other classes, we include the members of the parent classes as well because an instance of a child class in memory contains all members of the parent classes. We exclude static members and member functions because they are not stored in the memory region allocated for each instance. We do not show the result for gcc_r because its cache misses are scattered across many instructions. Two data types are shown for xz_r because two instructions incur almost the same number of cache misses. For all the benchmarks, the instruction that incurs the largest number of cache misses resides in their own code and not in any standard C/C++ library.

The results show that many applications potentially suffer from the granularity gap problem. The most cache-unfriendly data structure is either a C struct or a C++ class in 9 out of 11 benchmarks in SPEC CPU 2006 and 5 out of 11 benchmarks in SPEC CPU 2017. Although there are only two benchmarks (omnetpp_r and blender_r) that have both a pointer and a floating point number in their most cache-unfriendly data structures, this does not mean that these two are the only benchmarks that suffer from the granularity gap problem. For example, the data type arc in mcf and mcf_r contains a pointer and an integer named cost, which represents the cost of a graph edge. Our previous work (Akiyama, 2019) shows that even if some bit-flips are injected into the member cost, mcf can yield the same result as an error-free execution. Therefore, we conclude that these 14 applications “potentially” suffer from the granularity gap problem.

4.4. Drawbacks of the Methodology

Manual effort to find the data type accessed by a given instruction incurs a scalability issue and increases the chances of analysis errors. There are two error patterns stemming from the manual effort:

  1. Mis-identifying the variable in the source code that corresponds to a given memory access instruction

  2. Mis-identifying the type of data that is stored in the identified variable in source code

Pattern (1) can happen when the application binary has complex data/control flows, for example with multiple levels of indirection (e.g., a->b->c), or when the binary does not look similar to the source code due to compiler optimizations. Pattern (2) can happen when the declared type of a source variable and the type of the actual data stored in it are different (i.e., polymorphism). Developing compiler support to reduce the possibilities of these errors is future work.

Another concern for our analysis arises when a member variable of a C struct or a C++ class is passed to a function by reference. For example, in Figure 7, the same function (f) is called by passing either &s1.v or &s2.v as its argument. Finding the data type that the memory region pointed to by fp belongs to requires an investigation of stack traces and points-to analysis (Steensgaard, 1996). Although it seems more natural for a function to take a pointer to a whole struct, such as ‘void g(struct S1 *sp)’, this pattern may appear in some cases, such as when a library function returns its result through a pointer. However, we did not encounter this case in any of the benchmarks in our experiment.

struct S1 {
  double v; // non-critical
  double vv; // non-critical
} s1;
struct S2 {
  double v; // non-critical
  int *p; // critical
} s2;
void f(double *fp) {  /* do something */ }
f(&s1.v); // (1): invoke f by passing s1.v by reference
f(&s2.v); // (2): invoke f by passing s2.v by reference
Figure 7. Calling the same function by passing members of different structs by reference. Identifying the data type that *fp belongs to requires stack traces and points-to analysis.

5. Memory Layout Conversion

This section discusses the applicability of a memory layout conversion technique to avoiding the granularity gap problem, and points out that it can degrade performance for some applications. We show in Section 6 that this performance overhead is large enough to almost cancel the benefit of approximate memory in some cases.

5.1. AoS to SoA Conversion

struct {
  double x;
  double y;
} points[N];
// calculate the center
double center_x = 0, center_y = 0;
for(i = 0; i<N; i++) {
  center_x += points[i].x / N;
  center_y += points[i].y / N;
}
Figure 8. Example of an array of structures. Instances of the data structure {x, y} constitute an array of structures named “points”.
struct {
  double x[N];
  double y[N];
} points;
// calculate the center
double center_x = 0, center_y = 0;
for(i = 0; i<N; i++) {
  center_x += points.x[i] / N;
  center_y += points.y[i] / N;
}
Figure 9. The AoS to SoA conversion applied to the code in Figure 8. Each member of the struct, x and y, is allocated a distinct array.

An array of structures (AoS) can be converted into a structure of arrays (SoA) without changing the results of an application. Given an array of C struct instances, this technique converts the memory layout of the application so that each member of the C struct is stored in a distinct array. Figure 8 and Figure 9 show an example of this conversion done explicitly by hand. The code in Figure 8 calculates the center of points (in some sense) that are stored in memory as an array of structures. Figure 9 shows the converted version of the code that does the same calculation. This version manages each member of the data structure, x and y, as a distinct array. Note that the code that accesses the data is also changed accordingly in Figure 9 (e.g., from points[i].x to points.x[i]).

Although the AoS to SoA conversion seems very difficult at first glance for realistic applications, existing research has proven it to be possible at compile time (Zhao et al., 2007; Curial et al., 2008; Lin and Yew, 2010). The main difficulty stems from the fact that a pointer can hold an arbitrary address in C/C++. For example, in Figure 8, if another pointer points somewhere inside the memory range pointed to by points, it is not easy to apply the conversion without changing the application's output. However, points-to analysis (Steensgaard, 1996) solves this problem in time almost linear in the source code length.

Besides the technical difficulties that have been tackled by many researchers (e.g., how to apply it dynamically to programs without the source code, how to ensure safety in weakly typed languages), a fundamental limitation of the AoS to SoA conversion is that there is no method to precisely predict its effect on performance. Petrank et al. (Petrank and Rawitz, 2002) show that predicting the number of cache misses that a given data layout generates for an arbitrary memory access pattern is NP-hard with respect to the number of data objects. This means that one must either do exhaustive experiments for the memory access patterns of interest or use heuristics to informally estimate the performance implication. This limitation leads us to do the former to evaluate the performance overhead in a later section.

5.2. Pros: Mitigate the Granularity Gap Problem

The AoS to SoA conversion enables using approximate memory even when critical and non-critical data are interleaved by avoiding the granularity gap problem. Because each member of the converted data structure is stored in a distinct array, it can be mapped to a designated DRAM row that has the appropriate timing parameter for the criticality of that member.

Figure 10 depicts how we can selectively store the non-critical data of the code in Figure 5 in approximate memory by the AoS to SoA conversion. Gray boxes in the figure show critical data and white boxes show non-critical data. In the original code that manages the data as an AoS, it is not possible to selectively protect the critical data while accelerating accesses to the non-critical data because of the granularity gap problem (Figure 10 (a)). In the converted code that manages the data as a SoA, the non-critical data (score) constitutes a distinct array and can be mapped directly to approximate memory, while the critical data (id, r, l) can be mapped to normal memory (Figure 10 (b)).

Figure 10. The change of memory layout when the AoS to SoA conversion is applied to the code in Figure 5.
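For illustration, a minimal sketch of the converted allocation for the code in Figure 5 is shown below, reusing the hypothetical approx_malloc() from the sketch in Section 3.1. For simplicity the child pointers are represented as indices into the arrays; none of this reflects an existing API.

#include <stdlib.h>
#include <stddef.h>

void *approx_malloc(size_t size);   /* hypothetical allocator, see the sketch in Section 3.1 */

/* SoA layout for the data in Figure 5: every member becomes a distinct
 * array, so each array can be mapped to DRAM rows whose timing parameters
 * match its criticality. */
struct nodes_soa {
    int    *id;     /* critical     -> normal rows                     */
    size_t *r;      /* critical     -> normal rows (right-child index) */
    size_t *l;      /* critical     -> normal rows (left-child index)  */
    double *score;  /* non-critical -> approximate rows                */
};

struct nodes_soa alloc_nodes(size_t n) {
    struct nodes_soa nodes;
    nodes.id    = malloc(n * sizeof(int));
    nodes.r     = malloc(n * sizeof(size_t));
    nodes.l     = malloc(n * sizeof(size_t));
    nodes.score = approx_malloc(n * sizeof(double)); /* only score is approximate */
    return nodes;
}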

5.3. Cons: Negative Impact on Performance

The disadvantage of the AoS to SoA conversion is that it can degrade performance due to an increased number of cache misses. In the code in Figure 5, it is highly likely that all the members of the same struct instance (that is, for any i, nodes[i].id, nodes[i].r, nodes[i].l, and nodes[i].score) share the same cache line. Thus, accessing two or more members of the same struct instance closely in time incurs at most one cache miss. However, if we apply the AoS to SoA conversion to the same code, members that belong to the same struct instance in the original code no longer share the same cache line. This might increase the number of cache misses and degrade performance, depending on the memory access pattern to the converted data.

// points to the first node
struct node_t *node = malloc(sizeof(struct node_t) * 1000);
while(/* until some condition is met */) {
  // do something, then traverse the next node
  if (node->score > threshold)
    node = node->l;
  else
    node = node->r;
}
Figure 11. A sample code accessing an AoS. The definitions of node_t is the same as Figure 5. Applying the AoS to SoA conversion to it increases the number of cache misses.

For example, the code in Figure 11 decides which child (either right or left) of the current node to traverse next depending on the score of the current node, and its memory access pattern is unpredictable. When the AoS to SoA conversion is applied to this code, node->score is stored in a different cache line from node->l and node->r. Because the memory access pattern is unpredictable, an access to a new cache line incurs a cache miss every single time if the total amount of the data is large enough compared to the cache size. Therefore, applying the AoS to SoA conversion to this code increases the number of cache misses from 1 miss per while(...) iteration to 2 misses per iteration.
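For reference, the sketch below shows roughly what the traversal in Figure 11 looks like after the conversion, assuming the index-based nodes_soa layout sketched in Section 5.2 (the termination condition is a placeholder). Each iteration now touches two unrelated cache lines, one in score[] and one in l[] or r[].

/* SoA version of the traversal in Figure 11 (sketch; assumes the
 * struct nodes_soa definition from the sketch in Section 5.2). */
size_t traverse(struct nodes_soa nodes, double threshold, size_t start) {
    size_t cur = start;
    for (int step = 0; step < 1000; step++) { /* placeholder for "until some condition is met" */
        if (nodes.score[cur] > threshold)     /* cache miss #1: score array    */
            cur = nodes.l[cur];               /* cache miss #2: l (or r) array */
        else
            cur = nodes.r[cur];
    }
    return cur;
}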

6. Evaluation of Performance Impact

The negative performance impact of the AoS to SoA conversion (detailed in Section 5.3) is a serious concern if it cancels or even outweighs the benefit of approximate memory. However, to the best of our knowledge, there is no study on how much the AoS to SoA conversion slows down applications, because research has focused on how to speed them up. This section introduces a new methodology to quantitatively analyze the slowdown caused by the AoS to SoA conversion, and shows that it is large enough to almost cancel the benefit of approximate memory in the worst case.

6.1. Overview

In order to quantitatively analyze the slowdown and show its significance, we propose a method to estimate the effect of the memory layout changes incurred by the AoS to SoA conversion. The main idea is to use a cycle accurate CPU simulator to run applications and reproduce the memory layout that the AoS to SoA conversion would generate inside the simulator. The use of a cycle accurate simulator has two advantages (more details are described in Section 6.3):

  1. It can quantitatively tell how much slowdown an application experiences due to the memory layout conversion. This is important because the significance of the granularity gap problem is determined by how large the slowdown is relative to the benefit of approximate memory.

  2. It is more robust than actually applying the conversion because it does not require complex source code analysis.

6.2. Pseudo Conversion by CPU Simulator

Figure 12. Simulation framework to estimate the negative performance impact of the AoS to SoA conversion.

Figure 12 shows how our simulator estimates the performance impact by reproducing the memory layout changes:

  1. The source code of the target application is annotated so that it prints the starting addresses and the sizes of the memory regions that contain the most cache-unfriendly data structure. For benchmarks written in C, this is done by finding malloc calls whose return values are cast to the pointer type of that data structure. For benchmarks written in C++ that use the standard template library (STL), this is done by replacing the memory allocator of the STL.

  2. The target application is executed on a vanilla simulator to gain the starting addresses and the sizes printed by the annotations added in step (1).

  3. The remap info that decides which members of the struct are stored in distinct arrays is defined. The remap info contains the size of each struct member and a boolean value that represents if it is stored into a distinct array (we say that a member is remapped if this value is true).

  4. A simulation is started on our modified simulator with information obtained in step (2) and step (3) (the starting addresses and sizes of memory regions and the remap info) passed as inputs.

  5. While in a simulation, the target addresses of memory access instructions are investigated. If the target address points to a remapped member, it is converted to reproduce the memory layout that the AoS to SoA conversion would generate.

The address conversion is done when the front-end of the CPU inserts requests into the load store queue. The component that converts addresses is illustrated as the address remapper in Figure 12. This is because the border between the front-end and the load store queue is a place right after an accessed address is determined and right before it is used. Inside the front-end, the target address of a memory access instruction might not be ready, for example when the register that contains the address is an operand of a not-yet-committed instruction. Inside the load store queue, the address of a request is used to access the caches first before accessing the main memory. Therefore, we convert an address before it is referenced in the load store queue to maintain the cache consistency.

Three requests are passed from the front-end to the address remapper in Figure 12:

  1. 8-byte read request to 0x40000.

  2. 8-byte write request to 0x40008.

  3. 4-byte read request to 0x40010.

From the starting address of the memory region that contains the most cache-unfriendly data structure and the remap info, the address remapper can find that the first request reads the member p. Because the remap info specifies that p is not remapped, the request is passed as-is to the load store queue. The second request accesses the member v. Because the remap info specifies that it is remapped, its target address is converted into an unused address (0xffff0000 in the figure). The third request accesses the member id. Although it is not remapped, its address is shifted by 8 bytes because the preceding member v is remapped “away”. Thus, the target address is converted to 0x40008. As a result, the memory layout from the application's point of view is converted into the one shown in the figure. The member v constitutes a distinct array and the other members are packed as if there were no v in between.
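The address conversion can be summarized by the simplified sketch below. This is our reading of the mechanism rather than the actual gem5 modification: the member sizes and remap flags mirror the remap info of Figure 12, struct padding is ignored, and the array length and the unused base address are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the address remapper for the example in Figure 12. The tracked
 * AoS region starts at 0x40000 and holds NUM_ELEMS elements of a struct
 * { p: 8 bytes, v: 8 bytes, id: 4 bytes }; only v is remapped. 0xffff0000
 * is an arbitrary unused address range that holds the distinct arrays. */
#define NUM_MEMBERS 3
#define NUM_ELEMS   1000   /* assumed array length */

static const size_t   member_size[NUM_MEMBERS] = { 8, 8, 4 };   /* p, v, id */
static const bool     is_remapped[NUM_MEMBERS] = { false, true, false };
static const uint64_t region_start = 0x40000;
static const uint64_t unused_base  = 0xffff0000;

uint64_t remap_address(uint64_t addr) {
    size_t struct_size = 0, packed_size = 0;
    for (int m = 0; m < NUM_MEMBERS; m++) {
        struct_size += member_size[m];
        if (!is_remapped[m]) packed_size += member_size[m];
    }
    if (addr < region_start || addr >= region_start + NUM_ELEMS * struct_size)
        return addr;                              /* not in the tracked region */

    uint64_t off   = addr - region_start;
    uint64_t index = off / struct_size;           /* which array element        */
    uint64_t in_st = off % struct_size;           /* offset inside that element */

    uint64_t member_off = 0, packed_off = 0, distinct_off = 0;
    for (int m = 0; m < NUM_MEMBERS; m++) {
        if (in_st < member_off + member_size[m]) {
            uint64_t in_member = in_st - member_off;
            if (is_remapped[m])   /* member moves to its own distinct array */
                return unused_base + distinct_off + index * member_size[m] + in_member;
            else                  /* member stays; earlier remapped members are packed away */
                return region_start + index * packed_size + packed_off + in_member;
        }
        member_off += member_size[m];
        if (is_remapped[m]) distinct_off += member_size[m] * NUM_ELEMS;
        else                packed_off   += member_size[m];
    }
    return addr;   /* unreachable for in-region addresses */
}

With these parameters, remap_address(0x40000) returns 0x40000, remap_address(0x40008) returns 0xffff0000, and remap_address(0x40010) returns 0x40008, matching the three requests described above.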

6.3. Discussions

With regard to the effect of memory layout conversion, there are efforts to estimate how it speeds up applications (Zhao et al., 2007; Miucin and Fedorova, 2018; Ye et al., 2019) by investigating their memory access traces without applying the memory layout changes themselves. They measure the access frequencies of struct members and the access affinities between them from a memory trace of the unmodified source code. Given these metrics, they suggest which members should be placed closer in memory and which members should be separated into a different memory region. However, we cannot directly leverage this method for our purpose because they only “suggest” better memory layouts but do not quantitatively estimate the performance impact of the suggested layouts. A difficulty when it is used for our purpose is that memory layout conversion has two effects in opposite directions: (1) slowdown caused by an increased number of cache misses due to the separation of members with strong affinities, and (2) speedup caused by the decreased size of the members that are not separated into distinct arrays. For example, the size of the arc data structure in mcf is 72 bytes (9 members × 8 bytes/member) and separating one member as a distinct array makes the size of the rest 64 bytes, which fits within a cache line. In fact, we tried our best but failed to predict the results in Section 6.5 using the same access affinity metric as existing work (Miucin and Fedorova, 2018; Ye et al., 2019) (we omit the details for brevity).

A challenge in actually applying the AoS to SoA conversion at the compiler level stems from the fact that pointers can contain arbitrary addresses in C/C++ and the values held by pointers cannot be decided by static analysis. Due to this, although it is theoretically possible with points-to analysis (Steensgaard, 1996), it is not easy to robustly implement the AoS to SoA conversion at the compiler level. Some old gcc versions supported structure reordering, which reorders the members of a C struct and requires the same type of analysis. However, this feature was removed because it “did not always work correctly” (Free Software Foundation, Inc, 2019). In contrast, because our method converts memory addresses inside a simulator at runtime, there is no difficulty in finding the address that a pointer contains.

A disadvantage of our method is that a cycle accurate simulation is needed for every single conversion pattern. This is not always feasible because the number of memory layout conversion patterns increases exponentially with the number of members in the most cache-unfriendly data structure. On the other hand, if we could somehow estimate the slowdown only from access frequencies and affinities, we could estimate the slowdown of all conversion patterns at once, because the access frequencies and affinities can be obtained from one execution of an unmodified application.

6.4. Experimental Setup

ISA x86_64
Frequency 3 GHz
Issue Width 8
Reorder Buffer 192 entries
L1 cache 32 KB, 2 way, 32 MSHRs, 2 cycles/miss
L2 cache 2 MB, 8 way, 32 MSHRs, 20 cycles/miss
Mem Ctrl Latency 75 ns
Table 4. Simulated Environment

Table 4 shows the simulated environment. The “Mem Ctrl Latency” shows the length of time between the point when the CPU sends a request to the memory controller and the point when it receives the response. The memory access latency from the software's point of view additionally contains the time it takes to miss the caches, which is 7.3 ns (= (2 + 20) cycles × 1/3 ns per cycle at 3 GHz), making a total of 82.3 ns. We use version 20.0.0.0 (the latest version as of May 2020) of gem5 and its SE mode. Besides simulating instructions, this mode emulates system calls by replacing them with calls to normal functions defined in the gem5 source code. It allows easy simulation because there is no need to run an entire OS on the simulator, and there should be no noticeable impact on the results as non OS-intensive workloads have few system call invocations. The benchmark binaries are compiled by gcc 8.3.0 (Debian 8.3.0-6).

Figure 13. Evaluation result for mcf_r
Figure 14. Evaluation result for deepsjeng_r
Figure 15. Evaluation result for namd_r

For the evaluation, we first skip the initial phase of each benchmark using a simulation mode that only emulates a CPU using the AtomicSimpleCPU model. After the initialization phase, we simulate a fixed number of ticks using a mode that simulates an out-of-order CPU using the DerivO3CPU model. A tick is the notion of time in gem5 and it is 1/1000 ns in our configuration. We simulate 200 billion ticks after the initialization phase, which is equal to 0.2 seconds in the simulated world. The initialization phase of each benchmark is determined by investigating the source code.

We evaluate three benchmarks from SPEC CPU 2017, namely mcf_r, deepsjeng_r, and namd_r. For each benchmark, we test every possible memory layout conversion pattern and compare the performance. Let n be the number of members in the most cache-unfriendly data structure of each benchmark; we test all 2^(n-1) cases of remapping. Note that remapping a given set of members (and not remapping the rest) is equivalent to not remapping those members (and remapping the rest) with regard to the memory layout. We exclude blender_r and omnetpp_r although their most cache-unfriendly data structures are C++ classes. This is because (1) the source code of blender_r does not have a clear separation between the initialization phase and the main computation, and (2) omnetpp_r has 21 members in its most cache-unfriendly data structure (sVector) and it is not possible to test 2^20 possibilities. The latter stems from a disadvantage of our method that we must conduct time-consuming simulations for different memory layout conversion patterns, as we describe in Section 6.1.
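The counting argument behind the 2^(n-1) patterns can be illustrated with the small sketch below, which enumerates one representative bitmask per complementary pair by fixing the most significant member as “not remapped”. It is only an illustration of the counting, not the tooling used for the experiments.

#include <stdio.h>

/* Enumerate distinct remapping patterns for a struct with n members.
 * Bit i of a pattern means "member i is separated into a distinct array".
 * A pattern and its bitwise complement yield the same memory layout, so
 * fixing the most significant member's bit to 0 leaves 2^(n-1) patterns. */
int main(void) {
    int n = 9;                       /* e.g., struct arc in mcf_r has 9 members */
    unsigned int patterns = 1u << (n - 1);
    for (unsigned int mask = 0; mask < patterns; mask++) {
        printf("pattern %3u: ", mask);
        for (int i = 0; i < n; i++)
            putchar((mask >> i) & 1 ? 'R' : '-');   /* R = remapped */
        putchar('\n');
    }
    printf("total: %u patterns\n", patterns);       /* 256 for n = 9 */
    return 0;
}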

6.5. Results

Figure 13, Figure 14, and Figure 15 show the evaluation results of performance degradation for mcf_r, deepsjeng_r, and namd_r. Each bar corresponds to a memory layout conversion pattern and each graph has 2^(n-1) bars, where n is the number of members of the most cache-unfriendly data structure. The right-most bar shows the average of all patterns. The values show the number of executed micro operations during simulation, normalized to the value when no memory layout conversion is applied. The bars are ordered by their values. Because we simulate a fixed number of ticks, lower bars indicate a larger negative performance impact.

mcf_r:

The most cache-unfriendly data structure (arc) has 9 members, so there are 2^8 = 256 memory layout conversion patterns. Among them, 229 patterns yield worse performance than the no-conversion case (their values are smaller than 1). The lowest performance is observed when the first three members are remapped to constitute distinct arrays, and its performance is 8.13 % lower than the no-conversion case. The 9 members are all 8 bytes in size, so this pattern makes the size of the non-remapped part 48 bytes. The average negative performance impact is 3.81 %.

deepsjeng_r:

The most cache-unfriendly data structure (ttentry_t) “wraps” an array of length 4, and each element of the array is another C struct with 5 members (see Figure 16 for an illustration). We apply the same remapping policy to the same members of the inner data structure, resulting in 2^4 = 16 memory layout conversion patterns (if ttentry_t.array[0].d1 is remapped, ttentry_t.array[i].d1 is remapped for any i). The lowest performance is observed when the first and the fourth members (d1 and d4 in Figure 16) are remapped, resulting in a 2.90 % slowdown.

namd_r:

The most cache-unfriendly data structure (CompAtom) has 7 members, so there are 2^6 = 64 memory layout conversion patterns. The negative performance impact on this application is negligible (0.1 % in the worst case). On average, the performance is improved by 0.4 %.

struct inner_struct {
  type1 d1;
  type2 d2;
  type3 d3;
  type4 d4;
  type5 d5;
};
struct ttentry_t {
  struct inner_struct array[4];
};
Figure 16. The most cache-unfriendly data structure (ttentry_t) of deepsjeng_r. The details of inner_struct are not shown because SPEC CPU is non-free software.

The negative performance impact of the memory layout conversion to avoid the granularity gap problem is not negligible compared to the benefit of approximate memory. For example, Kim et al. (Kim et al., 2018) report that the average speedup of SPEC CPU 2006 benchmarks when the timing constraints are violated is around 4 - 5 % (Figure 8 of (Kim et al., 2018)). Note that their system, Solar-DRAM, does not reduce the latency to the extent that bit-flips are visible to the applications. Even if we assume that the performance gain of approximate memory (allowing bit-flips to be visible to the application) is twice as large as that of Solar-DRAM, it is almost canceled in the worst case by the performance overhead due to the granularity gap problem (8 - 10 % speedup vs. 8.13 % slowdown). Another study by Tovletoglou et al. (Tovletoglou et al., 2020) reports that their system can save up to around 12.5 % of overall (CPU + memory) energy consumption for mcf in SPEC CPU 2006 (the bar labeled “429” in Figure 7 (c) of (Tovletoglou et al., 2020)) by approximate memory (prolonging tREF). Prolonging tREF reduces not only the memory access latency but also the energy consumption of DRAM (Das et al., 2018), which is another benefit of approximate memory besides performance. If we assume that the negative performance impact of memory layout conversion on mcf is similar to that on mcf_r (they are quite similar and their most cache-unfriendly data structures are the same), the 12.5 % gain is reduced by a non-negligible amount, because we have up to 8.13 % performance overhead from memory layout conversion. Therefore, we conclude that the granularity gap problem is a significant issue and the research community needs to work on solving it with low overhead to expand the benefit of approximate memory to a wider range of applications.

7. Related Work

To the best of our knowledge, we are the first to study the granularity gap problem. One reason is that it is not relevant when only large arrays of numbers, such as the weight matrices of a neural network, are stored in approximate memory. However, as we point out in this paper, it is a significant problem for many realistic applications. Esmaeilzadeh et al. (Esmaeilzadeh et al., 2012) briefly mention this problem, but they provide no further investigation.

Nguyen et al. (Nguyen et al., 2020) propose a method that partially mitigates the granularity gap problem. It transposes rows and columns of the data layout inside DRAM so that a chunk of data is stored across many rows that have different error rates. This enables protection of important bits (e.g., the sign bit of a floating point number) while aggressively approximating less important bits. The mechanism is effective for DNNs because they fetch an entire large weight matrix at once, so the number of memory accesses does not increase regardless of the data layout. However, it is not effective in general cases where memory is accessed at a smaller granularity.

Mapping data to memory regions with different error rates depending on its criticality has been proposed in many forms. Liu et al. (Liu et al., 2011) partition a DRAM bank into bins refreshed at the normal interval and bins refreshed at a prolonged interval. Each piece of data is stored in either type of bin, depending on the criticality specified by the programmer. Although they do not discuss the minimum bin size, it cannot be smaller than a DRAM row, as we discuss in this paper. Chen et al. (Chen et al., 2016) propose a memory controller that maps data to DRAM banks with different error rates depending on the criticality of the data. Because this method is bank-based, the approximation granularity is limited to the bank size: a typical DDR3/DDR4 DIMM module has 2 GB to 16 GB with either 8 or 16 banks, resulting in a typical bank size of 256 MB to 2 GB. Raha et al. (Raha et al., 2017) extend a previous work (Liu et al., 2011) by measuring each bin’s error rate at a given prolonged refresh interval and assigning bins to approximate data in ascending order of error rate. They realize a bin size (or “page size” in their terminology) of 1 KB by measuring the average error rate per 1 KB. Although this approach could be pushed further to realize smaller page sizes, it still cannot control error rates per byte, as it only measures the error rates and uses the appropriate pages.

Our previous work (Akiyama, 2020) investigates the source code of the SPEC CPU 2006 benchmarks and shows that many applications potentially suffer from the granularity gap problem. Besides extending that source code analysis with more data, the novelty of this paper is that we quantitatively analyze the slowdown caused by the granularity gap problem and present experimental results on several benchmarks to further clarify its significance.

8. Conclusion

In this paper, we investigated the granularity gap problem of approximate memory. The problem arises from the difference between the approximation granularity and the granularity of data criticality in realistic applications. Because the former is as large as a few kilobytes in realistic DRAM modules while the latter is often a few bytes, we cannot map the data of these applications directly onto approximate memory. We analyzed the source code of the SPEC CPU 2006 and 2017 benchmarks and found that 14 out of 22 benchmarks potentially suffer from this problem. In addition, we pointed out the applicability of a memory layout conversion technique to this problem and its negative performance impact. We proposed a simulation framework to quantitatively analyze this negative performance impact and found that performance can be degraded by up to 8.13 % in our tested cases. We conclude that the granularity gap problem is a significant issue and that it requires more attention from the research community.

Acknowledgements.
This work was supported by JST, ACT-I Grant Number JPMJPR18U1, Japan.

References

  • S. Akiyama (2019) A lightweight method to evaluate effect of approximate memory with hardware performance monitors. IEICE Transactions on Information and Systems E102-D (12), pp. 2354–2365. Cited by: §2.1, §2.1, §3.1, §4.3.
  • S. Akiyama (2020) Assessing impact of data partitioning for approximate memory in c/c++ code. In The 10th Workshop on Systems for Post-Moore Architectures (SPMA), pp. 1 – 7. Cited by: item 1, §7.
  • R. A. Ashraf, R. Gioiosa, G. Kestor, R. F. DeMara, C. Cher, and P. Bose (2015) Understanding the propagation of transient errors in hpc applications. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 72:1–72:12. Cited by: §3.1.
  • JEDEC Solid State Technology Association (2010) JEDEC STANDARD: DDR3 SDRAM standard. Note: JESD79-3F Cited by: §2.2.
  • JEDEC Solid State Technology Association (2013) JEDEC STANDARD: DDR4 SDRAM. Note: JESD79-B Cited by: footnote 2.
  • D. Bakhvalov (2018) Advanced profiling topics. PEBS and LBR. Note: https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR Cited by: §4.1.
  • N. Chandramoorthy, K. Swaminathan, M. Cochet, A. Paidimarri, S. Eldridge, R. V. Joshi, M. M. Ziegler, A. Buyuktosunoglu, and P. Bose (2019) Resilient low voltage accelerators for high energy efficiency. In International Symposium on High Performance Computer Architecture (HPCA), pp. 147–158. Cited by: §1, §1, §2.2, §3.1.
  • K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhimenko, S. Khan, and O. Mutlu (2016) Understanding latency variation in modern dram chips: experimental characterization, analysis, and optimization. In International Conference on Measurement and Modeling of Computer Science (SIGMETRICS), pp. 323–336. Cited by: §1, §1, §2.1, §2.1, §2.2, §2.3, footnote 2.
  • Y. Chen, X. Yang, F. Qiao, J. Han, Q. Wei, and H. Yang (2016) A multi-accuracy level approximate memory architecture based on data significance analysis. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 385–390. Cited by: §7.
  • Z. Chen, G. Li, K. Pattabiraman, and N. DeBardeleben (2019) BinFI: an efficient fault injector for safety-critical machine learning systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 69:1 – 69:23. Cited by: §1.
  • J. Choi, W. Shin, J. Jang, J. Suh, Y. Kwon, Y. Moon, and L. Kim (2015) Multiple clone row DRAM: a low latency and area optimized DRAM. In International Symposium on Computer Architecture (ISCA), pp. 223–234. Cited by: §1.
  • S. Curial, P. Zhao, J. N. Amaral, Y. Gao, S. Cui, R. Silvera, and R. Archambault (2008) MPADS: memory-pooling-assisted data splitting. In International Symposium on Memory Management (ISMM), pp. 101–110. Cited by: §5.1.
  • A. Das, H. Hassan, and O. Mutlu (2018) VRL-dram: improving dram performance via variable refresh latency. In Design Automation Conference (DAC), pp. 1–6. Cited by: §2.2, §6.5.
  • H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger (2012) Architecture support for disciplined approximate programming. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 301–312. Cited by: §2.2, §7.
  • Free Software Foundation, Inc (2019) GCC 4.8 Release Series Changes, New Features, and Fixes. Note: https://gcc.gnu.org/gcc-4.8/changes.html Cited by: §6.3.
  • K. Givaki, B. Salami, R. Hojabr, S. M. R. Tayaranian, A. Khonsari, D. Rahmati, S. Gorgin, A. Cristal, and O. S. Unsal (2020) On the resilience of deep learning for reduced-voltage fpgas. In International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 110–117. Cited by: §1, §3.1.
  • Q. Guo, K. Strauss, L. Ceze, and H. S. Malvar (2016) High-density image storage using approximate memory cells. SIGARCH Comput. Archit. News 44 (2), pp. 413–426. Cited by: §2.2.
  • H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu (2016) ChargeCache: reducing DRAM latency by exploiting row access locality. In International Symposium on High Performance Computer Architecture (HPCA), pp. 581–593. Cited by: §1.
  • Intel (2018) Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide. Cited by: §4.1.
  • B. Jacob, S. Ng, and D. Wang (2007) Memory systems: cache, dram, disk. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Cited by: §2.4, §2.4.
  • J. S. Kim, M. Patel, H. Hassan, L. Orosa, and O. Mutlu (2019) D-RaNGe: using commodity dram devices to generate true random numbers with low latency and high throughput. In International Symposium on High Performance Computer Architecture (HPCA), pp. 582–595. Cited by: §2.3.
  • J. Kim, M. Patel, H. Hassan, and O. Mutlu (2018) Solar-DRAM: reducing dram access latency by exploiting the variation in local bitlines. In IEEE International Conference on Computer Design (ICCD), pp. 282–291. Cited by: §2.1, §6.5.
  • S. Koppula, L. Orosa, A. G. Yağlıkçı, R. Azizi, T. Shahroodi, K. Kanellopoulos, and O. Mutlu (2019) EDEN: enabling energy-efficient, high-performance deep neural network inference using approximate dram. In International Symposium on Microarchitecture (Micro), pp. 166–181. Cited by: §1, §1, §2.1, §2.1, §3.1.
  • D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu (2015) Adaptive-latency dram: optimizing dram timing for the common-case. In International Symposium on High Performance Computer Architecture (HPCA), pp. 489–501. Cited by: §2.1.
  • D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Seshadri, and O. Mutlu (2017) Design-induced latency variation in modern dram chips: characterization, analysis, and latency reduction mechanisms. Proceedings of the ACM on Measurement and Analysis of Computing Systems, pp. 1 – 36. Cited by: §2.1.
  • Y. Lee, H. Kim, S. Hong, and S. Kim (2017) Partial row activation for low-power dram system. In International Symposium on High Performance Computer Architecture (HPCA), pp. 217–228. Cited by: §1.
  • J. Lin and P. Yew (2010) A compiler framework for general memory layout optimizations targeting structures. In Workshop on Interaction between Compilers and Computer Architecture (INTERACT), pp. 8:1 – 8:8. Cited by: §5.1.
  • S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn (2011) Flikker: saving dram refresh-power through critical data partitioning. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 213–224. Cited by: §7.
  • A. Mahmoud, S. K. S. Hari, C. W. Fletcher, S. V. Adve, C. Sakr, N. Shanbhag, P. Molchanov, M. B. Sullivan, T. Tsai, and S. W. Keckler (2020) HarDNN: feature map vulnerability evaluation in cnns. In arXiv:2002.09786, pp. 1 – 14. Cited by: §1, §1, §3.1.
  • A. Mahmoud, R. Venkatagiri, K. Ahmed, S. Misailovic, D. Marinov, C. W. Fletcher, and S. V. Adve (2019) Minotaur: adapting software testing techniques for hardware errors. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 1087–1103. Cited by: §1, §3.1.
  • Micron (2018) 16Gb, 32Gb: x4, x8 3DS DDR4 SDRAM Description. Note: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr4/16gb_32gb_x4_x8_3ds_ddr4_sdram.pdf?rev=77c8db7a371 Cited by: §1, §2.4, footnote 3.
  • S. Miucin and A. Fedorova (2018) Data-driven spatial locality. In International Symposium on Memory Systems (MEMSYS), pp. 243–253. Cited by: §6.3.
  • D. T. Nguyen, N. H. Hung, H. Kim, and H. Lee (2020) An approximate memory architecture for energy saving in deep learning applications. IEEE Transactions on Circuits and Systems I: Regular Papers, pp. 1–14. Cited by: §1, §7.
  • B. Nie, L. Yang, A. Jog, and E. Smirni (2018) Fault site pruning for practical reliability analysis of GPGPU applications. In International Symposium on Microarchitecture (Micro), pp. 750 – 762. Cited by: §1, §3.1.
  • Y. Nishi and B. Magyari-Kope (2019) Advances in non-volatile memory and storage technology, 2nd edition. Woodhead Publishing. Cited by: §2.4.
  • E. Petrank and D. Rawitz (2002) The hardness of cache conscious data placement. In Symposium on Principles of Programming Languages (POPL), pp. 101–112. Cited by: §5.1.
  • A. Raha, S. Sutar, H. Jayakumar, and V. Raghunathan (2017) Quality configurable approximate dram. IEEE Transactions on Computers 66 (7), pp. 1172–1187. Cited by: §1, §7.
  • K. Rupp (2020) Microprocessor trend data. Note: https://github.com/karlrupp/microprocessor-trend-data/ Cited by: footnote 1.
  • Samsung Electronics Co., Ltd. (2017) 8Gb C-die DDR4 SDRAM x16. Note: https://www.samsung.com/semiconductor/global.semi/file/resource/2017/12/x16%20only_8G_C_DDR4_Samsung_Spec_Rev1.5_Apr.17.pdf Cited by: §2.4, footnote 4.
  • G. Stazi, L. Adani, A. Mastrandrea, M. Olivieri, and F. Menichelli (2018) Impact of approximate memory data allocation on a h.264 software video encoder. In High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, pp. 545–553. Cited by: §2.1.
  • B. Steensgaard (1996) Points-to analysis in almost linear time. In Symposium on Principles of Programming Languages (POPL), pp. 32–41. Cited by: §4.4, §5.1, §6.3.
  • K. Tovletoglou, L. Mukhanov, D. S. Nikolopoulos, and G. Karakonstantis (2020) HaRMony: heterogeneous-reliability memory and qos-aware energy management on virtualized servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 575–590. Cited by: §1, §2.1, §6.5.
  • Y. Wang, A. Tavakkol, L. Orosa, S. Ghose, N. M. Ghiasi, M. Patel, J. S. Kim, H. Hassan, M. Sadrosadati, and O. Mutlu (2018) Reducing DRAM latency via charge-level-aware look-ahead partial restoration. In IEEE/ACM International Symposium on Microarchitecture (Micro), pp. 298 – 311. Cited by: §1.
  • V. M. Weaver (2016) Advanced hardware profiling and sampling(pebs, ibs, etc.): creating a new papi sampling interface. Technical report Technical Report UMAINE-VMW-TR-PEBS-IBS-SAMPLING-2016-08, University of Maine. Cited by: §4.1.
  • J. Wei, A. Thomas, G. Li, and K. Pattabiraman (2014) Quantifying the accuracy of high-level fault injection techniques for hardware faults. In International Conference on Dependable Systems and Networks (DSN), pp. 375–382. Cited by: §3.1.
  • L. Yang and B. Murmann (2017) Approximate sram for energy-efficient, privacy-preserving convolutional neural networks. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 689–694. Cited by: §2.2.
  • L. Ye, M. Lis, and A. Fedorova (2019) A unifying abstraction for data structure splicing. In International Symposium on Memory Systems (MEMSYS), pp. 173–183. Cited by: §6.3.
  • X. Zhang, Y. Zhang, B. R. Childers, and J. Yang (2016) Restore truncation for performance improvement in future DRAM systems. In International Symposium on High Performance Computer Architecture (HPCA), pp. 543–554. Cited by: §1, §2.2.
  • P. Zhao, S. Cui, Y. Gao, R. Silvera, and J. N. Amaral (2007) Forma: a framework for safe automatic array reshaping. ACM Transactions on Programming Languages and Systems 30 (1), pp. 1 – 29. External Links: ISSN 0164-0925 Cited by: §5.1, §6.3.