Efficient Similarity-aware Compression to Reduce Bit-writes in Non-Volatile Main Memory for Image-based Applications

05/07/2019, by Zhangyu Chen, et al.

Image bitmaps have been widely used in in-memory applications, which consume lots of storage space and energy. Compared with legacy DRAM, non-volatile memories (NVMs) are suitable for bitmap storage due to their salient capacity and power-saving features. However, NVM writes suffer from higher latency and energy consumption than reads. Although compressing the data of write accesses to NVMs on-the-fly reduces the bit-writes in NVMs, existing precise or approximate compression schemes show limited performance improvements for bitmap data, due to the irregular data patterns and variance in the data. We observe that data containing bitmaps exhibit pixel-level similarity due to the analogous contents of adjacent pixels. By exploiting the pixel-level similarity, we propose SimCom, an efficient similarity-aware compression scheme in the hardware layer, to compress the data of each write access on-the-fly. The idea behind SimCom is to compress continuous similar words into pairs of base words with runs. With the aid of domain knowledge of images, SimCom adaptively selects an appropriate compression mode to achieve an efficient trade-off between image quality and memory performance. We implement SimCom on GEM5 with NVMain and evaluate the performance with real-world workloads. Our results demonstrate that SimCom reduces write latency by 33.0% and 34.8% and energy consumption by 28.3% and 29.0% over FPC and BDI, respectively, with a minor quality loss of 3%.




1. Introduction

Images have been widely used and stored in large-scale storage systems. Compared with encoded bits, image bitmaps (also called raster images) contain pixel-level information (e.g., image features (Lowe, 2004; Rublee et al., 2011)) used by various applications (e.g., image processing, computer vision, and machine learning). In order to preserve the pixel-level information for these applications, images need to be stored as bitmaps in the main memory (Zhao et al., 2017). However, the storage of bitmaps demands a large amount of memory and energy in DRAM. In contrast, Non-Volatile Memories (NVMs) are more suitable for bitmap storage due to their high density, DRAM-scale read latency, byte-addressability, and near-zero standby power (David et al., 2018). Using NVM-based main memory for image-based applications is a promising way to improve storage density and energy consumption, thus improving overall system performance.

NVMs offer high-density memory for image bitmaps, but the writes of bitmaps often cause performance degradation in NVM systems. NVMs, such as Phase Change Memory (PCM) and Resistive RAM (ReRAM), suffer from high write latency and non-negligible write energy consumption (Zhou et al., 2009; Lee et al., 2009; Li et al., 2013; Yue and Zhu, 2013; Zuo et al., 2018; Xu et al., 2017; Condit et al., 2009).

Workloads Ratio Workloads Ratio
jpeg 97.6% 2dconv 99.5%
sobel 99.3% debayer 25.3%
kmeans 27.2% histeq 99.1%
Table 1. The ratios of written data containing image bitmaps. (The granularity of data in a write access is equal to the cache block size, e.g., 64 bytes.)
Figure 1. The compression ratios of data writes containing image bitmaps using FPC and BDI.

In order to address the write-inefficiency problem, an intuitive solution is to compress images via software-layer coding algorithms (e.g., JPEG XR (Dufaux et al., 2009) and JPEG (Wallace, 1991)) before writing data into NVMs. Although software-layer image compression schemes can decrease image sizes, they incur high overheads due to their high complexity. Many image-based applications need to access and manipulate bitmaps at the pixel level and thus cannot operate on compressed images. For example, the kernel in the sobel algorithm (Yazdanbakhsh et al., 2017) reads and updates the pixels one by one. Software-layer compression/decompression before each access to images adds latency and decreases system performance. Moreover, the intermediate bitmaps (decompressed for processing in image-based applications) still cause many writes to NVMs. For these applications, software-layer compression can generate more writes (i.e., compressed images plus intermediate bitmaps) than simply storing bitmaps in NVMs. In the meantime, it is impractical to implement image compression algorithms in the hardware layer (e.g., memory controllers) due to their high complexity and the small data size (e.g., 64 B) of each write access.

Recent works propose data compression inside the NVM module controller to reduce the bit-writes for each write access on-the-fly (Hong et al., 2018; Palangappa and Mohanram, 2016; Pekhimenko et al., 2012; Dgien et al., 2014). These hardware-layer compression schemes partition the data into words. Partitioned words are compressed in the NVM module controller using general-purpose data patterns, e.g., frequent patterns in Frequent Pattern Compression (FPC) (Dgien et al., 2014) and base words with small deltas in Base Delta Immediate (BDI) (Pekhimenko et al., 2012). After compression, the compressed data are written to NVMs, efficiently reducing the bit-writes in NVMs. However, for write accesses of bitmaps, the partitioned words rarely match the frequent patterns designed for general applications or satisfy the narrow-value constraint of BDI, due to the large variance in the data, which results in high compression ratios (compressed data size relative to uncompressed data size). We evaluate the compression ratios of FPC and BDI using six image-based workloads (§5.1). The numbers in Table 1 denote the percentage of write accesses to NVMs containing bitmaps in each workload. The writes of bitmaps account for a large portion of NVM writes. However, as shown in Figure 1, the average compression ratios of FPC and BDI are 94.2% and 99.8%, respectively, which means most data writes of image bitmaps obtain poor compression performance and even become incompressible using precise compression schemes.

Recent research explored the approximate storage for images, since images tolerate minor inaccuracies (Guo et al., 2016; Zhao et al., 2017; Miguel et al., 2016). The approximate image storage proposed by Guo et al. (Guo et al., 2016) leverages the entropy and error correction requirement differences in the encoded bits of compressed images. However, the differences don’t exist in raw data (i.e., bitmaps). Recent works (Zhao et al., 2017; Miguel et al., 2016) exploit the inter-block similarity to provide approximate storage for bitmaps. However, searching for similar data in NVMs during each write access incurs extra latency and hardware overheads. Since a large portion of data to be written are approximable (as shown in Table 1), it is possible to improve the memory performance by approximately compressing the data on-the-fly before writing to NVMs. In order to efficiently reduce the bit-writes of bitmaps in NVM systems, there are two challenges for data compression.

Irregular Data Patterns. Data writes containing bitmaps are hard to match with the data patterns in existing compression schemes. Bitmaps consist of the bits of each pixel, and a typical pixel consists of 3 bytes (the pixel size can vary; see §2.1). Since the pixel size in common bitmaps (e.g., 3 B) differs from the word size in conventional compression schemes (e.g., 4 B), there is significant variance in the partitioned words. Besides, the value of each word depends on the contents of the bitmaps. Therefore, the partitioned words in conventional schemes show irregular data patterns, which lead to poor compression performance.
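To make the mismatch concrete, the following sketch (with made-up pixel values) partitions the same 12 bytes of near-identical RGB pixels at a conventional 4-byte word size and at the 3-byte pixel size:

```python
# Hypothetical pixel values for illustration: four near-identical RGB pixels.
pixels = [(120, 64, 200), (121, 64, 199), (120, 65, 200), (121, 63, 201)]
stream = bytes(b for p in pixels for b in p)  # 12 bytes of bitmap data

# Conventional 4-byte words: channel values drift across byte positions,
# so consecutive words look unrelated.
words_4b = [stream[i:i + 4] for i in range(0, len(stream), 4)]

# Pixel-size (3-byte) words: each word is one pixel, so consecutive
# words stay similar.
words_3b = [stream[i:i + 3] for i in range(0, len(stream), 3)]

print(words_4b)
print(words_3b)
```

Under this toy input, every 3-byte word differs from its neighbors by at most a couple of levels per channel, while the 4-byte words interleave different channels and show large byte-wise variance.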

Bitmap Format Variance. When multiple applications (or threads) are running on top of NVM systems with different bitmap formats, write accesses to NVMs contain different data layouts. In the meantime, the persistence order is determined by the cache replacement policy, which is different from the program order (Zuo et al., 2018; Shin et al., 2017). Due to the reordering, it’s challenging to determine the bitmap format for each write access. Data compression designed for one bitmap format may fail in others due to the significant changes in data patterns.

Bidirectional precision scaling (Ranjan et al., 2017a) partitions data using annotated word size and conducts approximate compression for approximable data. Specifically, it approximately truncates Most Significant Bits (MSB) and Least Significant Bits (LSB) of error-tolerant data within the accuracy constraint. However, the pixel value in bitmaps is often stored using the smallest data type, in which identical MSBs are usually unavailable in bitmaps. Moreover, indiscriminately truncating LSBs reduces the color depth and causes noticeable quality degradation.

To address the above two challenges, we propose SimCom, an efficient similarity-aware compression scheme inside the NVM module controller, to reduce the bit-writes of bitmaps into NVMs, thus improving the write performance and decreasing the energy consumption of NVMs for image-based applications. For the first challenge, we leverage the pixel-level similarity in data writes of bitmaps and only write a base word (the representative word for a group of continuous similar words) with a run (the number of words in the group) for each group of continuous similar words, which eliminates the writes of similar words in NVMs. For the second challenge, SimCom executes compression modes in parallel and adaptively selects an efficient compression mode without programmer annotations on image metadata.

While we use RGB color model to illustrate the compression for data writes of bitmaps, SimCom can be adapted to other color models that show the pixel-level similarity (e.g., YUV and YCbCr). In addition to images, SimCom also works for other error-tolerant data, as long as these data consist of data units of fixed size and similarity exists in adjacent units, e.g., videos and audio signals.

In SimCom, we make the following contributions:

  • Similarity-aware Compression. By leveraging the pixel-level similarity, we develop an efficient approximate data compression scheme in the hardware layer to reduce the bit-writes of image bitmaps in NVMs on-the-fly.

  • Adaptiveness. With the domain knowledge of bitmaps, we propose an adaptive scheme to perform approximate compression without prior knowledge about data formats. SimCom eliminates the annotations on the data types and widths of bitmaps.

  • System Implementation. We have implemented the prototype of SimCom in GEM5 with NVMain and conducted experiments with real-world workloads in various domains. Results show that, compared with the state-of-the-art FPC and BDI, SimCom achieves average write latency reductions of 33.0% and 34.8% and energy savings of 28.3% and 29.0%, respectively, with 3% quality loss.

2. Background and Motivation

2.1. Image Bitmap

Structure Organization. An image bitmap is a pixel storage structure containing the bits for each pixel color. The bits of a pixel color consist of primary colors. A channel in a bitmap is an image-size array of one primary color in each pixel. A typical bitmap consists of 3 channels (e.g., red, green, and blue). For each pixel, the number of bits per channel is 8. Some bitmaps contain an optional channel, alpha channel, to store transparency information (Porter and Duff, 1984; Duff, 2017). We use channel count (CC) to represent the number of channels in a bitmap and bits per channel (BPC) to denote the number of bits per channel.

Quality Metric. Root-mean-square error (RMSE) is an objective metric for the quality of an image, which accounts for the difference of each pixel from a baseline image. The RMSE of image I with respect to baseline image B is calculated using Equation 1:

RMSE(I, B) = sqrt( (1/n) * Σ_{i=1}^{n} (I_i − B_i)² )    (1)

where n denotes the number of pixels in each image and I_i, B_i denote the normalized values of the i-th pixels. The value of RMSE ranges from 0 to 1, and lower values indicate better quality. We use RMSE to measure the output quality of relaxed images like prior works (Ranjan et al., 2017a; Yazdanbakhsh et al., 2017).
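A minimal sketch of this metric (the function name and the flat per-channel input format are our own; values are normalized by the maximal channel value so the result lies in [0, 1]):

```python
import math

def rmse(image, baseline, v_max=255):
    """Root-mean-square error between two images, given as equally long
    flat sequences of per-channel values in [0, v_max]."""
    assert len(image) == len(baseline)
    err = sum(((a - b) / v_max) ** 2 for a, b in zip(image, baseline))
    return math.sqrt(err / len(image))
```

For identical images the metric is 0; for a maximally different pair (all-0 versus all-255 with v_max = 255) it is 1.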


2.2. Bit-write Reduction in NVMs

To address the high latency and energy consumption in write operations, bit-write reduction techniques are widely used in NVM-based main memory (Yang et al., 2007; Cho and Lee, 2009; Pekhimenko et al., 2012; Yue and Zhu, 2013; Dgien et al., 2014; Palangappa and Mohanram, 2016; Xu et al., 2018a; Guo et al., 2018; Palangappa and Mohanram, 2018). Related schemes include data encoding (Yang et al., 2007; Cho and Lee, 2009; Jacobvitz et al., 2013), data compression (Pekhimenko et al., 2012; Dgien et al., 2014; Guo et al., 2018), and their combinations (Palangappa and Mohanram, 2016; Xu et al., 2018a; Palangappa and Mohanram, 2018). Before writing data into NVMs, compression schemes decrease the size of data to be written by data compression. Compressed data are decompressed for read accesses. Data encoding schemes are used to reduce the bit flips in write operations. Encoding technologies can be leveraged to encode the compressed data for energy efficiency (Palangappa and Mohanram, 2016) and lifetime improvement (Xu et al., 2018a).

Figure 2. An example of leveraging pixel-level similarity to compress data writes.

2.3. Approximate Storage

Approximate storage leverages the error-tolerance of approximable data to slightly relax the accuracy constraints for improvement in performance, data density, lifetime, and energy efficiency. Approximable data are interpreted as the data tolerating minor inaccuracies, which are represented as image bitmaps in the context of this paper. For error-tolerant applications, typical approximation consists of three steps: identification of approximable data, approximate techniques, and quality control. Before execution, error-tolerant data should be separated from raw application data, which is accomplished by programmer annotations (Liu et al., 2011; Sampson et al., 2011, 2015; Miguel et al., 2015; Miguel et al., 2016; Ranjan et al., 2017b) and domain knowledge (Guo et al., 2016; Jevdjic et al., 2017). For error-tolerant data, traditional guarantees for accuracy in the storage systems are relaxed for gains in memory performance and efficiency. Existing approximate techniques include decreasing refresh rate (Liu et al., 2011) and lowering voltage (Esmaeilzadeh et al., 2012) in DRAM, using worn blocks and skipping program-and-verify iterations in Multi-Level Cell (MLC) PCM (Sampson et al., 2013), associating similar cache blocks with the same tag entry (Miguel et al., 2015; Miguel et al., 2016), and utilizing selective error correction code (Guo et al., 2016; Jevdjic et al., 2017). Given accuracy constraints, in order to obtain the results, we need to select appropriate approximation parameters (Miguel et al., 2015; Zhao et al., 2017) to achieve suitable trade-off between output quality and performance. The parameters can be inferred dynamically by monitoring the intermediate results (Baek and Chilimbi, 2010; Samadi et al., 2014), using the input features (Sui et al., 2016), and tuning with canary inputs (Laurenzano et al., 2016; Xu et al., 2018b).

2.4. Pixel-level Similarity

Pixel-level similarity is interpreted as the similarity of words in the data of a write access to NVMs. Instead of a fixed 4-byte word size, the data in SimCom are partitioned at pixel-level granularity (e.g., 3 bytes for the RGB format; more details are available in §3.3). In a bitmap, each pixel describes the color of a tiny point of the image. Hence, adjacent pixels tend to have similar contents. As shown in Figure 2, the contents of adjacent pixels A, B, C, and D are similar. For the storage of an image bitmap, the contents are usually mapped to a continuous region in memory and have continuous addresses in the address space. When a write access of a bitmap is issued to the NVM module and we partition the data at the boundaries of pixels, the partitioned words are likely to be similar due to the analogous contents of adjacent pixels. This paper proposes to leverage the pixel-level similarity in data for approximate compression, thus reducing the data size and improving the memory performance.

Figure 3. The ratios of continuous similar words in approximable data with different error thresholds.

We have conducted experiments to verify the pixel-level similarity in write accesses to NVMs by recording continuous similar words in approximable data, i.e., data containing image bitmaps. The similarity metric is described in §3.2. Approximable data are partitioned at the pixel boundaries (it is also possible not to partition at the boundaries; more details in §3.3). In order to present a conservative estimation of pixel-level similarity in data writes, we filter out the bytes that do not form complete pixels. Error thresholds, which denote approximation degrees, range from 0% (precise) to 100% (maximal approximation). Continuous similar words are interpreted as a group of sequential words in which any two words are similar. The details of the experimental settings are described in §5.1. Figure 3 shows the percentage of continuous similar words in approximable data with different error thresholds. As we increase the error threshold, the ratio of continuous similar words increases up to 82.8% on average.

The pixel-level similarity is common in bitmaps due to two reasons: (1) The changes in the contents of images are generally slight. For example, most backgrounds of images consist of similar colors and lack of abrupt changes. (2) The resolution of images is high. With higher resolution for advanced sensors and application requirements, the number of pixels corresponding to one item increases and the difference between two adjacent pixels decreases. The common similarity of pixels in images offers the opportunity for approximate compression.

It is worth noting that even when the error threshold is 0%, the ratio of continuous similar words is still more than 4.5% and up to 46.5%. The substantial similarity in images motivates us to exploit the pixel-level similarity for more bit-write reduction in NVMs.

3. Similarity-aware Data Compression

3.1. Design Overview

Figure 4. The architecture overview of SimCom.

For NVM systems, image-based applications incur many incompressible writes to NVMs, which result in high write latency and energy consumption. SimCom achieves high compression performance via the pixel-level similarity in data writes. The efficient approximate compression reduces the data size and improves the memory performance in NVM systems.

Figure 4 shows the hardware architecture overview of SimCom. Specifically, the Adaptive Approximate Compression Logic and the Decompression Logic respectively implement the compression and decompression schemes of SimCom. The Quality Table is an on-chip cache (Ranjan et al., 2017b; Zhao et al., 2017), which stores the start and end addresses of memory regions for images together with their Approximation Factors (AF, i.e., approximation degrees). These addresses and AFs are specified through programmer annotations (Sampson et al., 2011, 2015; Miguel et al., 2015; Ranjan et al., 2017b) and conveyed to the Quality Table via ISA extensions (Esmaeilzadeh et al., 2012). An approximable bit in a tag entry indicates the precision of data blocks in caches (Esmaeilzadeh et al., 2012; Miguel et al., 2015).

In a write access, we assume approximable data are indicated by approximable bits like prior works (Esmaeilzadeh et al., 2012; Miguel et al., 2015). For approximable data, the Adaptive Approximate Compression Logic finds continuous similar words and compresses them into base words and runs. For precise data, i.e., data containing anything other than bitmaps, the Precise Compression Logic uses existing precise compression schemes instead (e.g., FPC (Dgien et al., 2014)). In a read access, compressed data indicated by compressible bits are decompressed in the Decompression Logic before responding to requests. In the decompression stage, if the approximable bit is set to 1, approximate decompression is used; otherwise, precise decompression is used.

3.2. Similarity Metric

We use the normalized difference between two partitioned words to quantify the similarity. Two words are considered similar only if their normalized difference is smaller than the AF. The normalized difference between words W_a and W_b is calculated using Equation 2:

Diff(W_a, W_b) = (1/CC) * Σ_{c=1}^{CC} |W_a[c] − W_b[c]| / V_max    (2)

The per-channel difference between W_a and W_b is normalized to V_max, the maximal value per channel of a pixel, which is determined by BPC (V_max = 2^BPC − 1). When BPC is 8, V_max is 255.
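A sketch of the similarity check, assuming words are given as tuples of per-channel values (the function names are ours):

```python
def norm_diff(word_a, word_b, bpc=8):
    """Per-channel absolute difference between two words, averaged over
    the channels and normalized by the maximal channel value 2**bpc - 1."""
    v_max = (1 << bpc) - 1
    cc = len(word_a)  # channel count of the (possibly partial) word
    return sum(abs(a - b) for a, b in zip(word_a, word_b)) / (cc * v_max)

def is_similar(word_a, word_b, af, bpc=8):
    """Two words are similar when their normalized difference is below AF."""
    return norm_diff(word_a, word_b, bpc) < af
```

For two adjacent RGB pixels such as (120, 64, 200) and (121, 64, 199), the normalized difference is 2/765 ≈ 0.0026, well below a typical AF of 0.05.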

3.3. Uniform Data Partition

In order to efficiently compress the data writes, we need to partition the data without decreasing the similarity in bitmaps. The intuitive solution is to partition the data at the pixel boundaries, whereby the partitioned words would be similar, since these words correspond to adjacent pixels in bitmaps. However, identifying pixel boundaries in the data becomes a new challenge. We cannot determine the positions of pixel boundaries without additional context information, such as the offset of the data in the bitmap and the corresponding bitmap format. A straightforward solution is to allocate each pixel with a fixed alignment (the alignment should be a factor of the data write size, e.g., 4 B), thus enabling fixed pixel boundary positions in the data. However, when the actual pixel size (e.g., 3 B) mismatches the alignment, this significantly decreases storage density and wastes storage space.

In order to preserve the similarity in partitioned words with low overheads, we propose a uniform scheme to partition the data in a write access. Due to the pixel-level similarity, the data form an approximate periodic cycle of the pixel size. Therefore, we advocate partitioning at the granularity of pixel size and leave the possible remaining bytes (when the data size isn’t a multiple of the pixel size) at the end as a partial word, called remainder. For example, if the pixel size is 3 bytes and the data write size is 64 bytes, data are divided into 21 words of 3 bytes and 1 partial word of 1 byte. As shown in Figure 6, though the partitioned positions are not pixel boundaries, the words containing data from adjacent pixels are similar due to the periodic cycle.
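The uniform partition can be sketched as follows (assuming 64-byte write blocks; the helper name is ours):

```python
def partition(data: bytes, pixel_size: int):
    """Split a write block at pixel-size granularity; trailing bytes that
    do not form a full word become the remainder (possibly empty)."""
    n_full = len(data) // pixel_size
    words = [data[i * pixel_size:(i + 1) * pixel_size] for i in range(n_full)]
    remainder = data[n_full * pixel_size:]
    return words, remainder
```

With a 3-byte pixel size and a 64-byte block, this yields 21 full words and a 1-byte remainder, matching the example above.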

3.4. Search for Continuous Similar Words

Since continuous similar words require that any two words in a group are similar (§2.4), finding a group of exact continuous similar words requires pairwise comparisons and thus O(n²) time (n denotes the number of words). The high time complexity incurs high latency and hardware overhead to accurately find all continuous similar words in a write access.

In order to alleviate the cost of searching for similar words during compression, we propose to approximately search for continuous similar words. Specifically, we slightly relax the requirements for each group of continuous similar words. The words in relaxed continuous similar words are only required to be similar to the base word (for simplicity, we use continuous similar words to represent relaxed continuous similar words in the following text unless specified). The accuracy of the approximate search is still constrained by the AF: even with the approximation in search, the normalized difference between any two words in a group is bounded by 2AF, since each differs from the base by at most AF.

Though the ideal candidate for the base word of a group of similar words is their average, we take the first word as the base for two reasons (in the following text, we use base word and base interchangeably): (1) taking the first word as the base simplifies the compression logic; (2) the resulting loss in compression performance is slight.

With the relaxation in similarity and the selection of the base for continuous similar words, the time complexity of finding continuous similar words decreases to O(n), which efficiently decreases the complexity of the compression logic and improves the compression performance.
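The relaxed, single-pass grouping can be sketched as follows; each word is compared only with the current base, so one scan suffices (run here counts all words in a group, including the base, which may differ from the paper's exact bookkeeping):

```python
def group_similar(words, af, diff):
    """One pass over the words: keep extending the current group while a
    word is similar to its base; otherwise start a new group.
    diff(a, b) is a normalized-difference function as in Equation 2."""
    pairs = []  # (base, run) pairs; run counts words in the group
    base, run = None, 0
    for word in words:
        if base is not None and diff(base, word) < af:
            run += 1
        else:
            if base is not None:
                pairs.append((base, run))
            base, run = word, 1
    if base is not None:
        pairs.append((base, run))
    return pairs
```

For the byte sequence [10, 11, 10, 200, 201] with AF = 0.02 and diff(a, b) = |a - b| / 255, this yields [(10, 3), (200, 2)].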

3.5. Compression-aware NVMs

Figure 5. The workflow of approximate compression.

Write: If the approximable bit of a write access is set to 1, approximate compression is used. Figure 5 illustrates the workflow of approximate compression for incoming data. (1) SimCom uniformly partitions the data into words and a possible remainder, and (2) sets the first word as the initial base. (3) The Word Processing Unit (Word-PU) uses Equation 2 to calculate the normalized difference between the base and each of the remaining words. If the normalized difference is larger than the AF, the current values of base and run are written into the compressed data; the current word is set as the new base and the run is reset to 0. (4) If the normalized difference is no larger than the threshold, SimCom only increases the run by one. (5) After processing each word, SimCom records the last pair of base and run in the compressed data. When a remainder exists in the current partition, the Remainder Processing Unit (Remainder-PU) obtains the normalized difference between the remainder and the last base using Equation 2, with the CC substituted by the number of channels in the remainder. If the normalized difference is larger than the AF, SimCom writes the remainder and sets the Most Significant Bit (MSB) of the run, called the remainder bit, to indicate the existence of the remainder in the compressed data. (6) SimCom uses the first byte of the compressed data to record the number of bases. If the approximable bit is reset to 0, existing precise compression schemes (e.g., FPC) are used. For both approximate and precise compression, if the compressed data size is smaller than that of the original data, SimCom writes the compressed data and sets the compressible bit to 1; otherwise, SimCom writes the uncompressed data and resets the compressible bit to 0.

Read: If both the compressible bit and the approximable bit are set to 1, approximate decompression is used to reconstruct the stored data. Specifically, for each pair of base and run in the compressed data, the base is used to fill in the decompressed data multiple times according to the run. The remainder bit is then checked: if it is set to 1, the remainder in the compressed data is used to complete the decompressed data; otherwise, the last base is used. If only the compressible bit is 1, compressed data are reconstructed by the inverse procedure of precise compression (e.g., FPC). If the compressible bit is 0, the data bypass the decompression procedure and are returned directly to the read access.
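Ignoring the precise-compression branch and the flag bits, the approximate read path can be sketched as follows (run here counts all the words a base stands for, including itself; names are ours):

```python
def approx_decompress(pairs, remainder=b""):
    """Rebuild the (approximate) block: each base fills the output as many
    times as its run indicates; a stored remainder completes the block."""
    out = bytearray()
    for base, run in pairs:
        out += base * run  # replicate the base 'run' times
    out += remainder
    return bytes(out)
```

A single pair (3-byte base, run 21) plus a 1-byte remainder reconstructs a full 64-byte block.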

Figure 6. An example of approximate compression and decompression scheme. (AF = 0.05)

Figure 6 shows an example of approximate compression/decompression for data writes of 16 bytes. Although the partitioned positions aren’t boundaries among pixels A, B, C, and D, partitioned words show pixel-level similarity and form a group of continuous similar words with a remainder of one byte. By default, the first word in similar words is selected as the base. In this example, the normalized difference between the remainder and the base exceeds AF. Therefore, the remainder is placed at the end of compressed data and the remainder bit of last run is 1. After compression, the data size is reduced by 10 bytes. During decompression, the base is used to fill in similar words (i.e., the shadowed bytes in the figure).

4. Adaptive Approximate Compression

In order to handle different bitmap formats in the compression/decompression logic, the approximate compression proposed in §3 requires extra metadata including CC and BPC. Though it’s possible to annotate the metadata to be stored in cache tags (Miguel et al., 2015) or address table in memory controllers (Ranjan et al., 2017b, a), these techniques cause additional overheads and programmer annotations. Moreover, users need to confirm the bitmap formats and annotate these metadata before execution. Hence, in this section, we propose to leverage the image characteristics and adaptively select the appropriate mode for data compression without additional programmer annotations.

4.1. Adaptive Compression Scheme

1) Why use predefined compression modes for different image formats? Images generally include grayscale and color images. Grayscale images contain only one channel, and color images in the RGB color space consist of red, green, and blue channels, as well as an optional alpha channel. In other color spaces (e.g., YUV), similar components (e.g., one luminance channel and two chrominance channels) exist. The BPC in common images is 8 bits, representing 256 levels in each channel. 24 bits per pixel (3 channels with 8 bits per channel) represent more than 16 million colors, while the number of colors discriminated by the human eye is up to 10 million (Judd, 1975; Leong, 2006). For applications processing HD (high-definition) images, 16 bits per channel is enough to encode the necessary colors. Therefore, we propose to use six compression modes to handle different image formats. The options for CC are 1 (e.g., grayscale), 3 (e.g., RGB), and 4 (e.g., RGB with the alpha channel), and the options for BPC are 8 and 16.

Figure 7. Adaptive compression scheme overview. (The two integers in each compression mode denote the number of channels and the bytes per channel.)

2) How to determine the suitable compression mode for a write access? A straightforward approach is to sample some write accesses for an efficient compression mode to be applied on later NVM writes. Sampling works when all write accesses have regular pattern formats (e.g., all applications using one bitmap format). However, sampling often fails when data writes have random pattern formats (e.g., applications using different bitmap formats are running in an NVM system). Instead of sampling, SimCom performs six compression modes in parallel and selects the compression mode with the minimal mean difference. Mean difference indicates the average difference between every two adjacent words in the data. We observe that the mean difference of the right compression mode (i.e., the mode matching the bitmap format) is minimal, which makes sense due to the pixel-level similarity. Figure 7 shows the overview of the adaptive compression scheme used in SimCom. Six compression modes with different CC and BPC process data in parallel. The Mode Selector first selects the mode with the minimal mean difference. If multiple modes have the minimal mean difference, the Mode Selector chooses the one with the minimal compressed data size. For the simplicity of the compression logic, SimCom reuses the normalized difference between each word and its base as the difference between adjacent words. Due to the error-tolerance of applications and the similarity between words and their bases, the reuse of the normalized difference is acceptable. The experimental evaluation in §5 verifies the correctness of mode selection and result reuse.
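The selection step can be sketched as follows. For simplicity this sketch computes the mean difference between adjacent words directly (rather than reusing the word-vs-base results as SimCom does), and the mode table pairs channel count with bytes per channel:

```python
# (channel count CC, bytes per channel BPC/8): the six modes from this section.
MODES = [(1, 1), (1, 2), (3, 1), (3, 2), (4, 1), (4, 2)]

def mean_diff(data, word_size, v_max=255):
    """Average normalized byte-wise difference between adjacent words."""
    words = [data[i:i + word_size]
             for i in range(0, len(data) - word_size + 1, word_size)]
    if len(words) < 2:
        return 0.0
    diffs = [sum(abs(a - b) for a, b in zip(w1, w2)) / (len(w1) * v_max)
             for w1, w2 in zip(words, words[1:])]
    return sum(diffs) / len(diffs)

def select_mode(data):
    """Pick the mode whose word size yields the minimal mean difference."""
    return min(MODES, key=lambda m: mean_diff(data, m[0] * m[1]))
```

For a block of slowly varying 3-byte RGB pixels, the 3-byte partition lines up with the pixel period, so the (3, 1) mode yields the smallest mean difference.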

4.2. Metadata Management

There are two classes of metadata in SimCom. The first-class metadata are used for approximately compressed data in SimCom, including the choice of compression mode, the number of bases, and one remainder bit. When using the compression mode with one channel and one byte per channel, each pair of base and run occupies 2 bytes. Therefore, there are no more than 32 pairs of base and run in compressed data with a 64-byte write data granularity. For the other compression modes, the number of bases in the compressed data is smaller. As a result, we only use 5 bits to record the number of bases. SimCom uses the first byte of the compressed data as metadata: the highest 3 bits encode the choice of compression mode and the remaining 5 bits record the number of bases in the compressed data. As described in §3.5, the MSB of the last run is used as the remainder bit to indicate whether a remainder exists in the compressed data. Hence, the first-class metadata are stored in the compressed data. The second-class metadata are used for approximate compression, including one approximable bit and one compressible bit (§3.1). SimCom stores the approximable and compressible bits in a separate region in NVMs like prior works (Pekhimenko et al., 2012; Dgien et al., 2014; Miguel et al., 2015; Ranjan et al., 2017a). However, the two bits can be packed into compressed data to reduce NVM accesses and improve memory bandwidth like recent works (Hong et al., 2018; Young et al., 2018).
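The first-class metadata layout can be sketched as follows (the exact bit encoding, e.g., whether the base count is stored minus one, is an assumption on our part):

```python
def pack_header(mode_id, n_bases):
    """First byte of compressed data: 3-bit compression-mode choice in the
    high bits, 5-bit number of bases in the low bits (assumed encoding)."""
    assert 0 <= mode_id < 8 and 0 <= n_bases < 32
    return (mode_id << 5) | n_bases

def unpack_header(byte):
    """Inverse of pack_header: recover (mode_id, n_bases)."""
    return byte >> 5, byte & 0x1F

def set_remainder_bit(run):
    """The MSB of the last run byte flags a trailing remainder."""
    assert 0 <= run < 128
    return run | 0x80
```

Packing and unpacking round-trip: a header for mode 5 with 21 bases decodes back to (5, 21), and setting the remainder bit leaves the low 7 run bits intact.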

5. Performance Evaluation

5.1. Experimental Setting

CPU 1 core, X86-64 processor, 2 GHz
L1 I/D cache 32 KB, 2 ways, LRU
L2 cache 1024 KB, 8 ways, LRU
Cache block size 64 B
Main Memory using PCM
Memory controller FRFCFS
Read/Write latency 120 ns/150 ns
Memory organization 4 GB, 8 B write unit size
Table 2. System configurations.

We implement SimCom in GEM5 (Binkert et al., 2011) with NVMain (Poremba et al., 2015). The system configurations of GEM5 and NVMain are listed in Table 2. Since SimCom focuses on data compression and is orthogonal to the underlying memory model, we simply use a First-Ready First-Come First-Serve (FRFCFS) memory controller to serve NVM accesses. We evaluate the performance with six image-based workloads, i.e., jpeg, sobel, and kmeans from AxBench (Yazdanbakhsh et al., 2017) and 2dconv, debayer, and histeq from PERFECT (Barker et al., 2013). These workloads cover various domains: jpeg for compression; sobel, 2dconv, debayer, and histeq for image processing; and kmeans for machine learning. The ratios of approximable data in these workloads are shown in Table 1. As suggested in (Yazdanbakhsh et al., 2017), we use RMSE (§2.1) as the metric to measure the output error (i.e., the quality of the output image) compared with the precise compression result. The input images come from the Kodak dataset (Franzen, [n. d.]). The output errors are reported as the average RMSE over 6 images. Before running these workloads, we warm up the system with 100 million instructions.
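The RMSE metric used throughout the evaluation can be stated concretely. A minimal sketch, assuming 8-bit pixel values and flat pixel sequences; the percentage normalization against the full 8-bit scale is our reading of how output error thresholds like "3%" are expressed.

```python
def rmse(output, reference):
    """Root-mean-square error between a relaxed output image and the
    precise reference, both given as flat sequences of pixel values."""
    assert len(output) == len(reference)
    mse = sum((o - r) ** 2 for o, r in zip(output, reference)) / len(output)
    return mse ** 0.5

def rmse_percent(output, reference, full_scale=255):
    """Report RMSE as a percentage of the 8-bit dynamic range."""
    return 100.0 * rmse(output, reference) / full_scale
```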

Figure 8. The performance of jpeg: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.
Figure 9. The performance of sobel: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.

We have evaluated the following compression schemes (FNW (Cho and Lee, 2009) is used to further reduce bit-flips in all schemes):

  • FPC: FPC (Dgien et al., 2014) exploits the general frequent patterns and compresses the matched words with short prefix bits. For fair comparisons, we enhance this scheme by adding approximation. Specifically, when the difference between a partitioned word and a relaxed word matching a frequent pattern is within the error constraint, the pattern is used to compress the word.

  • BDI: BDI (Pekhimenko et al., 2012) leverages the narrow-value characteristics of arrays and compresses cache block data into bases with small deltas. This scheme is an approximate version of BDI (Pekhimenko et al., 2012). It relaxes the narrow-value constraint and compresses words that slightly overflow the delta limit.

  • BiScaling: This scheme uses bidirectional precision scaling to approximately compress the data to be written (Ranjan et al., 2017a).

  • ApproxCom: This is our proposed scheme that leverages the pixel-level similarity for approximate compression.

  • SimCom: This is our proposed scheme leveraging pixel-level similarity and adaptive compression (i.e., ApproxCom + adaptive compression), which eliminates the annotations on data formats used in BiScaling and ApproxCom.

Since BiScaling, ApproxCom, and SimCom focus on approximate compression on approximable data, we use precise FPC to compress precise data in these schemes.
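All evaluated schemes additionally apply Flip-N-Write (FNW) to the compressed data to reduce bit-flips. A minimal software sketch of FNW on 8-bit words, following the original technique (Cho and Lee, 2009); the per-word flip-bit granularity and the returned bit-write count (which, for simplicity, omits the write of the flip bit itself) are modeling choices of ours.

```python
def fnw_write(old_word, new_word, width=8):
    """Flip-N-Write: if more than half of the bits would flip, store the
    bitwise complement and set the flip bit instead.

    Returns (stored_word, flip_bit, data_bit_writes)."""
    flips = bin(old_word ^ new_word).count("1")
    if flips > width // 2:
        inverted = new_word ^ ((1 << width) - 1)
        # Writing the complement flips only width - flips data bits.
        return inverted, 1, width - flips
    return new_word, 0, flips
```

For example, overwriting 0x00 with 0xFF flips all 8 bits naively, but FNW stores 0x00 with the flip bit set, writing no data bits at all.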

Schemes Approximation Factor (AF) Channel Count (CC) Bits Per Channel (BPC)
Table 3. The annotation requirements in compression schemes. (✓: require the annotation; ✗: no requirements for the annotation.)
Figure 10. The performance of kmeans: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.
Figure 11. The performance of 2dconv: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.

We leverage programmer annotations (Sampson et al., 2011, 2015) and ISA extensions (Esmaeilzadeh et al., 2012) to deliver the necessary information to the storage system, as in prior works (Esmaeilzadeh et al., 2012; Miguel et al., 2015; Miguel et al., 2016; Ranjan et al., 2017b). Programmer annotations are mature techniques widely used in approximate storage systems (Sampson et al., 2011; Esmaeilzadeh et al., 2012; Miguel et al., 2015; Miguel et al., 2016; Ranjan et al., 2017b). We use programmer annotations to mark bitmaps as approximable data in the workloads. Through ISA extensions, write accesses carrying approximable data are identified and processed by the approximate compression logic. Table 3 shows the annotations required by each compression scheme.

Figure 12. The performance of debayer: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.
Figure 13. The performance of histeq: (a) bit-write ratio, (b) write latency, and (c) total energy consumption with various output errors.

5.2. Quality Efficiency

Figures 8-13 show the memory performance improvement in terms of bit-write reduction, write latency, and energy consumption with various output errors. During our experiments, we observe serious quality degradation when the output error approaches 10%. Therefore, we only plot the curves with output errors under 10%.

Bit-write Reduction. Figures 8(a)-13(a) show the bit-write reduction with different output errors. The bit-write ratio denotes the percentage of bit-writes remaining after compression and FNW; a lower bit-write ratio implies a higher NVM performance improvement. As the output error increases, the bit-write ratios of all approximate compression schemes decrease. Due to the efficiency of pixel-level similarity, ApproxCom and SimCom generate fewer bit-writes than the other approximate compression schemes at most output error levels. SimCom achieves 8.6%/10.3%/8.3% lower bit-write ratios on average than FPC/BDI/BiScaling under the same output error of 3% (the same constraint as in (Ranjan et al., 2017a)). When the output error increases to 5%, the average reductions in bit-write ratio become 9.2%/11.1%/8.5%. In kmeans and debayer, the benefits of approximation decrease because their ratios of approximable data are smaller than those of the other workloads, as shown in Table 1. Due to the flexibility of adaptive compression, SimCom obtains slightly lower bit-write ratios than ApproxCom.

Write Latency. Figures 8(b)-13(b) show the write latency with different output errors. Due to the electric current constraint in NVMs, a write operation is divided into several serial write units (Cho and Lee, 2009; Yue and Zhu, 2013). Therefore, the write latency mainly depends on the data size. Compared with precise FPC and precise BDI, SimCom achieves average write latency reductions of 33.0% and 34.8% subject to a 3% output error. When the quality loss constraint is relaxed to 5%, the average write latency reductions become 38.2% and 40.0%. Among the approximate compression schemes, SimCom's advantage in bit-write ratio translates into lower write latency. Under 3% and 5% quality loss, SimCom achieves average write latency reductions of 21.5%/28.2%/30.3% and 24.0%/30.1%/31.6% compared with FPC/BDI/BiScaling, respectively.
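The serialized write-unit model above can be made concrete. A sketch using the Table 2 parameters (8 B write unit, 150 ns write latency), under the assumption that the 150 ns figure applies per write unit; the actual NVMain timing model is more detailed.

```python
import math

WRITE_UNIT_BYTES = 8   # write unit size from Table 2
UNIT_WRITE_NS = 150    # assumed per-unit write latency (Table 2)

def write_latency_ns(compressed_bytes):
    """Latency of a serialized write: one unit per 8 compressed bytes."""
    units = math.ceil(compressed_bytes / WRITE_UNIT_BYTES)
    return units * UNIT_WRITE_NS
```

Under this model, compressing a 64-byte block down to 24 bytes cuts the write from 8 serial units to 3, so the latency benefit tracks the compression ratio directly.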

Energy Consumption. Figures 8(c)-13(c) show the energy consumption under various quality losses. Since the energy consumed in the programming process dominates the total energy consumption of NVMs (Palangappa and Mohanram, 2016), the number of bit-writes determines the energy consumption. By sacrificing only 3% of output quality, SimCom obtains 28.3% and 29.0% energy savings over precise FPC and precise BDI. The average energy savings become 34.7% and 35.2% when the quality loss constraint is 5%. Compared with the approximate compression schemes, SimCom reduces the consumed energy by 19.1%/23.0%/21.6% and 22.4%/26.3%/24.0% relative to FPC/BDI/BiScaling with 3% and 5% quality loss, respectively.

5.3. Breakdown of Bit-write Reduction

In order to evaluate the contribution of each technique (i.e., precise compression for precise data, approximate compression for approximable data, and FNW for compressed data) in all evaluated schemes, we record the bit-write reduction from each technique and present the results in Figures 14(a) and 14(b). We set the output error constraints to 3% (Ranjan et al., 2017a) and 5% (aggressive approximation) for this evaluation.

As shown in Figures 14(a) and 14(b), SimCom gains the largest bit-write reduction from the approximate compression of error-tolerant data, and thus obtains an advantage in bit-write reduction over the other schemes. Although the percentages of precise data are large in debayer and histeq, precise compression performs poorly due to pattern mismatches and irregular data types in these workloads, resulting in the inefficiency of precise compression (i.e., FPC and BDI).

Figure 14. Breakdown of bit-write reduction: (a) output error < 3%; (b) output error < 5%.

5.4. Output Quality

Figures 15-20 show the output images subject to an output error constraint of 3%, alongside the original images from precise computation (Ranjan et al., 2017a). Under 3% quality loss, the visual difference between the relaxed output images and the original images is slight. The choice of error constraint may differ across applications, which requires knowledge of their accuracy requirements. For applications with high tolerance for noise in input data, such as feature extraction and machine learning, the error constraint can be set relatively larger than for applications requiring high accuracy. Due to differences in image processing algorithms, the AF for a specific output error constraint varies across workloads. A practical way to determine the AF is to search for a suitable AF using small canary inputs and apply the inferred AF to full-size inputs (Laurenzano et al., 2016; Xu et al., 2018b).
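The canary-input strategy can be sketched as follows. This is an illustrative outline of the approach from (Laurenzano et al., 2016; Xu et al., 2018b), not their implementation: `run_canary` is a hypothetical callback that runs the workload on the small canary input with a given AF and returns the resulting output error.

```python
def search_af(run_canary, error_budget, candidates):
    """Try AF candidates from most to least aggressive on the canary input
    and return the first one whose output error meets the budget."""
    for af in sorted(candidates, reverse=True):
        if run_canary(af) <= error_budget:
            return af
    return 0  # no candidate fits: fall back to precise execution
```

The inferred AF is then applied to the full-size input, trading a short calibration run for per-input quality control.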

Figure 15. Output quality of jpeg: (a) original output; (b) output error < 3%.
Figure 16. Output quality of sobel: (a) original output; (b) output error < 3%.
Figure 17. Output quality of kmeans: (a) original output; (b) output error < 3%.
Figure 18. Output quality of 2dconv: (a) original output; (b) output error < 3%.
Figure 19. Output quality of debayer: (a) original output; (b) output error < 3%.
Figure 20. Output quality of histeq: (a) original output; (b) output error < 3%.

5.5. Adaptability for Bitmap Format Variance

Figure 21. The bit-write ratio in jpeg with bitmaps of different formats. (A bitmap format of (m, n) indicates CC = m and BPC = n.)

In order to verify the adaptability of SimCom, we evaluate the jpeg workload with input images of different formats. As shown in Figure 21, SimCom achieves bit-write ratios comparable to those of ApproxCom (within 1%). Without annotations on bitmap formats, SimCom infers the data format from the mean difference (§4.1) of the data. The pixel-level similarity in the data guarantees that the right compression mode (§4) tends to yield the minimal mean difference.

Mode             Ratio (%) per bitmap format
                 (1, 8)  (3, 8)  (4, 8)  (1, 16)  (2, 16)  (4, 16)
1C1B               82.4    47.2     0.6      0.2      0.2      0.2
3C1B                0.2    34.1     0        0        0        0
4C1B                0.1     0.4    96.9      0        0        0
1C2B               15.3     7.4     0       98.5     58.3      1.0
3C2B                0       7.1     0        0.2     38.5      0
4C2B                0.1     0.1     2.2      0.6      1.6     97.9
Incompressible      1.9     3.7     0.3      0.5      1.4      0.9
Table 4. Statistics for the Mode Selector in SimCom with output error < 3%. (The two integers in a bitmap format denote CC and BPC, respectively.)

An interesting point is that SimCom obtains slightly lower bit-write ratios than ApproxCom when CC is 3 (e.g., bitmap formats of (3, 8) and (3, 16) under output error thresholds of 3% and 5%, respectively), and marginally higher bit-write ratios otherwise. Since the Mode Selector picks the compression mode with the minimal mean difference, it may select a mode whose mean difference is slightly smaller but whose compressed data size is much larger than that of the right compression mode. This conservative strategy brings only a minor performance decrease, as shown in Figure 21. We record the selections made inside the Mode Selector of SimCom. Table 4 shows that SimCom obtains the right compression mode in most cases (the numbers in boldface). However, when CC is 3, the Mode Selector sometimes chooses a mode in which CC is 1. The reason is that when the values of the three channels in a pixel are identical, e.g., white pixels or grayscale images stored in RGB format, a single-channel compression mode achieves the same mean difference with a smaller compressed data size than the right compression mode. Therefore, SimCom achieves adaptive mode selection and low bit-write ratios for various bitmap formats.

5.6. Discussion

The Principle of Approximate Compression Mode. There are six available compression modes in SimCom. Compression modes with other CCs and BPCs can be added to SimCom in the same way as the existing ones. For images with BPC > 16, an alternative approach is to downscale the precision so that the images fit the predefined compression modes in SimCom. Due to the error tolerance of images, slight precision downscaling of images with large BPC would not cause significant quality loss.
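The downscaling fallback mentioned above could look like the following sketch: right-shift each channel value so it fits a predefined 16-bit mode. The round-to-nearest policy and saturation are our assumptions; the paper only states that the precision is downscaled.

```python
def downscale_channel(value, src_bits, dst_bits=16):
    """Reduce a channel value from src_bits to dst_bits of precision so it
    fits one of SimCom's predefined compression modes."""
    shift = src_bits - dst_bits
    if shift <= 0:
        return value  # already fits a predefined mode
    # Round to nearest before truncating the low bits, then saturate.
    return min((value + (1 << (shift - 1))) >> shift, (1 << dst_bits) - 1)
```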

Hardware Overhead of SimCom. The majority of the hardware overhead of SimCom comes from the parallel execution of six approximate compression logics, which can be reduced by reusing a single logic. In that case, the six compression modes are executed one by one in the compression logic, trading compression speed for hardware efficiency.

Architecture Support for SimCom. In the current testbed, SimCom requires architecture support (i.e., microarchitecture modifications and ISA extensions), like prior works (Esmaeilzadeh et al., 2012; Miguel et al., 2015; Miguel et al., 2016; Ranjan et al., 2017b), to identify write accesses with approximable data; such support is widely used in approximate storage systems (Sampson et al., 2011; Esmaeilzadeh et al., 2012; Sampson et al., 2015; Miguel et al., 2015; Miguel et al., 2016; Ranjan et al., 2017b). For image-based applications (e.g., machine learning and computer vision), memory performance and energy efficiency are important to overall system performance. In addition, these applications are generally tolerant of minor errors. Moreover, power consumption is constrained on certain platforms (e.g., smartphones and embedded devices). Therefore, providing architecture support for approximate storage systems is worthwhile. With architecture support like Truffle (Esmaeilzadeh et al., 2012), SimCom only requires small hardware changes in the NVM module controller. Alternatively, accuracy requirements can be delivered via software interfaces without ISA extensions (Ranjan et al., 2017a; Zhao et al., 2017; Liu et al., 2011). Through these interfaces, approximable data are stored in a separate memory region, and read or write accesses to that region can be identified by their memory addresses, thus mitigating the requirement for architecture support.

6. Related Work

Data Compression in NVMs. Data compression schemes compress data to reduce the bits written to NVMs. FPC (Dgien et al., 2014) uses static data patterns to compress frequent patterns into short prefix bits with remaining bits. However, these data patterns, optimized for generality across applications, do not match bitmaps, resulting in poor compression performance for image-based applications. BDI (Pekhimenko et al., 2012) leverages the characteristics of narrow values in arrays to encode each word using bases with small deltas. As shown in §5, it is difficult for bitmaps to satisfy the constraints of FPC or BDI even with approximation. Different from FPC and BDI, SimCom leverages the pixel-level similarity in bitmaps and efficiently trades slight output quality for performance improvement.

Approximate Image Storage. To address the challenge of massive image collections, several approximation approaches have been proposed to improve the efficiency of image storage. A biased MLC write scheme (Guo et al., 2016; Jevdjic et al., 2017) is used to balance drift and write errors in MLC PCM. Selective ECC is applied to images according to the importance of the encoded bits (Guo et al., 2016). A progressive encoding scheme can improve the read performance of images (Yan et al., 2017). However, these schemes rely on the significant entropy differences among encoded image bits, which do not exist in bitmaps. Therefore, encoded-image approximation is inefficient for the writes of bitmaps in NVMs. Recent work (Zhao et al., 2017) proposes to selectively write pixels in an approximate window by writing soft bits in MLC STT-MRAM main memory. The approximation is efficient when loading entire images from disks to MLC STT-MRAM. However, it is limited to MLC STT-MRAM and requires searching for similar contents in other memory blocks, which leverages inter-block similarity and leads to additional hardware overhead and latency when writing data from cache to NVMs.

Approximate Cache & Main Memory. Accesses to memory can be served with predicted values based on previous data patterns (Miguel et al., 2014). Doppelgänger associates similar data blocks with a single tag entry (Miguel et al., 2015). The Bunker cache leverages spatio-value similarity and maps similar data blocks to an identical cache entry (Miguel et al., 2016). The inter-block similarity used in these caches is orthogonal to the pixel-level similarity exploited in our work. Flikker reduces the refresh rate of the DRAM portion containing error-tolerant data (Liu et al., 2011). Bidirectional precision scaling has been proposed to compress the data written to DRAM (Ranjan et al., 2017a). However, indiscriminately reducing the precision of all data can significantly decrease image quality. Unlike these schemes, SimCom exploits the pixel-level similarity in bitmaps and efficiently reduces the writes of similar words in NVMs with minor quality loss.

7. Conclusion

Bit-write reduction in NVMs is important for the performance of NVM-based main memory. By exploiting the error tolerance and similarity in bitmaps, SimCom efficiently reduces the writes of similar words in write accesses to NVMs on-the-fly. Due to the flexibility and efficiency of its approximate compression, SimCom delivers higher performance than state-of-the-art compression schemes while requiring only minimal programmer annotations.


  • Baek and Chilimbi (2010) Woongki Baek and Trishul M. Chilimbi. 2010. Green: A Framework for Supporting Energy-Conscious Programming using Controlled Approximation. In Proc. PLDI. 198–209. https://doi.org/10.1145/1806596.1806620
  • Barker et al. (2013) Kevin Barker, Thomas Benson, Dan Campbell, David Ediger, Roberto Gioiosa, Adolfy Hoisie, Darren Kerbyson, Joseph Manzano, Andres Marquez, Leon Song, et al. 2013. PERFECT (Power Efficiency Revolution For Embedded Computing Technologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute (2013).
  • Binkert et al. (2011) Nathan L. Binkert, Bradford M. Beckmann, Gabriel Black, Steven K. Reinhardt, Ali G. Saidi, Arkaprava Basu, Joel Hestness, Derek Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib Bin Altaf, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The gem5 Simulator. SIGARCH Computer Architecture News 39, 2 (2011), 1–7. https://doi.org/10.1145/2024716.2024718
  • Cho and Lee (2009) Sangyeun Cho and Hyunjin Lee. 2009. Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance. In Proc. MICRO. ACM, 347–357. https://doi.org/10.1145/1669112.1669157
  • Condit et al. (2009) Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin C. Lee, Doug Burger, and Derrick Coetzee. 2009. Better I/O Through Byte-Addressable, Persistent Memory. In Proc. SOSP. ACM, 133–146. https://doi.org/10.1145/1629575.1629589
  • David et al. (2018) Tudor David, Aleksandar Dragojevic, Rachid Guerraoui, and Igor Zablotchi. 2018. Log-Free Concurrent Data Structures. In Proc. ATC. USENIX Association, 373–386. https://www.usenix.org/conference/atc18/presentation/david
  • Dgien et al. (2014) David B. Dgien, Poovaiah M. Palangappa, Nathan Altay Hunter, Jiayin Li, and Kartik Mohanram. 2014. Compression Architecture for Bit-write Reduction in Non-volatile Memory Technologies. In Proc. NANOARCH. ACM, 51–56. https://doi.org/10.1109/NANOARCH.2014.6880482
  • Dufaux et al. (2009) Frédéric Dufaux, Gary J Sullivan, and Touradj Ebrahimi. 2009. The JPEG XR Image Coding Standard [Standards in a Nutshell]. IEEE Signal Processing Magazine 26, 6 (2009).
  • Duff (2017) Tom Duff. 2017. Deep Compositing Using Lie Algebras. ACM Trans. Graph. 36, 3 (2017), 26:1–26:12. https://doi.org/10.1145/3023386
  • Esmaeilzadeh et al. (2012) Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture Support for Disciplined Approximate Programming. In Proc. ASPLOS. 301–312. https://doi.org/10.1145/2150976.2151008
  • Franzen ([n. d.]) Rich Franzen. [n. d.]. Kodak Lossless True Color Image Suite. Retrieved April 10, 2019 from http://r0k.us/graphics/kodak/
  • Guo et al. (2016) Qing Guo, Karin Strauss, Luis Ceze, and Henrique S. Malvar. 2016. High-Density Image Storage Using Approximate Memory Cells. In Proc. ASPLOS. 413–426. https://doi.org/10.1145/2872362.2872413
  • Guo et al. (2018) Yuncheng Guo, Yu Hua, and Pengfei Zuo. 2018. DFPC: A Dynamic Frequent Pattern Compression Scheme in NVM-based Main Memory. In Proc. DATE. IEEE, 1622–1627. https://doi.org/10.23919/DATE.2018.8342274
  • Hong et al. (2018) Seokin Hong, Prashant Jayaprakash Nair, Bülent Abali, Alper Buyuktosunoglu, Kyu-Hyoun Kim, and Michael B. Healy. 2018. Attaché: Towards Ideal Memory Compression by Mitigating Metadata Bandwidth Overheads. In Proc. MICRO. IEEE, 326–338. https://doi.org/10.1109/MICRO.2018.00034
  • Hwang et al. (2018) Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. 2018. Endurable Transient Inconsistency in Byte-Addressable Persistent B+-Tree. In Proc. FAST. USENIX Association, 187–200. https://www.usenix.org/conference/fast18/presentation/hwang
  • Jacobvitz et al. (2013) Adam N. Jacobvitz, A. Robert Calderbank, and Daniel J. Sorin. 2013. Coset Coding to Extend the Lifetime of Memory. In Proc. HPCA. IEEE Computer Society, 222–233. https://doi.org/10.1109/HPCA.2013.6522321
  • Jevdjic et al. (2017) Djordje Jevdjic, Karin Strauss, Luis Ceze, and Henrique S. Malvar. 2017. Approximate Storage of Compressed and Encrypted Videos. In Proc. ASPLOS. 361–373. https://doi.org/10.1145/3037697.3037718
  • Judd (1975) Deane B Judd. 1975. Color in Business, Science and Industry. Wiley-Interscience.
  • Laurenzano et al. (2016) Michael A. Laurenzano, Parker Hill, Mehrzad Samadi, Scott A. Mahlke, Jason Mars, and Lingjia Tang. 2016. Input Responsiveness: Using Canary Inputs to Dynamically Steer Approximation. In Proc. PLDI. 161–176. https://doi.org/10.1145/2908080.2908087
  • Lee et al. (2009) Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting Phase Change Memory as a Scalable DRAM Alternative. In Proc. ISCA. ACM, 2–13. https://doi.org/10.1145/1555754.1555758
  • Leong (2006) J Leong. 2006. Number of colors distinguishable by the human eye. Hypertextbook,(ed.). Wyszecki, Gunter. Color. Chicago: World Book Inc 824 (2006).
  • Li et al. (2013) Zhongqi Li, Ruijin Zhou, and Tao Li. 2013. Exploring High-Performance and Energy Proportional Interface for Phase Change Memory Systems. In Proc. HPCA. IEEE Computer Society, 210–221. https://doi.org/10.1109/HPCA.2013.6522320
  • Liu et al. (2011) Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G. Zorn. 2011. Flikker: Saving DRAM Refresh-power through Critical Data Partitioning. In Proc. ASPLOS. 213–224. https://doi.org/10.1145/1950365.1950391
  • Lowe (2004) David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  • Miguel et al. (2016) Joshua San Miguel, Jorge Albericio, Natalie D. Enright Jerger, and Aamer Jaleel. 2016. The Bunker Cache for Spatio-Value Approximation. In Proc. MICRO. 43:1–43:12. https://doi.org/10.1109/MICRO.2016.7783746
  • Miguel et al. (2015) Joshua San Miguel, Jorge Albericio, Andreas Moshovos, and Natalie D. Enright Jerger. 2015. Doppelgänger: A Cache for Approximate Computing. In Proc. MICRO. 50–61. https://doi.org/10.1145/2830772.2830790
  • Miguel et al. (2014) Joshua San Miguel, Mario Badr, and Natalie D. Enright Jerger. 2014. Load Value Approximation. In Proc. MICRO. 127–139. https://doi.org/10.1109/MICRO.2014.22
  • Palangappa and Mohanram (2016) Poovaiah M. Palangappa and Kartik Mohanram. 2016. CompEx: Compression-Expansion Coding for Energy, Latency, and Lifetime Improvements in MLC/TLC NVM. In Proc. HPCA. IEEE Computer Society, 90–101. https://doi.org/10.1109/HPCA.2016.7446056
  • Palangappa and Mohanram (2018) Poovaiah M. Palangappa and Kartik Mohanram. 2018. CASTLE: Compression Architecture for Secure Low Latency, Low Energy, High Endurance NVMs. In Proc. DAC. ACM, 87:1–87:6. https://doi.org/10.1145/3195970.3196007
  • Pekhimenko et al. (2012) Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2012. Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. In Proc. PACT. 377–388. https://doi.org/10.1145/2370816.2370870
  • Poremba et al. (2015) Matthew Poremba, Tao Zhang, and Yuan Xie. 2015. NVMain 2.0: A User-Friendly Memory Simulator to Model (Non-)Volatile Memory Systems. CAL 14, 2 (2015), 140–143. https://doi.org/10.1109/LCA.2015.2402435
  • Porter and Duff (1984) Thomas K. Porter and Tom Duff. 1984. Compositing Digital Images. In Proc. SIGGRAPH. ACM, 253–259. https://doi.org/10.1145/800031.808606
  • Ranjan et al. (2017a) Ashish Ranjan, Arnab Raha, Vijay Raghunathan, and Anand Raghunathan. 2017a. Approximate Memory Compression for Energy-efficiency. In Proc. ISLPED. 1–6. https://doi.org/10.1109/ISLPED.2017.8009173
  • Ranjan et al. (2017b) Ashish Ranjan, Swagath Venkataramani, Zoha Pajouhi, Rangharajan Venkatesan, Kaushik Roy, and Anand Raghunathan. 2017b. STAxCache: An Approximate, Energy Efficient STT-MRAM Cache. In Proc. DATE. 356–361. https://doi.org/10.23919/DATE.2017.7927016
  • Rublee et al. (2011) Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski. 2011. ORB: an efficient alternative to SIFT or SURF. In Proc. ICCV. IEEE Computer Society, 2564–2571. https://doi.org/10.1109/ICCV.2011.6126544
  • Samadi et al. (2014) Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott A. Mahlke. 2014. Paraprox: Pattern-Based Approximation for Data Parallel Applications. In Proc. ASPLOS. 35–50. https://doi.org/10.1145/2541940.2541948
  • Sampson et al. (2015) Adrian Sampson, André Baixo, Benjamin Ransford, Thierry Moreau, Joshua Yip, Luis Ceze, and Mark Oskin. 2015. ACCEPT: A Programmer-Guided Compiler Framework for Practical Approximate Computing. University of Washington Technical Report UW-CSE-15-01 1 (2015).
  • Sampson et al. (2011) Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011. EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In Proc. PLDI. 164–174. https://doi.org/10.1145/1993498.1993518
  • Sampson et al. (2013) Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2013. Approximate Storage in Solid-State Memories. In Proc. MICRO. 25–36. https://doi.org/10.1145/2540708.2540712
  • Shin et al. (2017) Seunghee Shin, Satish Kumar Tirukkovalluri, James Tuck, and Yan Solihin. 2017. Proteus: A Flexible and Fast Software Supported Hardware Logging approachs for NVM. In Proc. MICRO. ACM, 178–190. https://doi.org/10.1145/3123939.3124539
  • Sui et al. (2016) Xin Sui, Andrew Lenharth, Donald S. Fussell, and Keshav Pingali. 2016. Proactive Control of Approximate Programs. In Proc. ASPLOS. ACM, 607–621. https://doi.org/10.1145/2872362.2872402
  • Wallace (1991) Gregory K. Wallace. 1991. The JPEG Still Picture Compression Standard. Commun. ACM 34, 4 (1991), 30–44. https://doi.org/10.1145/103085.103089
  • Xu et al. (2018a) Jie Xu, Dan Feng, Yu Hua, Wei Tong, Jingning Liu, and Chunyan Li. 2018a. Extending the Lifetime of NVMs with Compression. In Proc. DATE. IEEE, 1604–1609. https://doi.org/10.23919/DATE.2018.8342271
  • Xu et al. (2017) Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. 2017. NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System. In Proc. SOSP. ACM, 478–496. https://doi.org/10.1145/3132747.3132761
  • Xu et al. (2018b) Ran Xu, Jinkyu Koo, Rakesh Kumar, Peter Bai, Subrata Mitra, Sasa Misailovic, and Saurabh Bagchi. 2018b. VideoChef: Efficient Approximation for Streaming Video Processing Pipelines. In Proc. ATC. USENIX Association, 43–56. https://www.usenix.org/conference/atc18/presentation/xu-ran
  • Yan et al. (2017) Eddie Q. Yan, Kaiyuan Zhang, Xi Wang, Karin Strauss, and Luis Ceze. 2017. Customizing Progressive JPEG for Efficient Image Storage. In Proc. HotStorage. https://www.usenix.org/conference/hotstorage17/program/presentation/yan
  • Yang et al. (2007) Byung-Do Yang, Jae-Eun Lee, Jang-Su Kim, Junghyun Cho, Seung-Yun Lee, and Byoung-Gon Yu. 2007. A Low Power Phase-Change Random Access Memory using a Data-Comparison Write Scheme. In Proc. ISCAS. IEEE, 3014–3017. https://doi.org/10.1109/ISCAS.2007.377981
  • Yazdanbakhsh et al. (2017) Amir Yazdanbakhsh, Divya Mahajan, Hadi Esmaeilzadeh, and Pejman Lotfi-Kamran. 2017. AxBench: A Multiplatform Benchmark Suite for Approximate Computing. IEEE Design & Test 34, 2 (2017), 60–68. https://doi.org/10.1109/MDAT.2016.2630270
  • Young et al. (2018) Vinson Young, Sanjay Kariyappa, and Moinuddin K. Qureshi. 2018. CRAM: Efficient Hardware-Based Memory Compression for Bandwidth Enhancement. CoRR abs/1807.07685 (2018). arXiv:1807.07685 http://arxiv.org/abs/1807.07685
  • Yue and Zhu (2013) Jianhui Yue and Yifeng Zhu. 2013. Accelerating Write by Exploiting PCM Asymmetries. In Proc. HPCA. IEEE Computer Society, 282–293. https://doi.org/10.1109/HPCA.2013.6522326
  • Zhao et al. (2017) Hengyu Zhao, Linuo Xue, Ping Chi, and Jishen Zhao. 2017. Approximate Image Storage with Multi-level Cell STT-MRAM Main Memory. In Proc. ICCAD. 268–275. https://doi.org/10.1109/ICCAD.2017.8203788
  • Zhou et al. (2009) Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. 2009. A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology. In Proc. ISCA. ACM, 14–23. https://doi.org/10.1145/1555754.1555759
  • Zuo et al. (2018) Pengfei Zuo, Yu Hua, and Jie Wu. 2018. Write-Optimized and High-Performance Hashing Index Scheme for Persistent Memory. In Proc. OSDI. USENIX Association, 461–476. https://www.usenix.org/conference/osdi18/presentation/zuo