Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

11/30/2016 ∙ by Vivek Seshadri, et al.

Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (less than 1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly-used bitwise operations, Buddy provides a 10.9X to 25.6X improvement in raw throughput and a 25.1X to 59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state-of-the-art.

1 Introduction

Bitwise operations are an important component of modern day programming [36, 67]. In this paper, we aim to improve the performance and efficiency of bitwise operations on large amounts of data, or bulk bitwise operations, which are triggered by many applications. For instance, in databases, bitmap indices [18, 51], which heavily use bitwise operations, were shown to be more efficient than B-trees for many queries [18, 5, 70]. Many real-world databases [52, 8, 5, 9] support bitmap indices. Again in databases, a recent work, BitWeaving [47], proposes a technique to accelerate database scans completely using bitwise operations. In bioinformatics, prior works have proposed techniques to exploit bitwise operations to accelerate DNA sequence alignment [15, 71]. Bitwise operations are also prevalent in encryption algorithms [66, 26, 50], graph processing [46], approximate statistics [17], and networking workloads [67]. Thus, accelerating such bulk bitwise operations can significantly boost the performance and efficiency of a number of important applications.

In existing systems, a bulk bitwise operation requires a large amount of data to be transferred back and forth on the memory channel between main memory and the processor. Such large data transfers result in high latency, bandwidth, and energy consumption. In fact, our experiments on an Intel Skylake [1] system and an NVIDIA GeForce GTX 745 [6] system show that the available memory bandwidth of these systems is the performance bottleneck that limits the throughput of bulk bitwise operations (Section 7).

In this paper, we propose a new mechanism to perform bulk bitwise operations completely inside main memory (DRAM), without wasting memory bandwidth, improving both performance and efficiency. We call our mechanism BUdDy-RAM (Bitwise operations Using Dynamic RAM) or just Buddy. Buddy consists of two parts, Buddy-AND/OR and Buddy-NOT. Both parts rely on the operation of the sense amplifier that is used to extract data from the DRAM cells.

Buddy-AND/OR exploits the fact that a DRAM chip consists of many subarrays [35, 19, 63]. In each subarray, many rows of DRAM cells (typically 512 or 1024) share a single row of sense amplifiers. We show that simultaneously activating three rows (rather than one) results in a bitwise majority function, i.e., in each column of cells, at least two cells have to be fully charged for the corresponding sense amplifier to detect a logical “1”. We refer to this operation as triple-row activation. We show that by controlling the initial value of one of the three rows, we can use the triple-row activation to perform a bitwise AND or OR operation of the remaining two rows. Buddy-NOT exploits the fact that each sense amplifier consists of two inverters. We use a dual-contact cell (a 2-transistor 1-capacitor cell [31]) that connects to both sides of the inverters to efficiently perform a bitwise NOT of the value of any cell connected to the sense amplifier. Our SPICE simulation results show that both Buddy-AND/OR and Buddy-NOT work reliably, even in the presence of significant process variation. Sections 3 and 4 present these two components in full detail.

Combining Buddy-AND/OR and Buddy-NOT, Buddy is functionally complete, and can perform any bitwise logical operation. Since DRAM internally operates at row granularity, both Buddy-AND/OR and Buddy-NOT naturally operate at the same granularity, i.e., an entire row of DRAM cells (multiple kilobytes across a module). As a result, Buddy can efficiently perform any multi-kilobyte-wide bitwise operation.

A naive implementation of Buddy would lead to high cost and complexity. For instance, supporting triple-row activation on any three arbitrary DRAM rows requires the replication of the address bus and the row decoder, as these structures have to communicate and decode three addresses simultaneously. In this work, we present a practical, low-cost implementation of Buddy, which heavily exploits the existing DRAM operation and command interface. First, our implementation allows the memory controller to perform triple-row activation only on a designated set of three rows (chosen at design time) in each DRAM subarray. To perform a Buddy AND/OR operation on two arbitrary DRAM rows, the data in those rows is first copied to the designated set of rows that are capable of triple-row activation, and the result of the operation is copied out of the designated rows to the destination. We exploit a recent work, RowClone [63], which enables very fast data copying between two DRAM rows, to perform the required copy operations efficiently. Second, we logically split the row decoder in each DRAM subarray into two parts, one part to handle activations related to only the designated rows, and another to handle activations of the regular data rows. Finally, we introduce a simple address mapping mechanism that avoids any changes to the DRAM command interface. In fact, our implementation introduces no new DRAM commands. Putting together all these techniques, we evaluate the area cost of our proposal to be equivalent to 10 DRAM rows per subarray, which amounts to less than 1% of DRAM area (as shown in Section 5.4).

By performing bulk bitwise operations completely in DRAM, Buddy significantly improves the performance and reduces the energy consumption of these operations. In fact, as Buddy does not require any data to be read out of the DRAM banks, each individual Buddy operation is contained entirely inside a DRAM bank. Since DRAM chips have multiple DRAM banks to overlap the latencies of different memory requests, we can perform multiple Buddy operations concurrently in different banks. As a result, the performance of Buddy scales linearly with the number of DRAM banks in the memory system, and is not limited by the off-chip pin bandwidth of the processor. We evaluate these benefits by comparing the raw throughput and energy of performing bulk bitwise operations using Buddy to an Intel Skylake [1] system and an NVIDIA GTX 745 [6] system. Our evaluations show that the bitwise operation throughput of both of these systems is limited by the off-chip memory bandwidth. Averaged across seven commonly-used bitwise operations, Buddy, even when using only a single DRAM bank, improves bitwise operation throughput by 3.8X—9.1X compared to the Skylake, and 2.7X—6.4X compared to the GTX 745. Buddy reduces DRAM energy consumption of these bitwise operations by 25.1X—59.5X (Section 7). In addition to these benefits, Buddy frees up significant processing capacity, cache resources, and memory bandwidth for other co-running applications, thereby reducing both computation-unit and memory interference caused by bulk bitwise operations and thus enabling better overall system performance.

Figure 2: DRAM cell and sense amplifier
Figure 3: State transitions involved in DRAM cell activation

We evaluate three data-intensive applications to demonstrate Buddy’s benefits in comparison to a state-of-the-art baseline processor that performs bitwise operations using SIMD extensions. First, we show that Buddy improves end-to-end performance of queries performed on database bitmap indices by 6.0X, on average across a range of query parameters. Second, Buddy improves the performance of BitWeaving [47], a recently proposed technique to accelerate column scan operations in databases, by 7.0X, on average across a range of scan parameters. Third, for the commonly-used set data structure, Buddy improves performance of set intersection, union, and difference operations by 3.0X compared to conventional implementations [24]. Section 8 describes our simulation framework [16], workloads, results, and four other potential use cases for Buddy: bulk masked initialization, encryption algorithms, DNA sequence mapping, and approximate statistics.

We make the following contributions.

  • To our knowledge, this is the first work that proposes a low-cost mechanism to perform bulk bitwise operations completely within a DRAM chip. We introduce Buddy, a mechanism that exploits the analog operation of DRAM to perform any row-wide bitwise operation efficiently using DRAM. (Footnote 1: The novelty of our approach is confirmed by at least one cutting-edge DRAM design team in industry [33]. The closest work we know of is a set of patents by Mikamonu [14]. The mechanisms presented in these patents are significantly different from, and costlier than, our approach. We discuss this work in Section 9.)

  • We present a low-cost implementation of Buddy, which requires modest changes to the DRAM architecture. We verify our implementation of Buddy with rigorous SPICE simulations. The cost of our implementation is less than 1% of the DRAM chip area, and our implementation requires no new DRAM commands. (Section 5)

  • We evaluate the benefits of Buddy on both 1) raw throughput/energy of seven commonly-used bulk bitwise operations and 2) three data-intensive real workloads that make heavy use of such operations. Our extensive results show that Buddy significantly outperforms the state-of-the-art approach of performing such operations in the SIMD units of a CPU or in the execution units of a GPU. We show that the large improvements in raw throughput of bitwise operations translate into large improvements (3.0X-7.0X) in the performance of three data-intensive workloads. (Section 8)

2 Background on DRAM Operation

DRAM-based memory consists of a hierarchy of structures with channels, modules, and ranks at the top. Each rank consists of a number of chips that operate in unison. Hence, a rank can be logically viewed as a single wide DRAM chip. Each rank is further divided into many banks. All access-related commands are directed towards a specific bank. Each bank consists of several subarrays and peripheral logic to process commands [35, 19, 75, 63]. Each subarray consists of many rows (typically 512/1024) of DRAM cells, a row of sense amplifiers, and a row address decoder. Figure 1 shows the logical organization of a subarray.

Figure 1: DRAM subarray

At a high level, accessing data from a subarray involves three steps. The first step, called row activation, copies data from a specified row of DRAM cells to the corresponding row of sense amplifiers. This step is triggered by the ACTIVATE command. Then, data is accessed from the sense amplifiers using a READ or WRITE command. Each READ or WRITE accesses only a subset of sense amplifiers. Once a row is activated, multiple READ and WRITE commands can be issued to that row. The bank is then prepared for a new row access by performing an operation called precharging. This step is triggered by a PRECHARGE command. We now explain these operations in detail by focusing our attention on a single DRAM cell and a sense amplifier.
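As a rough illustration of this three-step sequence, the following Python sketch models a single bank as a small state machine. The class and method names (`Bank`, `activate`, `read`, `write`, `precharge`) are hypothetical, and the model covers only the conventional command ordering described above; it ignores timing, refresh, and the analog behavior explained in the following subsections.

```python
class Bank:
    """Toy model of one DRAM bank: rows of cells plus one row of sense amplifiers."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]  # cell contents, one list per row
        self.open_row = None                 # None means the bank is precharged
        self.sense_amps = None               # data latched by the last ACTIVATE

    def activate(self, row):
        # ACTIVATE: copy a whole row of cells into the sense amplifiers.
        if self.open_row is not None:
            raise RuntimeError("bank must be precharged before ACTIVATE")
        self.open_row = row
        self.sense_amps = list(self.rows[row])

    def read(self, column):
        # READ: access a subset of the sense amplifiers (here, one column).
        self._require_open_row()
        return self.sense_amps[column]

    def write(self, column, value):
        # WRITE: update the sense amplifier; the connected cell follows it.
        self._require_open_row()
        self.sense_amps[column] = value
        self.rows[self.open_row][column] = value

    def precharge(self):
        # PRECHARGE: close the row and prepare the bank for the next ACTIVATE.
        self.open_row = None
        self.sense_amps = None

    def _require_open_row(self):
        if self.open_row is None:
            raise RuntimeError("no activated row; issue ACTIVATE first")


bank = Bank([[0, 1, 1, 0], [1, 1, 0, 0]])
bank.activate(1)
first = bank.read(0)   # several READ/WRITE commands may target the open row
bank.write(3, 1)
bank.precharge()
```

The check in `activate` reflects the conventional access sequence only; it is a modeling choice for this sketch, not a statement about what the DRAM chip enforces.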

2.1 DRAM Cell and Sense Amplifier

Figure 2 shows the connection between a DRAM cell and a sense amplifier. Each DRAM cell consists of 1) a capacitor, and 2) an access transistor that controls access to the cell. Each sense amplifier consists of two inverters, and an enable signal. The output of each inverter is connected to the input of the other inverter. The wire that connects the cell to the sense amplifier is called the bitline, and the wire that controls the access transistor is called the wordline. We refer to the wire on the other end of the sense amplifier as $\overline{\text{bitline}}$ (“bitline bar”).

2.2 DRAM Cell Operation

Figure 3 shows the state transitions involved in extracting the state of the DRAM cell. In this figure, we assume that the cell is initially in the charged state. The operation is similar if the cell is initially empty. In the initial precharged state ➊, both the bitline and $\overline{\text{bitline}}$ are maintained at a voltage level of $\frac{1}{2}V_{DD}$. The sense amplifier and the wordline are disabled.

The ACTIVATE command triggers an access to the cell. Upon receiving the ACTIVATE, the wordline of the cell is raised ➋, connecting the cell to the bitline. Since the capacitor is fully charged, and therefore at a higher voltage level than the bitline, charge flows from the capacitor to the bitline until both the capacitor and the bitline reach the same voltage level $\frac{1}{2}V_{DD} + \delta$ ➌. This phase is called charge sharing. After the charge sharing is complete, the sense amplifier is enabled ➍. The sense amplifier detects the deviation in the voltage level of the bitline (by comparing it with the voltage level on the $\overline{\text{bitline}}$). The sense amplifier then amplifies the deviation to the stable state where the bitline is at the voltage level of $V_{DD}$ (and the $\overline{\text{bitline}}$ is at 0). Since the capacitor is still connected to the bitline, the capacitor also gets fully charged ➎. If the capacitor was initially empty, then the deviation on the bitline would be negative (towards 0), and the sense amplifier would drive the bitline to 0. Each ACTIVATE command operates on an entire row of cells (typically 8 KB of data across a rank).

After the cell is activated, data can be accessed from the bitline by issuing a READ or WRITE to the column containing the cell (not shown in Figure 3). Once the data is accessed, the subarray needs to be taken back to the initial precharged state ➊. This is done by issuing a PRECHARGE command. Upon receiving this command, DRAM first lowers the raised wordline, thereby disconnecting the capacitor from the bitline. After this, the sense amplifier is disabled, and both the bitline and the $\overline{\text{bitline}}$ are driven to the voltage level of $\frac{1}{2}V_{DD}$.

Our mechanism, Buddy, exploits several aspects of modern DRAM operation to efficiently perform bitwise operations completely inside DRAM with low cost. In the following sections, we describe the two components of Buddy in detail.

3 Buddy AND/OR

The first component of Buddy, which performs bitwise AND and OR operations, exploits two facts about DRAM operation.

  1. Within a DRAM subarray, each sense amplifier is shared by many DRAM cells (typically 512 or 1024).

  2. The final state of the bitline after sense amplification depends primarily on the voltage deviation on the bitline after the charge sharing phase.

Based on these facts, we observe that simultaneously activating three cells, rather than a single cell, results in a majority function—i.e., at least two cells have to be fully charged for the final state to be a logical “1”. We refer to such a simultaneous activation of three cells (or rows) as triple-row activation. We now conceptually describe triple-row activation and how we use it to perform bitwise AND and OR operations.

3.1 Triple-Row Activation (TRA)

A triple-row activation (TRA) simultaneously connects each sense amplifier with three DRAM cells. For ease of conceptual understanding, let us assume that all cells have the same capacitance, the transistors and bitlines behave ideally (no resistance), and the cells start at a fully refreshed state. Then, based on charge sharing principles [34], the bitline deviation at the end of the charge sharing phase of the TRA is,

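$$\delta = \frac{k \cdot C_c \cdot V_{DD} + C_b \cdot \frac{1}{2} V_{DD}}{3C_c + C_b} - \frac{1}{2} V_{DD} = \frac{(2k - 3)\, C_c}{6C_c + 2C_b}\, V_{DD} \qquad (1)$$

where $\delta$ is the bitline deviation, $C_c$ is the cell capacitance, $C_b$ is the bitline capacitance, and $k$ is the number of cells in the fully charged state. It is clear that $\delta > 0$ if and only if $2k - 3 > 0$. In other words, the bitline deviation will be positive if $k = 2, 3$ and it will be negative if $k = 0, 1$. Therefore, we expect the final state of the bitline to be $V_{DD}$ if at least two of the three cells are initially fully charged, and the final state to be 0 if at least two of the three cells are initially fully empty.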
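To make the sign argument concrete, the sketch below evaluates the charge-sharing expression for each possible number of charged cells, $k = 0, \dots, 3$, under the same idealized assumptions as Equation 1. The cell capacitance (22 fF) is the value quoted in Section 3.3; the bitline capacitance and supply voltage are illustrative assumptions that the text does not specify, and the function name `bitline_deviation` is hypothetical.

```python
def bitline_deviation(k, c_cell, c_bitline, v_dd):
    """Ideal bitline deviation after charge sharing with k of 3 cells charged."""
    if k not in (0, 1, 2, 3):
        raise ValueError("k must be between 0 and 3")
    # Total charge on the three cells and the precharged bitline, divided by
    # the total capacitance, gives the shared voltage.
    shared = (k * c_cell * v_dd + c_bitline * v_dd / 2) / (3 * c_cell + c_bitline)
    return shared - v_dd / 2


C_CELL = 22e-15      # cell capacitance from Section 3.3
C_BITLINE = 77e-15   # ASSUMED bitline capacitance, for illustration only
V_DD = 1.5           # ASSUMED supply voltage (nominal DDR3 value)

for k in range(4):
    delta = bitline_deviation(k, C_CELL, C_BITLINE, V_DD)
    sensed = 1 if delta > 0 else 0
    print(f"k={k}: delta = {delta * 1e3:+.1f} mV -> sensed as {sensed}")
```

Whatever positive capacitances are chosen, the sign of the deviation depends only on whether $k \geq 2$, which is why the outcome is a majority vote.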

Figure 4 shows an example of TRA where two of the three cells are initially in the charged state ➊. When the wordlines of all the three cells are raised simultaneously ➋, charge sharing results in a positive deviation on the bitline. Therefore, after sense amplification ➌, the sense amplifier drives the bitline to $V_{DD}$, and as a result, fully charges all the three cells.

Figure 4: Triple-row activation

If $A$, $B$, and $C$ represent the logical values of the three cells, then the final state of the bitline is $AB + BC + CA$ (the majority function). Importantly, we can rewrite this expression as $C(A + B) + \overline{C}(AB)$. In other words, by controlling the value of the cell $C$, we can use TRA to execute a bitwise AND or bitwise OR of the cells $A$ and $B$: setting $C = 0$ yields $AB$, and setting $C = 1$ yields $A + B$. Due to the regular bulk operation of cells in DRAM, this approach naturally extends to an entire row of DRAM cells and sense amplifiers, enabling a multi-kilobyte-wide bitwise AND/OR operation.
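The identity behind this observation is easy to check exhaustively. The following sketch is purely logical (no DRAM modeling): it verifies that the three-input majority function equals $C(A + B) + \overline{C}(AB)$ for all inputs, and then applies the same function to whole rows represented as Python integers. The function name `tra` and the 8-bit row width are assumptions made for illustration.

```python
from itertools import product


def tra(row_a, row_b, row_c):
    """Bitwise majority of three rows, the logical effect of a triple-row activation."""
    return (row_a & row_b) | (row_b & row_c) | (row_c & row_a)


# Check the rewritten form C(A + B) + (not C)(AB) on all eight 1-bit inputs.
for a, b, c in product((0, 1), repeat=3):
    assert tra(a, b, c) == (c & (a | b)) | ((1 - c) & (a & b))

ROW_BITS = 8                     # illustrative row width
ALL_ONES = (1 << ROW_BITS) - 1

a, b = 0b11001010, 0b10100110
and_result = tra(a, b, 0)          # control row of all zeros selects AND
or_result = tra(a, b, ALL_ONES)    # control row of all ones selects OR
```

Because the majority is computed independently per bit position, the same control-row trick works for a row of any width.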

3.2 Making TRA Work

There are five issues with TRA that we need to resolve for it to be implementable in a real design.

  1. The deviation on the bitline with three cells may not be large enough to be detected by the sense amplifier or it could lengthen the sense amplification process.

  2. Equation 1 assumes that all cells have the same capacitance. However, due to process variation, different cells will have different capacitance. This can affect the reliability of TRA.

  3. As shown in Figure 4 (state ➌), the TRA overwrites the data of all the three cells with the final value. In other words, TRA modifies all source data.

  4. Equation 1 assumes that the cells involved in a TRA are either fully charged or fully empty. However, DRAM cells leak charge over time. Therefore, TRA may not operate as expected, if the cells involved had leaked charge.

  5. Simultaneously activating three arbitrary rows inside a DRAM subarray requires the memory controller to communicate three row addresses to the DRAM module and the subarray row decoder to simultaneously decode three row addresses. This introduces an enormous cost on the address bus and the row decoder.

We address the first two issues by performing rigorous and detailed SPICE simulations of TRA. The results of these confirm that TRA works as expected even with process variation. Section 3.3 presents these results. We propose a simple implementation of Buddy AND/OR that addresses all of the last three issues. We describe this implementation in Section 3.4. Section 5 describes the final implementation of Buddy.

Figure 5: A dual-contact cell connected to both ends of a sense amplifier
Figure 6: Bitwise NOT using a dual contact capacitor

3.3 SPICE Simulations

We perform SPICE simulations to confirm the reliability of TRA. We implement the DRAM sense amplifier circuit using 55nm DDR3 model parameters [57], and PTM low-power transistor models [56, 76]. We use cell/transistor parameters from the Rambus power model [57] (cell capacitance = 22fF, transistor width = 55nm, transistor height = 85nm).

The DRAM specification is designed for the worst case conditions when a cell has not been accessed for a long time (refresh interval). Therefore, we can activate a fully refreshed cell with a latency much lower than the standard activation latency, 35ns [27, 65]. In fact, we observe that we can activate a fully charged/empty cell in 20.9 ns/13.5 ns.

For TRA, there are four possible cases depending on the number of cells that are initially fully charged. For these cases, we add different levels of process variation among cells, so that the strong cell attempts to override the majority decision of the two weak cells. Table 1 shows the latency of TRA for the four possible cases with different levels of process variation, where s and w subscripts stand for strong and weak, respectively, for the cells that store either 0 or 1.

Case \ Variation   0%     5%     10%    15%    20%    25%
$0_s0_w0_w$        16.4   16.3   16.3   16.4   16.3   16.2
$1_s0_w0_w$        18.3   18.6   18.8   19.1   19.7   Fail
$0_s1_w1_w$        24.9   25.0   25.2   25.3   25.4   25.7
$1_s1_w1_w$        22.5   22.3   22.2   22.2   22.2   22.1
Table 1: Effect of process variation on the latency of triple-row activation. All times are in ns.

We draw three conclusions. First, for the cases where all three cells are either 0 or 1, the latency of TRA is stable at around 16 ns and 22 ns respectively, even in the presence of process variation. This is because, in these two cases, all three cells push the bitline in the same direction. Second, for the other two cases, while the latency of TRA increases with increasing process variation, it is still well within the DRAM specification even with 20% process variation (i.e., a 40% difference in cell capacitance). Third, we observe the first failure at 25% for the $1_s0_w0_w$ case. In this case, the sense amplifier operates incorrectly by detecting a “1” instead of “0”. In summary, our SPICE simulations show that TRA works as expected even in the presence of significant process variation.

Prior works [41, 48, 73, 59] show that temperature increases DRAM cell leakage. However, as we show in the next section, our mechanism always ensures that the cells involved in the TRA are fully refreshed just before the operation. Therefore, we do not expect temperature to affect the reliability of TRA.

3.4 Implementation of Buddy AND/OR

To avoid modification of the source data (issue 3), our implementation reserves a set of designated rows in each subarray that will be used to perform TRAs. These designated rows are chosen statically at design time. To perform a bitwise AND or OR operation on two arbitrary source rows, our mechanism first copies the data of the source rows into the designated rows and performs the required TRA on the designated rows. Our final implementation (Section 5) reserves four designated rows in each subarray (T0 through T3). As an example, to perform a bitwise AND of two rows A and B, and store the result in row R, our mechanism performs the following steps.

  1. Copy data of row A to row T0

  2. Copy data of row B to row T1

  3. Initialize row T2 to 0

  4. Activate rows T0, T1, and T2 simultaneously

  5. Copy data of row T0 to row R

This implementation allows us to address the last three issues described in Section 3.2. First, by not performing the TRA directly on the source data, our mechanism trivially avoids modification of the source data (issue 3). Second, each copy operation takes five orders of magnitude lower latency (1 µs) than the refresh interval (64 ms). Since these copy operations are performed just before the TRA, the rows involved in the TRA are very close to the fully refreshed state just before the operation (addressing issue 4). Finally, since the designated rows are chosen at design time, the memory controller can use a reserved address to communicate TRA of a pre-defined set of three designated rows. To this end, our mechanism reserves a set of row addresses just to control the designated rows. While some of these addresses perform single row activation of the designated rows (necessary for the copy operations), others trigger TRAs of pre-defined sets of designated rows. For instance, in our final implementation (Section 5), to perform a TRA of designated rows T0, T1, and T2 (step 4, above), the memory controller simply issues an ACTIVATE with the reserved address B12. The row decoder maps B12 to all the three wordlines of rows T0, T1, and T2. This mechanism requires no changes to the address bus and significantly reduces the cost and complexity of the row decoder compared to performing TRA on three arbitrary rows (addressing issue 5).
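The copy, initialize, activate, and copy-out sequence of this section can be sketched at a purely logical level as follows. Rows are modeled as Python integers in a dictionary; `copy_row` stands in for an in-DRAM copy, and all function names, the 16-bit row width, and the string option `op` are assumptions for illustration rather than part of the proposed interface. The OR variant uses an all-ones control row, following the majority-function argument in Section 3.1.

```python
ROW_BITS = 16                    # illustrative row width
ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1


def copy_row(rows, src, dst):
    # Stand-in for an in-DRAM row copy (no data leaves the subarray).
    rows[dst] = rows[src]


def triple_row_activate(rows, r0, r1, r2):
    # Bitwise majority of three rows; all three rows end up holding the result.
    a, b, c = rows[r0], rows[r1], rows[r2]
    result = (a & b) | (b & c) | (c & a)
    rows[r0] = rows[r1] = rows[r2] = result


def buddy_and_or(rows, src_a, src_b, dst, op):
    control = {"and": ALL_ZEROS, "or": ALL_ONES}[op]
    copy_row(rows, src_a, "T0")                   # copy the first source into T0
    copy_row(rows, src_b, "T1")                   # copy the second source into T1
    rows["T2"] = control                          # initialize the control row
    triple_row_activate(rows, "T0", "T1", "T2")   # TRA on the designated rows
    copy_row(rows, "T0", dst)                     # copy the result to the destination


rows = {"A": 0b1100110011110000, "B": 0b1010101000111100, "R": 0}
buddy_and_or(rows, "A", "B", "R", "and")
and_result = rows["R"]
buddy_and_or(rows, "A", "B", "R", "or")
or_result = rows["R"]
```

Only the designated rows are overwritten by the activation, so rows A and B keep their original contents, which is the point of issue 3.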

3.5 Mitigating the Overhead of Copy Operations

Our mechanism needs a set of copy and initialization operations to copy the source data into the designated rows and copy the result back to the destination. These copy operations, if performed naively, will nullify the benefits of our mechanism. Fortunately, a recent work, RowClone [63], has proposed two techniques to copy data between two rows quickly and efficiently within DRAM. The first technique, RowClone-FPM (Fast Parallel Mode), copies data within a subarray by issuing two back-to-back ACTIVATEs to the source row and the destination row. The second technique, RowClone-PSM (Pipelined Serial Mode), copies data between two banks by using the shared internal bus to overlap the read to the source bank with the write to the destination bank.

With RowClone, we can perform all the copy operations and the initialization operation efficiently within DRAM. To use RowClone for the initialization operation, we reserve two additional control rows, C0 and C1. C0 is pre-initialized to 0 and C1 is pre-initialized to 1. Depending on the operation to be performed, our mechanism uses RowClone to copy either C0 or C1 to the appropriate designated row.

In the best case, when all the three rows involved in the operation are in the same subarray, our mechanism uses RowClone-FPM for all copy and initialization operations. However, if the three rows are in different subarrays, some of the three copy operations have to use RowClone-PSM. In the worst case, when all three copy operations have to use RowClone-PSM, our approach would incur higher latency than the baseline. However, when only one or two RowClone-PSM operations are required, our mechanism is faster and more energy-efficient than existing systems.

4 Buddy NOT

Buddy NOT exploits the fact that, at the end of the sense amplification process, the voltage level of the $\overline{\text{bitline}}$ contains the negation of the logical value of the cell. Our key idea to perform bitwise NOT in DRAM is to transfer the data on the $\overline{\text{bitline}}$ to a cell that can be connected to the bitline. For this purpose, we introduce the dual-contact cell. A dual-contact cell (DCC) is a DRAM cell with two transistors (a 2T-1C cell similar to the one described in [31]). For each DCC, one transistor connects the DCC to the bitline and the other transistor connects the DCC to the $\overline{\text{bitline}}$. We refer to the wordline that controls the connection between the DCC and the bitline as the d-wordline (or data wordline). We refer to the wordline that controls the connection between the DCC and the $\overline{\text{bitline}}$ as the n-wordline (or negation wordline). Figure 5 shows a DCC connected to a sense amplifier.

Figure 6 shows the steps involved in transferring the negation of a source cell on to the DCC connected to the same sense amplifier ➊. Our mechanism first activates the source cell ➋. The activation process drives the bitline to the data corresponding to the source cell, $V_{DD}$ in this case ➌. More importantly, for the purpose of our mechanism, it drives the $\overline{\text{bitline}}$ to 0. In this state, our mechanism activates the n-wordline, enabling the transistor that connects the DCC to the $\overline{\text{bitline}}$ ➍. Since the $\overline{\text{bitline}}$ is already at a stable voltage level of 0, it overwrites the value in the DCC with 0, essentially copying the negation of the source data into the DCC. After this step, we can efficiently copy the negated data into the destination cell using RowClone.

SPICE Simulations. We perform detailed SPICE simulation of the dual-contact cell with the same cell and transistor parameters described in Section 3.3. Our simulation results confirm that the DCC operation described in Figure 6 works as expected. We do not present details due to lack of space.

Implementation of Buddy NOT. Our implementation adds two rows of DCCs to each subarray, one on each side of the row of sense amplifiers. Similar to the designated rows used for Buddy AND/OR (Section 3.4), the memory controller uses reserved row addresses to control the d-wordlines and n-wordlines of the DCC rows—e.g., in our final implementation (Section 5), address B5 maps to the n-wordline of the first DCC row. To perform a bitwise NOT of row A and store the result in row R, the memory controller performs the following steps.

  1. Activate row A

  2. Activate n-wordline of DCC (address B5)

  3. Precharge the bank.

  4. Copy data from d-wordline of DCC to row R

Similar to the copy operations in Buddy AND/OR (Section 3.5), the copy operation in Step 4 above can be efficiently performed using RowClone.

5 Buddy: Putting It All Together

In this section, we describe how we integrate Buddy AND/OR and Buddy NOT into a single mechanism that can perform any bitwise operation efficiently inside DRAM. First, both Buddy AND/OR and Buddy NOT reserve a set of rows in each subarray and a set of addresses that map to these rows. We present the final set of reserved addresses and their mapping in detail (Section 5.1). Second, we introduce a new primitive called AAP (ACTIVATE-ACTIVATE-PRECHARGE) that the memory controller uses to execute various bitwise operations (Section 5.2). Finally, we describe an optimization that lowers the latency of the AAP primitive, further improving the performance of Buddy (Section 5.3).

5.1 Row Address Grouping

Our implementation divides the space of row addresses in each subarray into three distinct groups (Figure 7): 1) bitwise group, 2) control group, and 3) data group.

Figure 7: Row address grouping. The figure shows how the B-group row decoder (Section 5.3) simultaneously activates rows T0, T1, and T2 with a single address B12.

The B-group (or the bitwise group) corresponds to the addresses used to perform the bitwise operations. This group contains eight physical wordlines: four corresponding to the designated rows (T0 through T3) used to perform triple-row activations (Section 3.4) and the remaining four corresponding to the d- and n-wordlines that control the two rows of dual-contact cells (Section 4). We refer to the d-wordlines of the two rows as DCC0 and DCC1, and the corresponding n-wordlines as $\overline{\text{DCC0}}$ and $\overline{\text{DCC1}}$. The B-group contains 16 reserved addresses: B0 through B15. Table 2 lists the mapping between the 16 addresses and the wordlines. The first eight addresses individually activate each of the 8 wordlines in the group. Addresses B12 through B15 activate three wordlines simultaneously. Buddy uses these addresses to trigger triple-row activations. Finally, addresses B8 through B11 activate two wordlines. As we will show in the next section, Buddy uses these addresses to copy the result of an operation simultaneously to two rows (e.g., zero out two rows simultaneously). (Footnote 2: An implementation can reserve more rows for the B-group. While this will reduce the number of rows available to store application data, it can potentially reduce the number of copy operations required to implement different sequences of bitwise operations.)

Addr.   Wordline(s)
B0      T0
B1      T1
B2      T2
B3      T3
B4      DCC0
B5      $\overline{\text{DCC0}}$
B6      DCC1
B7      $\overline{\text{DCC1}}$
B8      $\overline{\text{DCC0}}$, T0
B9      $\overline{\text{DCC1}}$, T1
B10     T2, T3
B11     T0, T3
B12     T0, T1, T2
B13     T1, T2, T3
B14     DCC0, T1, T2
B15     DCC1, T0, T3
Table 2: Mapping of B-group addresses

The C-group (or the control group) contains the two pre-initialized rows for controlling the bitwise AND/OR operations (Section 3.5). Specifically, this group contains two addresses: C0 (all zeros) and C1 (all ones).

The D-group (or the data group) corresponds to the rows that store regular data. This group contains all the addresses that are neither in the B-group nor in the C-group. Specifically, if each subarray contains 1024 rows, then the D-group contains 1006 addresses, labeled D0–D1005. Buddy exposes only the D-group addresses to the operating system (OS). To ensure that the OS has a contiguous view of memory, the memory controller interleaves the row addresses of subarrays such that the D-group addresses across all subarrays are mapped contiguously to the physical address space.

With these address groups, the memory controller can use the existing command interface to communicate all variants of ACTIVATE to the DRAM chips. Depending on the address group, the DRAM chips internally process the ACTIVATE appropriately. For instance, by just issuing an ACTIVATE to address B12, the memory controller can simultaneously activate rows T0, T1, and T2. We will now describe how the memory controller uses this interface to express bitwise operations.

5.2 Executing Bitwise Ops: The AAP Primitive

Let us consider the operation, Dk = not Di. To perform this bitwise NOT operation, the memory controller sends the following sequence of commands.

1. ACTIVATE Di; 2. ACTIVATE B5; 3. PRECHARGE;
4. ACTIVATE B4; 5. ACTIVATE Dk; 6. PRECHARGE;

The first three steps are the same as those described in Section 4. These operations essentially copy the negation of row Di into the DCC row 0 (as described in Figure 6). Step 4 activates the d-wordline of the DCC row, transferring the negation of the source data on to the bitlines. Finally, Step 5 activates the destination row, copying the data on the bitlines, i.e., the negation of the source data, to the destination row.

If we observe the negation operation, it consists of two steps of ACTIVATE-ACTIVATE-PRECHARGE operations. We refer to this sequence as the AAP primitive. Each AAP takes two addresses as input. AAP (addr1, addr2) corresponds to the following sequence of commands: ACTIVATE addr1; ACTIVATE addr2; PRECHARGE; Logically, an AAP operation copies the result of activating the first address (addr1) to the row mapped to the second address (addr2).

We observe that most bitwise operations mainly involve a sequence of AAP operations. In a few cases, they require a regular ACTIVATE followed by a PRECHARGE, which we refer to as AP. AP takes one address as input. AP (addr) corresponds to the following commands: ACTIVATE addr; PRECHARGE; Figure 8 shows the sequence of steps taken by the memory controller to execute three bitwise operations: and, nand, and xor.
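The two primitives can be sketched as small helpers that expand into the DRAM commands listed above. This is only an illustration of the command expansion; the tuple representation and the function names (`aap`, `ap`, `bitwise_not`) are assumptions made for this sketch.

```python
# Each command is modeled as a (name, address) tuple; PRECHARGE has no address.
def aap(addr1, addr2):
    """ACTIVATE addr1; ACTIVATE addr2; PRECHARGE."""
    return [("ACTIVATE", addr1), ("ACTIVATE", addr2), ("PRECHARGE", None)]


def ap(addr):
    """ACTIVATE addr; PRECHARGE."""
    return [("ACTIVATE", addr), ("PRECHARGE", None)]


def bitwise_not(di, dk):
    """Dk = not Di, expressed as the two AAPs from the six-command sequence."""
    return aap(di, "B5") + aap("B4", dk)


for command in bitwise_not("Di", "Dk"):
    print(command)
```

Expanding `bitwise_not("Di", "Dk")` reproduces the six commands given earlier for the NOT operation, in the same order.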

Figure 8: Command sequences for different bitwise operations

Let us consider the and operation, Dk = Di and Dj. The four AAP operations directly map to the steps described in Section 3.4. The first AAP copies the first source row (Di) into the designated row T0. Similarly, the second AAP copies the second source row Dj to row T1, and the third AAP copies the control row “0” to row T2 (to perform a bitwise AND). Finally, the last AAP first issues an ACTIVATE to address B12. As described in Table 2, this command simultaneously activates the rows T0, T1, and T2, resulting in an and operation of the values of rows T0 and T1. This command is immediately followed by an ACTIVATE to Dk, which in effect copies the result of the and operation to the destination row Dk.
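To make the data flow of these sequences concrete, the sketch below is a purely behavioral model of one subarray. It assumes that a triple-row activation leaves the bitwise majority of the three rows in all three of them, that the second ACTIVATE of an AAP overwrites its row(s) with the sensed value, and that an n-wordline stores or reads the negated value. Rows are plain Python integers, only the B-group addresses needed here are modeled, and all class and function names (as well as the `nDCC0`/`nDCC1` labels) are hypothetical.

```python
# Behavioral model only: no timing, no analog effects, 8-bit rows by default.
N_WORDLINES = {"nDCC0": "DCC0", "nDCC1": "DCC1"}  # n-wordline -> DCC row it controls
B_GROUP = {
    "B0": ("T0",), "B1": ("T1",), "B2": ("T2",), "B3": ("T3",),
    "B4": ("DCC0",), "B5": ("nDCC0",), "B6": ("DCC1",), "B7": ("nDCC1",),
    "B12": ("T0", "T1", "T2"), "B13": ("T1", "T2", "T3"),
}


class Subarray:
    def __init__(self, row_bits=8):
        self.mask = (1 << row_bits) - 1
        self.rows = {"C0": 0, "C1": self.mask}  # control rows: all zeros, all ones
        self.sense = None                        # None means the bank is precharged

    def _read(self, wl):
        if wl in N_WORDLINES:                    # n-wordline exposes the negated cell
            return ~self.rows.get(N_WORDLINES[wl], 0) & self.mask
        return self.rows.get(wl, 0)

    def _write(self, wl, value):
        if wl in N_WORDLINES:                    # n-wordline stores the negated value
            self.rows[N_WORDLINES[wl]] = ~value & self.mask
        else:
            self.rows[wl] = value

    def activate(self, addr):
        wls = B_GROUP.get(addr, (addr,))
        if self.sense is None:                   # first ACTIVATE: cells drive the bitlines
            vals = [self._read(w) for w in wls]
            if len(vals) == 1:
                self.sense = vals[0]
            elif len(vals) == 3:                 # triple-row activation: bitwise majority
                a, b, c = vals
                self.sense = (a & b) | (b & c) | (a & c)
            else:
                raise ValueError("first activation not modeled for " + addr)
        for w in wls:                            # every raised row takes the sensed value
            self._write(w, self.sense)

    def precharge(self):
        self.sense = None

    def aap(self, addr1, addr2):
        self.activate(addr1)
        self.activate(addr2)
        self.precharge()


def buddy_and(sa, di, dj, dk):
    sa.aap(di, "B0"); sa.aap(dj, "B1"); sa.aap("C0", "B2"); sa.aap("B12", dk)


def buddy_or(sa, di, dj, dk):
    sa.aap(di, "B0"); sa.aap(dj, "B1"); sa.aap("C1", "B2"); sa.aap("B12", dk)


def buddy_not(sa, di, dk):
    sa.aap(di, "B5"); sa.aap("B4", dk)


sa = Subarray()
sa.rows["D0"] = 0b11001010
sa.rows["D1"] = 0b10100110
buddy_and(sa, "D0", "D1", "D2")
buddy_or(sa, "D0", "D1", "D3")
buddy_not(sa, "D0", "D4")
print(format(sa.rows["D2"], "08b"))  # 10000010
print(format(sa.rows["D3"], "08b"))  # 11101110
print(format(sa.rows["D4"], "08b"))  # 00110101
```

The model shows why the control row selects the operation: the majority of two values and an all-zeros row is their AND, while the majority with an all-ones row is their OR.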

While an individual bitwise operation involves multiple copy operations, this overhead of copying can be reduced by applying standard compiler techniques. For instance, accumulation-like operations generate intermediate results that are immediately consumed. An optimization like dead-store elimination may prevent these values from being needlessly copied. Our evaluations (Section 8) consider the overhead of the copy operations without these optimizations.

5.3 Accelerating AAP with a Split Row Decoder

It is clear from Figure 8 that the latency of executing any bitwise operation using Buddy depends on the latency of the AAP primitive. The latency of the AAP in turn depends on the latency of ACTIVATE, i.e., tRAS, and the latency of PRECHARGE, i.e., tRP. The naive approach to execute an AAP is to perform the three operations serially. Using this approach, the latency of AAP is 2tRAS + tRP. While even this naive approach can offer better throughput and energy efficiency than existing systems (not shown due to space limitations), we propose an optimization that significantly reduces the latency of AAP.

Our optimization is based on the following two observations. First, the second ACTIVATE of an AAP is issued to an already activated bank. As a result, this ACTIVATE does not require full sense amplification, which is the dominant portion of tRAS. Second, if we observe all the bitwise operations in Figure 8, with the exception of one AAP in nand, exactly one of the two ACTIVATEs in each AAP is to a B-group address.

To exploit these observations, our mechanism splits the row decoder into two parts. The first part decodes all C/D-group addresses and the second smaller part decodes only B-group addresses. Such a split allows the subarray to simultaneously decode a C/D-group address along with a B-group address. With this setup, if the memory controller issues the second ACTIVATE of an AAP after the first activation has sufficiently progressed, the sense amplifier will force the data of the second row(s) to the result of the first activation. This mechanism allows the memory controller to significantly overlap the latency of the two ACTIVATEs. This approach is similar to the inter-segment copy operation used by Tiered-Latency DRAM [43]. Based on SPICE simulations, our estimate of the latency of executing both the ACTIVATEs is 4 ns larger than tRAS. For DDR3-1600 (8-8-8) timing parameters [30], this optimization reduces the latency of AAP from 80 ns to 49 ns.
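The two latency figures can be reproduced with a short calculation. The sketch below assumes an ACTIVATE latency of 35 ns and a PRECHARGE latency of 10 ns for DDR3-1600 (8-8-8); these two values are not stated in the text and are chosen here because they are consistent with the quoted 80 ns and 49 ns. The function names are hypothetical.

```python
# Assumed DDR3-1600 (8-8-8) timings in nanoseconds (not given in the text).
T_ACTIVATE_NS = 35.0
T_PRECHARGE_NS = 10.0
SECOND_ACTIVATE_EXTRA_NS = 4.0  # overlap penalty estimated from SPICE in the text


def aap_latency_naive(t_act=T_ACTIVATE_NS, t_pre=T_PRECHARGE_NS):
    """Two full ACTIVATEs followed by a PRECHARGE, executed serially."""
    return 2 * t_act + t_pre


def aap_latency_split_decoder(t_act=T_ACTIVATE_NS, t_pre=T_PRECHARGE_NS,
                              extra=SECOND_ACTIVATE_EXTRA_NS):
    """Overlapped ACTIVATEs: one ACTIVATE latency plus a small extra, then PRECHARGE."""
    return t_act + extra + t_pre


def operation_latency(num_aap, aap_latency):
    """Latency of a bitwise operation made of num_aap back-to-back AAPs."""
    return num_aap * aap_latency


print(aap_latency_naive())                                  # 80.0
print(aap_latency_split_decoder())                          # 49.0
print(operation_latency(4, aap_latency_split_decoder()))    # 196.0
```

Under these assumptions, the four-AAP and operation would take 196 ns per row with the split decoder, compared with 320 ns when each AAP is executed serially.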

In addition to reducing the latency of AAP, the split row decoder significantly reduces the complexity of the row decoding logic. Since only addresses in the B-group are involved in triple-row activations, the complexity of simultaneously raising three wordlines is restricted to the small B-group decoder.

5.4 DRAM Chip and Controller Cost

Buddy has three main sources of cost to the DRAM chip. First, it requires the row decoding logic to distinguish between the B-group addresses and the remaining addresses. Within the B-group, it must implement the mapping described in Table 2. As the B-group contains only 16 addresses, the complexity of the changes to the row decoding logic is low. The second source of cost is the implementation of the dual-contact cells (DCCs). In our design, each sense amplifier has only one DCC on each side, and each DCC has two wordlines associated with it. This design is similar to the one described in [31]. In terms of area, the cost of each DCC is roughly equivalent to two DRAM cells. The third source of cost is the capacity lost due to the reserved rows in the B-group and C-group. The system cannot use these rows to store application data. Our proposed implementation of Buddy reserves 10 rows in each subarray for the two groups. For a typical subarray size of 1024 rows, the loss in memory capacity is approximately 1%.
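The capacity figure follows from simple counting. As a rough check, the sketch below tallies the reserved rows (8 B-group wordlines plus 2 C-group rows) against a 1024-row subarray, and separately counts the row addresses left for the D-group once the 16 B-group and 2 C-group addresses are set aside. The constant names are chosen for this sketch.

```python
ROWS_PER_SUBARRAY = 1024   # typical subarray size quoted in the text
B_GROUP_WORDLINES = 8      # T0-T3 plus the d- and n-wordlines of the two DCC rows
B_GROUP_ADDRESSES = 16     # B0-B15
C_GROUP_ROWS = 2           # C0 (all zeros) and C1 (all ones)

reserved_rows = B_GROUP_WORDLINES + C_GROUP_ROWS
capacity_loss = reserved_rows / ROWS_PER_SUBARRAY
d_group_addresses = ROWS_PER_SUBARRAY - B_GROUP_ADDRESSES - C_GROUP_ROWS

print(reserved_rows)                 # 10
print(f"{capacity_loss:.2%}")        # 0.98%
print(d_group_addresses)             # 1006
```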

DRAM manufacturers have to test chips to determine if TRA and the DCCs work as expected. However, since these operations concern only 8 DRAM rows of the B-group, we expect the additional overhead of testing to be low.

On the controller side, Buddy requires the memory controller to 1) store information about different address groups, 2) track the timing for different variants of the ACTIVATE (with or without the optimizations), and 3) track the status of different on-going bitwise operations. While scheduling different requests, the controller 1) adheres to power constraints like tFAW, and 2) can interleave the multiple AAP commands to perform a bitwise operation with other requests from different applications. We believe this modest increase in the DRAM chip/controller complexity is negligible compared to the improvement in throughput and energy enabled by Buddy (described in Sections 7 and 8).

6 Integrating Buddy with the System Stack

We envision two distinct ways to integrate Buddy into the system. The first way is a loose integration, where we treat Buddy as an accelerator, like a GPU. The second way is a tight integration, where we enable ISA support to integrate Buddy inside main memory. We now discuss both of these approaches.

6.1 Buddy as an Accelerator

In this approach, the manufacturer designs Buddy as an accelerator that can be plugged into the system as a separate device. We envision a system wherein the data structure that relies heavily on bitwise operations is designed to fit inside the accelerator memory, thereby minimizing communication between the CPU and the accelerator. In addition to the performance benefits of accelerating bitwise operations, this approach has two further benefits. First, a single manufacturer designs both the DRAM and the memory controller (not true of commodity DRAM). Second, the details of the data mapping to suit Buddy can be hidden behind the device driver, which can expose a simple-to-use API to the applications. Both these factors simplify the implementation. The execution model for this approach is similar to that of a GPU, wherein the programmer explicitly specifies the portions of the program that have to be executed in the Buddy accelerator.

6.2 Integrating Buddy with System Main Memory

A tighter integration of Buddy with the system main memory requires support from different layers of the system stack, which we discuss below.

6.2.1 ISA Support.

To enable software to communicate occurrences of bulk bitwise operations to the processor, we introduce new instructions of the form,

bop dst, src1, [src2], size

where bop is the bitwise operation to be performed, dst is the destination address, src1 and src2 are the source addresses, and size denotes the length of operation in bytes.

6.2.2 Implementing the New Buddy Instructions.

Since all Buddy operations are row-wide, Buddy requires the source and destination rows to be row-aligned and the operation to be at least the size of a DRAM row. The microarchitecture implementation of the bop instructions checks if each instance of these instructions satisfies this constraint. If so, the CPU sends the operation to the memory controller. The memory controller in turn determines the number of RowClone-PSM operations required to complete the bitwise operation. If the number of RowClone-PSM operations required is three (in which case performing the operation using the CPU will be faster, Section 3.5), or if the source/destination rows do not satisfy the alignment/size constraints, the CPU executes the operation itself. Otherwise, the memory controller completes the operation using Buddy. Note that the CPU performs the virtual-to-physical address translation of the source and destination rows before performing these checks and exporting the operations to the memory controller. Therefore, there is no need for any address translation support in the memory controller.
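The dispatch decision described above can be summarized as a small predicate. The sketch below folds the CPU-side alignment and size check and the controller-side RowClone-PSM count into one hypothetical function for readability. The 8 KB row size, the function names, and the idea of passing the RowClone-PSM count as an argument are all assumptions of this sketch, and addresses are taken to be physical addresses, i.e., already translated.

```python
ROW_SIZE_BYTES = 8192  # assumed DRAM row size; the text does not fix a value


def is_row_aligned(addr, row_size=ROW_SIZE_BYTES):
    """True if a physical address falls on a DRAM row boundary."""
    return addr % row_size == 0


def choose_executor(dst, srcs, size, rowclone_psm_ops, row_size=ROW_SIZE_BYTES):
    """Return "buddy" or "cpu" for one bop instruction.

    dst, srcs        : physical addresses of the destination and source operands
    size             : length of the operation in bytes
    rowclone_psm_ops : number of RowClone-PSM copies the controller would need
    """
    operands = [dst, *srcs]
    aligned = all(is_row_aligned(a, row_size) for a in operands)
    large_enough = size >= row_size
    if not (aligned and large_enough):
        return "cpu"      # alignment/size constraint not met
    if rowclone_psm_ops >= 3:
        return "cpu"      # three PSM copies: the CPU is expected to be faster
    return "buddy"


print(choose_executor(0x0000, [0x2000, 0x4000], 8192, 0))   # buddy
print(choose_executor(0x0040, [0x2000, 0x4000], 8192, 0))   # cpu
print(choose_executor(0x0000, [0x2000, 0x4000], 8192, 3))   # cpu
```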

6.2.3 Maintaining On-chip Cache Coherence.

Buddy directly reads/modifies data in main memory. Therefore, before performing any Buddy operation, the memory controller must 1) flush any dirty cache lines from the source rows, and 2) invalidate any cache lines from destination rows. While flushing the dirty cache lines of the source rows is on the critical path of a Buddy operation, simple structures like the Dirty-Block Index [62] can speed up this step. Our mechanism invalidates the cache lines of the destination rows in parallel with the Buddy operation. Such a coherence mechanism is already required by Direct Memory Access (DMA) [20], which is supported by most modern processors.

6.2.4 Software Support.

The minimum support that Buddy requires from software is for the application to use the new Buddy instructions to communicate the occurrences of bulk bitwise operations to the processor. However, as an optimization to enable maximum benefit, the OS can allocate pages that are likely to be involved in a bitwise operation such that they 1) are row-aligned, and 2) belong to the same subarray. Note that the OS can still interleave the pages of a single data structure across multiple subarrays. Implementing this support requires the OS to be aware of the subarray mapping, i.e., to determine whether two physical pages belong to the same subarray. The OS can extract this information from the DRAM modules with the help of our memory controller (similar to [35, 43]).

7 Analysis of Throughput & Energy

We compare the throughput of Buddy for bulk bitwise operations to that of an Intel Skylake Core i7 system [1] and an NVIDIA GeForce GTX 745 GPU [6]. The Skylake system has 4 cores with support for Advanced Vector eXtensions [28], and two 64-bit DDR3-2133 channels. The GTX 745 contains 3 streaming multi-processors, each with 128 CUDA cores. Its memory system consists of one 128-bit DDR3-1800 channel. For each bitwise operation, we run a microbenchmark that performs the operation repeatedly for many iterations on large input vectors (32 MB), and measure the throughput for the operation. Figure 9 plots the results of this experiment for seven configurations: the Skylake system with 1, 2, and 4 cores, the GTX 745, and Buddy RAM with 1, 2, and 4 DRAM banks.

We draw three conclusions. First, for all bitwise operations, the throughput of the Skylake system is roughly the same for all three core configurations. We find that the available memory bandwidth limits the throughput of these operations, and hence using more cores is not beneficial. Second, while the GTX 745 slightly outperforms the Skylake system, its throughput is also limited by the available memory bandwidth. Although a more powerful GPU with more bandwidth would enable higher throughput, such high-end GPUs are significantly costlier and also consume very high energy. Third, even with a single DRAM bank, Buddy significantly outperforms both the Skylake and the GTX 745 for all bitwise operations (2.7X–6.4X better throughput than the GTX 745). More importantly, unlike the other two systems, Buddy is not limited by the memory channel bandwidth. Therefore, the throughput of Buddy scales linearly with increasing number of banks. Even with power constraints like tFAW, Buddy with two or four banks can achieve close to an order of magnitude higher throughput than the other two systems.

Figure 9: Comparison of throughput of bitwise operations. The values on top of each Buddy bar indicate the factor improvement in throughput of Buddy over the GTX 745.

We estimate energy for DDR3-1333 using the Rambus power model [57]. Our energy numbers include only the DRAM and channel energy, and not the energy consumed by the on-chip resources. For Buddy, some activate operations have to raise multiple wordlines and hence consume higher energy. Based on our analysis, we increase the activation energy by 22% for each additional wordline raised. Table 3 shows the energy consumed per kilobyte for different bitwise operations. Across all bitwise operations, Buddy reduces energy consumption by at least 25.1X (up to 59.5X) compared to the DDR interface.

Interface        not      and/or   nand/nor   xor/xnor
DDR3 (nJ/KB)     93.7     137.9    137.9      137.9
Buddy (nJ/KB)    1.6      3.2      4.0        5.5
Reduction        59.5X    43.9X    35.1X      25.1X
Table 3: Comparison of energy for various groups of bitwise operations. The last row indicates the reduction in energy of Buddy over the traditional DDR3 interface.

Based on these results, we conclude that for systems using DRAM-based main memory, Buddy is the most efficient way of performing bulk bitwise operations.

8 Effect on Real-World Applications

We evaluate the benefits of Buddy on real-world applications using the Gem5 simulator [16]. Table 4 lists the main simulation parameters. We assume that application data is mapped such that all bitwise operations happen across aligned rows within a subarray. We quantitatively evaluate three applications: 1) a database bitmap index [52, 8, 9, 5], 2) BitWeaving [47], a mechanism to accelerate database column scan operations, and 3) a bitvector-based implementation of the widely-used set data structure. In Section 8.4, we discuss four other potential applications that can benefit from Buddy.

Processor      x86, 8-wide, out-of-order, 4 GHz, 64-entry instruction queue
L1 cache       32 KB D-cache, 32 KB I-cache, LRU policy
L2 cache       2 MB, LRU policy, 64 B cache line size
Main memory    DDR4-2400, 1-channel, 1-rank, 16 banks
Table 4: Major simulation parameters

8.1 Bitmap Indices

Bitmap indices [18] are an alternative to traditional B-tree indices for databases. Compared to B-trees, bitmap indices 1) consume less space, and 2) can perform better for many important queries (e.g., joins, scans). Several major databases support bitmap indices (e.g., Oracle [52], Redis [8], Fastbit [5], rlite [9]). Several real applications (e.g., Spool [10], Belly [2], bitmapist [3], Audience Insights [21]) use bitmap indices for fast analytics. As bitmap indices heavily rely on bulk bitwise operations, Buddy can accelerate bitmap indices, thereby improving overall application performance.

To demonstrate this benefit, we use the following workload from a real application [21]. The application uses bitmap indices to track users’ characteristics (e.g., gender) and activities (e.g., did the user log in to the website on day ’X’?) for m users. The application then uses bitwise operations on these bitmaps to answer different queries. Our workload runs the following query: “How many unique users were active every week for the past n weeks? and How many male users were active each of the past n weeks?” Executing this query requires 6n bitwise or, 2n-1 bitwise and, and n+1 bitcount operations. In our mechanism, Buddy accelerates the bitwise or and and operations in these queries, and the bitcount operations are performed by the CPU. Figure 10 shows the end-to-end query execution time of the baseline and Buddy for the above experiment for various values of m and n.

Figure 10: Performance of Buddy for bitmap indices. The values on top of each bar indicate the factor reduction in execution time due to Buddy.

We draw two conclusions. First, as each query has O(n) bitwise operations and each bitwise operation takes O(m) time, the execution time of the query increases with increasing values of m and n. Second, Buddy significantly reduces the query execution time compared to the baseline, by 6X on average.

While we demonstrate the benefits of Buddy using one query, as all bitmap index queries involve several bitwise operations, Buddy would provide similar performance benefits for any application using bitmap indices.

8.2 BitWeaving: Fast Scans using Bitwise Operations

Column scan operations are a common part of many database queries. They are typically performed as part of evaluating a predicate. For a column with integer values, a predicate is typically of the form, c1 <= val <= c2. Recent works [47, 69] have observed that existing data representations for storing columnar data are inefficient for such predicate evaluation especially when the number of bits used to store each value of the column is less than the processor width. This is because 1) the values do not align well with the processor boundaries, and 2) the processor typically does not have comparison instructions at granularities smaller than the processor word. To address this problem, BitWeaving [47] proposes two different data representations called BitWeaving-H and BitWeaving-V. We focus our attention on the faster of the two mechanisms, BitWeaving-V.

BitWeaving-V stores the values of a column such that the first bit of all the values of the column are stored contiguously, the second bit of all the values of the column are stored contiguously, and so on. Using this representation, the predicate c1 <= val <= c2, can be represented as a series of bitwise operations starting from the most significant bit all the way to the least significant bit (we refer the reader to the BitWeaving paper [47] for the detailed algorithm). As these bitwise operations can be performed in parallel across multiple values of the column, BitWeaving uses the hardware SIMD support to accelerate these operations. With support for Buddy, these operations can be performed in parallel across a much larger set of values, thereby enabling higher performance.
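To illustrate how a range predicate reduces to bitwise operations on such a vertical layout, the sketch below uses a generic bit-serial comparison over bit planes, scanning from the most significant bit to the least significant bit. It is written in the spirit of BitWeaving-V and is not the exact algorithm of [47]. Each bit plane is a Python integer whose bit j holds one bit of the j-th column value, and all function names are hypothetical.

```python
def to_bit_planes(values, width):
    """Vertical layout: planes[0] holds the MSB of every value, planes[-1] the LSB."""
    planes = []
    for i in range(width - 1, -1, -1):
        plane = 0
        for j, v in enumerate(values):
            plane |= ((v >> i) & 1) << j
        planes.append(plane)
    return planes


def less_than_and_equal(planes, const, n):
    """Return (lt, eq) bit vectors: bit j is set if value j < const / value j == const."""
    mask = (1 << n) - 1
    width = len(planes)
    lt, eq = 0, mask
    for k, plane in enumerate(planes):          # MSB first
        const_bit = (const >> (width - 1 - k)) & 1
        if const_bit:
            lt |= eq & ~plane & mask            # equal so far, value has 0 where const has 1
            eq &= plane
        else:
            eq &= ~plane & mask                 # value must also have 0 to stay equal
    return lt, eq


def between(planes, c1, c2, n):
    """Bit vector of rows satisfying c1 <= val <= c2."""
    mask = (1 << n) - 1
    lt_c1, _ = less_than_and_equal(planes, c1, n)
    lt_c2, eq_c2 = less_than_and_equal(planes, c2, n)
    return (~lt_c1 & mask) & (lt_c2 | eq_c2)


values = [3, 9, 12, 5, 0, 15, 7, 8]
planes = to_bit_planes(values, width=4)
hits = between(planes, 5, 9, len(values))
print(bin(hits).count("1"))   # 4, the result of count(*) for 5 <= val <= 9
```

Every step in the loop is an AND, OR, or NOT over whole bit planes, which is the kind of row-wide operation that Buddy performs inside DRAM; only the final bit count remains for the CPU.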

We show this benefit by comparing the performance of the baseline BitWeaving with the performance of BitWeaving accelerated by Buddy for a commonly-used query

select count(*) from T where c1 <= val <= c2

The query involves a series of bitwise operations to evaluate the predicate and a bitcount operation to compute the count(*). The execution time of this query depends on 1) the number of bits used to represent each value of val, b, and 2) the number of rows in the table T, r. Figure 11 shows the speedup of Buddy over BitWeaving for various values of b and r.

Figure 11: Speedup offered by Buddy for BitWeaving

We draw three conclusions. First, Buddy improves the performance of the query by between 1.8X and 11.8X (7.0X on average) compared to the baseline BitWeaving for various values of b and r. Second, the performance improvement of Buddy increases with increasing value of the number of bits per column b, because, as b increases, the fraction of time spent in performing the bitcount operation reduces. As a result, a larger fraction of the execution can be accelerated using Buddy. Third, for b = 4, 8, 12, and 16, we can observe a large jump in the speedup of Buddy. These are points where the working set stops fitting in the on-chip cache. By exploiting the high bank-level parallelism in DRAM, Buddy can outperform baseline BitWeaving (by up to 4.1X) even when the working set fits in the cache.

8.3 Bit Vectors vs. Red-Black Trees

Many algorithms heavily use the set data structure. While red-black trees [24] (RB-trees) are commonly used to implement a set (e.g., C++ Standard Template Library [4]), when the domain of elements is limited, we can implement a set using a bit vector. Bit vectors offer constant time insert and lookup as opposed to RB-trees, which consume O(log n) time for both operations on a set of n elements. However, with bit vectors, set operations like union, intersection, and difference have to operate on the entire bit vector, regardless of whether the elements are actually present in the set. As a result, for these operations, depending on the number of elements in the set, bit vectors may outperform or perform worse than RB-trees. With support for fast bulk bitwise operations, we show that Buddy significantly shifts the trade-off in favor of bit vectors.
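A bit-vector set over a bounded domain can be sketched in a few lines. The example below uses a Python integer as the bit vector, so union, intersection, and difference each become one bulk bitwise operation over the whole vector. The class name and its methods are hypothetical and are not taken from the evaluated implementation.

```python
class BitVectorSet:
    """Set of integers in [0, domain_size), stored as one bit per possible element."""

    def __init__(self, domain_size, bits=0):
        self.domain_size = domain_size
        self.bits = bits

    def insert(self, x):
        if not 0 <= x < self.domain_size:
            raise ValueError("element outside the domain")
        self.bits |= 1 << x                 # constant-time insert

    def contains(self, x):
        return (self.bits >> x) & 1 == 1    # constant-time lookup

    def union(self, other):
        return BitVectorSet(self.domain_size, self.bits | other.bits)

    def intersection(self, other):
        return BitVectorSet(self.domain_size, self.bits & other.bits)

    def difference(self, other):
        full = (1 << self.domain_size) - 1
        return BitVectorSet(self.domain_size, self.bits & (~other.bits & full))

    def elements(self):
        return [x for x in range(self.domain_size) if self.contains(x)]


a = BitVectorSet(16)
b = BitVectorSet(16)
for x in (1, 4, 9):
    a.insert(x)
for x in (4, 9, 12):
    b.insert(x)
print(a.union(b).elements())          # [1, 4, 9, 12]
print(a.intersection(b).elements())   # [4, 9]
print(a.difference(b).elements())     # [1]
```

The cost of each set operation depends on the domain size rather than on the number of elements present, which is the trade-off against RB-trees discussed above.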

To demonstrate this, we compare the performance of union, intersection, and difference operations using three implementations: RB-tree, bit vectors with SIMD optimization (Bitset), and bit vectors with Buddy. We run a benchmark that performs each operation on a group of input sets, whose elements are drawn from a bounded domain, and stores the result in an output set. Figure 12 shows the execution time for each implementation normalized to RB-tree for the three operations, with a varying number of elements in the input sets.

Figure 12: Comparison between RB-Tree, Bitset, and Buddy

We draw three conclusions. First, by enabling much higher throughput for bitwise operations, Buddy outperforms the baseline bitset on all the experiments. Second, as expected, when the number of elements in each set is very small (16 elements), RB-Tree performs better than the bit vector based implementations. Third, even when each set contains only 64 or more elements, Buddy significantly outperforms RB-Tree, by 3X on average.

In summary, by performing bulk bitwise operations efficiently and with much higher throughput compared to existing systems, Buddy makes a bit-vector-based implementation of a set more attractive than red-black-trees.

8.4 Other Applications

8.4.1 Masked Initialization.

Certain operations have to clear a specific field in an array of objects. Such masked initializations [55] are very useful in applications like graphics (e.g., clearing a specific color in an image). Existing systems read the entire data structure into the processor/GPU before performing these operations. Fortunately, we can use bitwise AND/OR operations to express masked initializations, and consequently, Buddy can easily accelerate these operations.
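As an illustration of how a masked initialization maps to AND/OR, the sketch below clears (or sets) one field in an array of fixed-size records by ANDing the whole buffer with a repeated keep-mask and ORing in a repeated fill pattern. The 3-byte RGB record layout and the function name are assumptions made for this example, and the fill pattern is assumed to have bits only inside the cleared field.

```python
def masked_init(data, keep_pattern, fill_pattern):
    """Return data with each record ANDed with keep_pattern and ORed with fill_pattern.

    keep_pattern : bytes with 1-bits where the original data is preserved
    fill_pattern : bytes with the new value for the cleared field (zeros elsewhere)
    """
    record = len(keep_pattern)
    if len(fill_pattern) != record or len(data) % record != 0:
        raise ValueError("patterns must match the record size")
    reps = len(data) // record
    keep = int.from_bytes(keep_pattern * reps, "big")   # mask replicated over the array
    fill = int.from_bytes(fill_pattern * reps, "big")
    word = int.from_bytes(data, "big")
    return ((word & keep) | fill).to_bytes(len(data), "big")


pixels = bytes([10, 20, 30, 40, 50, 60])                 # two RGB pixels
cleared = masked_init(pixels, b"\xff\x00\xff", b"\x00\x00\x00")
print(list(cleared))                                     # [10, 0, 30, 40, 0, 60]
```

Both steps are bulk bitwise operations over the entire buffer, so the same pattern could be issued as one AND and one OR over the rows holding the array.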

8.4.2 Encryption.

Many encryption algorithms heavily use bitwise operations (e.g., XOR) [66, 26, 50]. The Buddy support for fast and efficient bitwise operations can i) boost the performance of existing encryption algorithms, and ii) enable new encryption algorithms with high throughput and efficiency.

8.4.3 DNA Sequence Mapping.

Most DNA sequence mapping algorithms [61] rely on identifying the locations where a small DNA sub-string occurs in the reference genome. As the reference genome is large, prior works have proposed a number of pre-processing algorithms [45, 60, 68, 72, 71, 58, 15] to speed up this operation. Some of these prior works [71, 58, 15] heavily use bitwise operations. Buddy can significantly accelerate these bitwise operations, thereby reducing the overall time consumed by the DNA sequence mapping algorithm.

8.4.4 Approximate Statistics.

Certain large systems employ probabilistic data structures to improve the efficiency of maintaining statistics [17]. Many such structures (e.g., Bloom filters) rely on bitwise operations to achieve high efficiency. By improving the throughput of bitwise operations, Buddy can improve the efficiency of such data structures, and potentially enable the design of new data structures in this space.

9 Related Work

To our knowledge, this is the first work that proposes a mechanism to perform the functionally complete set of bitwise operations completely inside DRAM with high efficiency and low cost. While other works have explored using capacitors to implement logic gates [53], we are not aware of any work that exploits modern DRAM architecture to perform bitwise operations. Many prior works aim to enable efficient computation near memory. We now compare Buddy to these prior works.

Some recent patents [14, 13] from Mikamonu describe a new DRAM organization with 3T-1C cells and additional logic (e.g., muxes) to perform NAND/NOR operations on the data inside DRAM. While this architecture can perform bitwise operations inside DRAM, the 3T-1C cells result in significant additional area cost to the DRAM array, and hence greatly reduce overall memory density/capacity. In contrast, Buddy exploits existing DRAM cell structure and operation to perform bitwise operations efficiently inside DRAM. As a result, it incurs much lower cost compared to the Mikamonu architecture.

A recent paper proposes Pinatubo [46], a mechanism to perform bulk bitwise operations inside PCM. Similarly, a recent line of work [44, 38, 39, 40] proposes mechanisms to perform bitwise operations and other simple operations (3-bit full adder) completely inside a memristor array. First, as the underlying memory technology is different, the mechanisms proposed by these works are completely different from Buddy. Second, given that DRAM is faster than PCM or memristors, Buddy can offer much higher throughput compared to Pinatubo. Having said that, these works demonstrate the importance of improving the throughput of bulk bitwise operations.

A few recent works [63, 64, 32] exploit memory architectures to accelerate specific operations. RowClone [63] efficiently performs bulk copy and initialization inside DRAM. Kang et al. [32] propose a mechanism to exploit SRAM to accelerate “sum of absolute differences” computation. ISAAC [64] proposes a mechanism to accelerate dot-product computations using a memristor array. While these mechanisms significantly improve the efficiency of performing the respective operations, none of them can perform bitwise operations like Buddy.

Prior works (e.g., EXECUBE [37], IRAM [54], DIVA [22]) propose designs that integrate custom processing logic into the DRAM chip to perform bandwidth intensive operations. The idea is to design processing elements using the DRAM process, thereby allowing these elements to exploit the full bandwidth of DRAM. These approaches have two drawbacks. First, logic designed using DRAM process is generally slower than regular processors. Second, the logic added to DRAM significantly increases the chip cost. In contrast, we propose a low-cost mechanism that greatly accelerates bitwise operations.

Some recent DRAM architectures [49, 29, 7] use 3D-stacking to stack multiple DRAM chips on top of the processor chip or a separate logic layer. These architectures offer much higher bandwidth to the logic layer compared to traditional off-chip interfaces. This enables an opportunity to offload some computation to the logic layer, thereby improving performance. In fact, many recent works have proposed mechanisms to improve and exploit such architectures (e.g., [12, 11, 74, 23, 25]). Even though they offer higher bandwidth compared to off-chip memory, such 3D-stacked architectures are still bandwidth limited [42]. However, since Buddy can be integrated easily with such architectures, it can still offer significant performance and energy improvements in conjunction with 3D-stacking.

10 Conclusion

We introduced Buddy, a new substrate that performs row-wide bitwise operations within a DRAM chip by exploiting the analog operation of DRAM. Buddy consists of two components. The first component uses simultaneous activation of three DRAM rows to efficiently perform bitwise AND/OR operations. The second component uses the inverters present in each sense amplifier to efficiently implement bitwise NOT operations. With these two components, Buddy can perform any bitwise logical operation efficiently within DRAM. Our evaluations show that Buddy enables 10.9X–25.6X improvement in the throughput of bitwise operations. This improvement directly translates to significant performance improvement (3X–7X) in the evaluated data-intensive applications. Buddy is generally applicable to any memory architecture that uses DRAM technology. We believe that the support for fast and efficient bitwise operations in DRAM can enable better design of applications to take advantage of them, which would result in large improvements in performance and efficiency.

References

  • [1] 6th Generation Intel Core Processor Family Datasheet. http://www.intel.com/content/www/us/en/processors/core/desktop-6th-gen-core-family-datasheet-vol-1.html.
  • [2] Belly card engineering. https://tech.bellycard.com/.
  • [3] bitmapist: Powerful realtime analytics with Redis 2.6’s bitmaps and Python. https://github.com/Doist/bitmapist.
  • [4] C++ containers library, std::set. http://en.cppreference.com/w/cpp/container/set.
  • [5] FastBit: An Efficient Compressed Bitmap Index Technology. https://sdm.lbl.gov/fastbit/.
  • [6] GeForce GTX 745 Specifications. http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-745-oem/specifications.
  • [7] High Bandwidth Memory DRAM. http://www.jedec.org/standards-documents/docs/jesd235.
  • [8] Redis - bitmaps. http://redis.io/topics/data-types-intro#bitmaps.
  • [9] rlite: A Self-contained, Serverless, Zero-configuration, Transactional Redis-compatible Database Engine. https://github.com/seppo0010/rlite.
  • [10] Spool. http://www.getspool.com/.
  • [11] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, pages 105–117, New York, NY, USA, 2015. ACM.
  • [12] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, pages 336–348, New York, NY, USA, 2015. ACM.
  • [13] A. Akerib, O. Agam, E. Ehrman, and M. Meyassed. Using storage cells to perform computation, December 2014. US Patent 8,908,465.
  • [14] Avidan Akerib and Eli Ehrman. In-memory Computational Device, Patent No. 20150146491, May 2015.
  • [15] Gary Benson, Yozen Hernandez, and Joshua Loving. A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm, pages 50–61. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
  • [16] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The Gem5 Simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
  • [17] Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. Proc. VLDB Endow., 7(13):1441–1451, August 2014.
  • [18] Chee-Yong Chan and Yannis E. Ioannidis. Bitmap index design and evaluation. In SIGMOD, pages 355–366, New York, NY, USA, 1998. ACM.
  • [19] K. K.-W. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu. Improving DRAM Performance by Parallelizing Refreshes with Accesses. In HPCA, 2014.
  • [20] J. Corbet et al. Linux Device Drivers, page 445. O’Reilly Media, 2005.
  • [21] Demiz Denir, Islam AbdelRahman, Liang He, and Yingsheng Gao. Audience Insights query engine: In-memory integer store for social analytics. https://code.facebook.com/posts/382299771946304/audience-insights-query-engine-in-memory-integer-store-for-social-analytics-/.
  • [22] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. The Architecture of the DIVA Processing-in-memory Chip. In Proceedings of the 16th International Conference on Supercomputing, ICS ’02, pages 14–25, New York, NY, USA, 2002. ACM.
  • [23] A. Farmahini-Farahani, Jung Ho Ahn, K. Morrow, and Nam Sung Kim. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 283–295, Feb 2015.
  • [24] Leo J. Guibas and Robert Sedgewick. A Dichromatic Framework for Balanced Trees. In Proceedings of the 19th Annual Symposium on Foundations of Computer Science, SFCS ’78, pages 8–21, Washington, DC, USA, 1978. IEEE Computer Society.
  • [25] Qi Guo, Nikolaos Alachiotis, Berkin Akin, Fazle Sadi, Guanglin Xu, Tze Meng Low, Larry Pileggi, James C Hoe, and Franz Franchetti. 3D-stacked Memory-side Acceleration: Accelerator and System Design. In WoNDP, 2013.
  • [26] Jong-Wook Han, Choon-Sik Park, Dae-Hyun Ryu, and Eun-Soo Kim. Optical image encryption based on XOR operations. Optical Engineering, 38(1):47–54, 1999.
  • [27] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and O. Mutlu. ChargeCache: Reducing DRAM latency by exploiting row access locality. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 581–593, March 2016.
  • [28] Intel. Intel Instruction Set Architecture Extensions. https://software.intel.com/en-us/intel-isa-extensions.
  • [29] J. Jeddeloh and B. Keeth. Hybrid Memory Cube: New DRAM architecture increases density and performance. In VLSIT, pages 87–88, June 2012.
  • [30] JEDEC. DDR3 SDRAM Standard, JESD79-3D. http://www.jedec.org/sites/default/files/docs/JESD79-3D.pdf, 2009.
  • [31] H.B. Kang and S.K. Hong. One-transistor type DRAM, January 8 2009. US Patent App. 12/000,393.
  • [32] Mingu Kang, Min-Sun Keel, Naresh R Shanbhag, Sean Eilert, and Ken Curewitz. An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8326–8330. IEEE, 2014.
  • [33] Uksong Kang. Personal communication, Oct 2016.
  • [34] Brent Keeth, R. Jacob Baker, Brian Johnson, and Feng Lin. DRAM Circuit Design: Fundamental and High-Speed Topics. Wiley-IEEE Press, 2nd edition, 2007.
  • [35] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A Case for Exploiting Subarray-level Parallelism (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA ’12, pages 368–379, Washington, DC, USA, 2012. IEEE Computer Society.
  • [36] D. E. Knuth. The Art of Computer Programming. Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams, 2009.
  • [37] Peter M. Kogge. EXECUBE: A New Architecture for Scaleable MPPs. In ICPP, pages 77–84, Washington, DC, USA, 1994. IEEE Computer Society.
  • [38] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser. MAGIC —Memristor-Aided Logic. IEEE Transactions on Circuits and Systems II: Express Briefs, 61(11):895–899, Nov 2014.
  • [39] S. Kvatinsky, A. Kolodny, U. C. Weiser, and E. G. Friedman. Memristor-based IMPLY logic design procedure. In Computer Design (ICCD), 2011 IEEE 29th International Conference on, pages 142–147, Oct 2011.
  • [40] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser. Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(10):2054–2066, Oct 2014.
  • [41] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu. Adaptive-latency DRAM: Optimizing DRAM timing for the common-case. In HPCA, pages 489–501, Feb 2015.
  • [42] Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. ACM Trans. Archit. Code Optim., 12(4):63:1–63:29, January 2016.
  • [43] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu. Tiered-latency DRAM: A Low Latency and Low Cost DRAM Architecture. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, pages 615–626, Washington, DC, USA, 2013. IEEE Computer Society.
  • [44] Yifat Levy, Jehoshua Bruck, Yuval Cassuto, Eby G. Friedman, Avinoam Kolodny, Eitan Yaakobi, and Shahar Kvatinsky. Logic operations in memory using a memristive Akers array. Microelectronics Journal, 45(11):1429 – 1437, 2014.
  • [45] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, 2010.
  • [46] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-Volatile Memories. In Proceedings of the 53rd Annual Design Automation Conference, page 173. ACM, 2016.
  • [47] Yinan Li and Jignesh M. Patel. BitWeaving: Fast Scans for Main Memory Data Processing. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 289–300, New York, NY, USA, 2013. ACM.
  • [48] Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 60–71, New York, NY, USA, 2013. ACM.
  • [49] Gabriel H. Loh. 3D-Stacked Memory Architectures for Multi-core Processors. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA ’08, pages 453–464, Washington, DC, USA, 2008. IEEE Computer Society.
  • [50] S. A. Manavski. CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography. In IEEE International Conference on Signal Processing and Communications, 2007. ICSPC 2007, pages 65–68, Nov 2007.
  • [51] Elizabeth O’Neil, Patrick O’Neil, and Kesheng Wu. Bitmap index design choices and their performance implications. In Proceedings of the 11th International Database Engineering and Applications Symposium, IDEAS ’07, pages 72–84, Washington, DC, USA, 2007. IEEE Computer Society.
  • [52] Oracle. Using Bitmap Indexes in Data Warehouses. https://docs.oracle.com/cd/B28359_01/server.111/b28313/indexes.htm.
  • [53] H. Ozdemir, A. Kepkep, B. Pamir, Y. Leblebici, and U. Cilingiroglu. A capacitive threshold-logic gate. IEEE Journal of Solid-State Circuits, 31(8):1141–1150, Aug 1996.
  • [54] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34–44, March 1997.
  • [55] Alex Peleg and Uri Weiser. MMX Technology Extension to the Intel Architecture. IEEE Micro, 16(4):42–50, August 1996.
  • [56] PTM. Predictive technology model. http://ptm.asu.edu/.
  • [57] Rambus. DRAM Power Model. https://www.rambus.com/energy/, 2010.
  • [58] Kim R Rasmussen, Jens Stoye, and Eugene W Myers. Efficient q-gram filters for finding all ε-matches over a given length. Journal of Computational Biology, 13(2):296–308, 2006.
  • [59] P. J. Restle, J. W. Park, and B. F. Lloyd. DRAM variable retention time. In International Electron Devices Meeting, 1992. IEDM ’92. Technical Digest, pages 807–810, Dec 1992.
  • [60] Stephen M Rumble, Phil Lacroute, Adrian V Dalca, Marc Fiume, Arend Sidow, and Michael Brudno. SHRiMP: Accurate mapping of short color-space reads. 2009.
  • [61] Sophie Schbath, Véronique Martin, Matthias Zytnicki, Julien Fayolle, Valentin Loux, and Jean-François Gibrat. Mapping reads on a genomic sequence: An algorithmic overview and a practical comparative analysis. Journal of Computational Biology, 19(6):796–813, 2012.
  • [62] Vivek Seshadri, Abhishek Bhowmick, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. The Dirty-block Index. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA ’14, pages 157–168, Piscataway, NJ, USA, 2014. IEEE Press.
  • [63] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 185–197, New York, NY, USA, 2013. ACM.
  • [64] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proc. ISCA, 2016.
  • [65] W. Shin, J. Yang, J. Choi, and L. S. Kim. NUAT: A non-uniform access time memory controller. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 464–475, Feb 2014.
  • [66] P. Tuyls, H. D. L. Hollmann, J. H. Van Lint, and L. Tolhuizen. XOR-based Visual Cryptography Schemes. Designs, Codes and Cryptography, 37(1):169–186.
  • [67] Henry S. Warren. Hacker’s Delight. Addison-Wesley Professional, 2nd edition, 2012.
  • [68] David Weese, Anne-Katrin Emde, Tobias Rausch, Andreas Döring, and Knut Reinert. RazerS—fast read mapping with sensitivity control. Genome research, 19(9):1646–1654, 2009.
  • [69] Thomas Willhalm, Ismail Oukid, Ingo Müller, and Franz Faerber. Vectorizing Database Column Scans with Complex Predicates. In Rajesh Bordawekar, Christian A. Lang, and Bugra Gedik, editors, ADMS@VLDB, pages 1–12, 2013.
  • [70] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. Compressing Bitmap Indexes for Faster Search Operations. In Proceedings of the 14th International Conference on Scientific and Statistical Database Management, SSDBM ’02, pages 99–108, Washington, DC, USA, 2002. IEEE Computer Society.
  • [71] Hongyi Xin, John Greth, John Emmons, Gennady Pekhimenko, Carl Kingsford, Can Alkan, and Onur Mutlu. Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping. Bioinformatics, 2015.
  • [72] Hongyi Xin, Donghyuk Lee, Farhad Hormozdiari, Samihan Yedkar, Onur Mutlu, and Can Alkan. Accelerating read mapping with FastHASH. BMC genomics, 14(Suppl 1):S13, 2013.
  • [73] D. S. Yaney, C. Y. Lu, R. A. Kohler, M. J. Kelly, and J. T. Nelson. A meta-stable leakage phenomenon in DRAM charge storage - Variable hold time. In 1987 International Electron Devices Meeting, pages 336–339, Dec 1987.
  • [74] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. TOP-PIM: Throughput-oriented programmable processing in memory. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, pages 85–98, New York, NY, USA, 2014. ACM.
  • [75] Tao Zhang, Ke Chen, Cong Xu, Guangyu Sun, Tao Wang, and Yuan Xie. Half-DRAM: A High-bandwidth and Low-power DRAM Architecture from the Rethinking of Fine-grained Activation. In Proceedings of the 41st Annual International Symposium on Computer Architecture, ISCA ’14, pages 349–360, Piscataway, NJ, USA, 2014. IEEE Press.
  • [76] Wei Zhao and Yu Cao. New generation of predictive technology model for sub-45 nm early design exploration. IEEE TED, 2006.