Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby incurring high latency and consuming significant memory bandwidth and energy. In this paper, we describe Ambit, a recently proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.
Index Terms: Processing using Memory, DRAM, Bulk Copy, Bulk Initialization, Bulk Bitwise Operations, Performance, Energy Efficiency
Many applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors [btt-knuth, hacker-delight]. In databases, bitmap indices [bmide, bmidc], which heavily use bulk bitwise operations, are more efficient than B-trees for many queries [bmide, fastbit, bicompression]. In fact, many real-world databases [oracle, redis, fastbit, rlite] support bitmap indices. A recent work, WideTable [widetable], designs an entire database around a technique called BitWeaving [bitweaving], which accelerates scans completely using bulk bitwise operations. Microsoft recently open-sourced a technology called BitFunnel [bitfunnel] that accelerates the document filtering portion of web search. BitFunnel relies on fast bulk bitwise AND operations. Bulk bitwise operations are also prevalent in DNA sequence alignment [bitwise-alignment, shd, gatekeeper, grim, myers1999, nanopore-sequencing, shouji], encryption algorithms [xor1, xor2, enc1], graph processing [pinatubo, pim-enabled-insts, graphpim], networking [hacker-delight], and machine learning [neural-cache]. Thus, accelerating bulk bitwise operations can significantly boost the performance of various important applications.
In existing systems, a bulk bitwise operation requires a large amount of data to be transferred on the memory channel. Such large data transfers result in high latency, bandwidth, and energy consumption. In fact, our experiments on a multi-core Intel Skylake [intel-skylake] and an NVIDIA GeForce GTX 745 [gtx745] show that the available memory bandwidth of these systems limits the throughput of bulk bitwise operations. Recent works (e.g., [pim-enabled-insts, pim-graph, top-pim, nda, msa3d, pointer3d, tom, lazypim-cal, grim, gwcd, cds-nmc, pattnaik2016]) propose processing in the logic layer of 3D-stacked DRAM, which stacks DRAM layers on top of a logic layer (e.g., Hybrid Memory Cube [hmc, hmc2], High Bandwidth Memory [hbm, smla]). While the logic layer in 3D-stacked memory has much higher bandwidth than traditional systems, it still cannot exploit the maximum internal bandwidth available inside a DRAM chip [smla] (Section 7).
In this paper, we describe Ambit [ambit], a new mechanism proposed by recent work that performs bulk bitwise operations completely inside main memory. Ambit is an instance of the recently introduced notion called Processing using Memory [pum-bookchapter]. In contrast to Processing in Memory architectures [pim-enabled-insts, pim-graph, top-pim, msa3d, spmm-mul-lim, data-access-opt-pim, tom, hrl, gp-simd, ndp-architecture, pim-analytics, nda, jafar, data-reorg-3d-stack, smla, lim-computer, non-von-machine, iram, execube, active-pages, pim-terasys, cram, bitwise-cal, rowclone, pointer3d, continuous-run-ahead, emc, gwcd, cds-nmc, lazypim-cal, graphpim, conda] that add extra computational logic closer to main memory, the idea behind Processing using Memory is to exploit the existing structure and organization of memory devices with minimal changes to provide additional functionality.
Along these lines, Ambit uses the analog operation principles of DRAM technology to perform bulk bitwise operations completely inside the memory array. With modest changes to the DRAM design, Ambit can exploit 1) the maximum internal bandwidth available inside each DRAM array, and 2) the memory-level parallelism [parbs, salp, glewmlp, tcm, runahead, mlp-prefetching] across multiple DRAM arrays to enable one to two orders of magnitude improvement in raw throughput and energy consumption of bulk bitwise operations. Ambit exposes a Bulk Bitwise Execution Model to the host processor. We show that real-world applications that heavily use bulk bitwise operations can use this model to achieve significant improvements in performance and energy efficiency.
In this paper, we discuss the following main concepts.
As Ambit builds on top of modern DRAM architecture, we first provide a brief background on modern DRAM organization and operation that is sufficient to understand the mechanisms proposed by Ambit (Section 2).
2 Background on DRAM
In this section, we describe the background necessary to understand modern DRAM architecture and its implementation. This paper builds on our previous book chapter that introduces the notion of Processing using Memory [pum-bookchapter]. Since that chapter provides a detailed background on DRAM, this section is mostly reproduced from the DRAM background section of that chapter. While we focus our attention primarily on commodity DRAM design (i.e., the DDRx interface), most DRAM architectures use very similar design approaches and vary only in higher-level design choices [ramulator]. As a result, Ambit can be extended to any DRAM architecture. There has been significant recent research in DRAM architectures, and the interested reader can find details about various aspects of DRAM in multiple recent publications [salp, tl-dram, al-dram, gs-dram, dsarp, ramulator, data-retention, parbor, fly-dram, efficacy-error-techniques, raidr, chargecache, avatar, diva-dram, reaper, softmc, lisa, smla, vivek-thesis, yoongu-thesis, donghyuk-thesis, kevin-thesis, rowclone, ddma, samira-micro17, samira-cal16, crow, cal-dram, drange, dlpuf, solar-dram, patel2019].
At the end of this section, we provide a brief overview of RowClone [rowclone], a prior work that enables the memory controller to perform row-wide copy and initialization operations completely inside DRAM. Ambit exploits RowClone to reduce the overhead of some of its bulk data copy and initialization operations.
2.1 High-level Organization of the Memory System
Figure 1 shows the organization of the memory subsystem in a modern computing system. At a high level, each processor chip is connected to one or more off-chip memory channels. Each memory channel consists of its own set of command, address, and data buses. Depending on the design of the processor, there can be either an independent memory controller for each memory channel or a single memory controller for all memory channels. All memory modules connected to a channel share the buses of the channel. Each memory module consists of many DRAM devices (or chips). Most of this section is dedicated to describing the design of a modern DRAM chip. In Section 2.3, we present more details of the module organization of commodity DRAM.
2.2 DRAM Chip
A modern DRAM chip consists of a hierarchy of structures: DRAM cells, tiles/MATs, subarrays, and banks. In this section, we describe the design of a modern DRAM chip in a bottom-up fashion, starting from a single DRAM cell and its operation.
2.2.1 DRAM Cell and Sense Amplifier
At the lowest level, DRAM technology uses capacitors to store information. Specifically, it uses the two extreme states of a capacitor, namely, the empty and the fully charged states to store a single bit of information. For instance, an empty capacitor can denote a logical value of 0, and a fully charged capacitor can denote a logical value of 1. Figure 2 shows the two extreme states of a capacitor.
Unfortunately, the capacitors used for DRAM chips are small, and will get smaller with each new generation. As a result, the amount of charge that can be stored in the capacitor, and hence the difference between the two states, is also very small. In addition, the capacitor can potentially lose its state after it is accessed. Therefore, to extract the state of the capacitor, DRAM manufacturers use a component called a sense amplifier.
Figure 4 shows a sense amplifier. A sense amplifier contains two inverters which are connected together such that the output of one inverter is connected to the input of the other and vice versa. The sense amplifier also has an enable signal that determines if the inverters are active. When enabled, the sense amplifier has two stable states, as shown in Figure 4. In both these stable states, each inverter takes a logical value and feeds the other inverter with the negated input.
Figure 5 shows the operation of the sense amplifier, starting from a disabled state. In the initial disabled state, we assume that the voltage level of the top terminal ($V_a$) is higher than that of the bottom terminal ($V_b$). When the sense amplifier is enabled in this state, it senses the difference between the two terminals and amplifies the difference until it reaches one of the stable states (hence the name “sense amplifier”).
2.2.2 DRAM Cell Operation: The Activate-Precharge cycle
DRAM technology uses a simple mechanism that converts the logical state of a capacitor into a logical state of the sense amplifier. Data can then be accessed from the sense amplifier (since it is in a stable state). Figure 6 shows 1) the connection between a DRAM cell and the sense amplifier, and 2) the sequence of states involved in capturing the cell state in the sense amplifier.
As shown in the figure (state ➊), the capacitor is connected to an access transistor that acts as a switch between the capacitor and the sense amplifier. The transistor is controlled by a wire called the wordline. The wire that connects the transistor to the top end of the sense amplifier is called the bitline. In the initial state ➊, the wordline is lowered, the sense amplifier is disabled, and both ends of the sense amplifier are maintained at a voltage level of $\frac{1}{2}V_{DD}$. We assume that the capacitor is initially fully charged (the operation is similar if the capacitor is initially empty). This state is referred to as the precharged state. An access to the cell is triggered by a command called ACTIVATE. Upon receiving an ACTIVATE, the corresponding wordline is first raised (state ➋). This connects the capacitor to the bitline. In the ensuing phase called charge sharing (state ➌), charge flows from the capacitor to the bitline, raising the voltage level on the bitline (top end of the sense amplifier) to $\frac{1}{2}V_{DD} + \delta$. After charge sharing, the sense amplifier is enabled (state ➍). The sense amplifier detects the difference in voltage levels between its two ends and amplifies the deviation until it reaches the stable state where the top end is at $V_{DD}$ (state ➎). Since the capacitor is still connected to the bitline, the charge on the capacitor is also fully restored. We shortly describe how the data can be accessed from the sense amplifier. However, once the access to the cell is complete, the cell is taken back to the original precharged state using the command called PRECHARGE. Upon receiving a PRECHARGE, the wordline is first lowered, thereby disconnecting the cell from the sense amplifier. Then, the two ends of the sense amplifier are driven to $\frac{1}{2}V_{DD}$ using a precharge unit (not shown in the figure for brevity).
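The charge-sharing step can be made concrete with a small calculation. The following is a minimal sketch that assumes an ideal cell and bitline with the bitline precharged to half the supply voltage; the capacitance and supply values are illustrative placeholders, not figures from this paper, and `bitline_voltage_after_sharing` is a hypothetical helper name.

```python
def bitline_voltage_after_sharing(cell_charged, c_cell, c_bitline, vdd):
    """Voltage on the bitline after an ideal charge-sharing phase.

    The bitline starts precharged at vdd / 2. The cell starts at vdd
    (fully charged) or 0 (empty). Charge conservation gives the result.
    """
    v_cell = vdd if cell_charged else 0.0
    total_charge = c_cell * v_cell + c_bitline * (vdd / 2)
    return total_charge / (c_cell + c_bitline)


# Assumed, illustrative values: 22 fF cell, 88 fF bitline, 1.5 V supply.
VDD = 1.5
C_CELL, C_BITLINE = 22e-15, 88e-15

for charged in (True, False):
    v = bitline_voltage_after_sharing(charged, C_CELL, C_BITLINE, VDD)
    delta = v - VDD / 2
    # The sense amplifier amplifies the sign of the deviation.
    final = VDD if delta > 0 else 0.0
    print(f"charged={charged}: delta={delta:+.3f} V, bitline driven to {final} V")
```

The sign of the deviation, not its magnitude, decides the final state, which is why a small capacitor can still be read reliably once the sense amplifier is enabled.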
2.2.3 DRAM MAT/Tile: The Open Bitline Architecture
A major goal of DRAM manufacturers is to maximize the density of the DRAM chips while adhering to certain latency constraints (described in Section 2.2.6). There are two costly components in the setup described in the previous section. The first component is the sense amplifier itself. Each sense amplifier is around two orders of magnitude larger than a single DRAM cell [rambus-power, tl-dram]. Second, the state of the wordline is a function of the address that is currently being accessed. The logic that is necessary to implement this function (for each cell) is expensive.
In order to reduce the overall cost of these two components, they are shared by many DRAM cells. Specifically, each sense amplifier is shared by a column of DRAM cells. In other words, all the cells in a single column are connected to the same bitline. Similarly, each wordline is shared by a row of DRAM cells. Together, this organization consists of a 2-D array of DRAM cells connected to a row of sense amplifiers and a column of wordline drivers. Figure 7 shows this organization with a 2-D array.
To further reduce the overall cost of the sense amplifiers and the wordline driver, modern DRAM chips use an architecture called the open bitline architecture [lisa]. This architecture exploits two observations. First, the sense amplifier is wider than the DRAM cells. This difference in width results in a white space near each column of cells. Second, the sense amplifier is symmetric. Therefore, cells can also be connected to the bottom part of the sense amplifier. Putting together these two observations, we can pack twice as many cells in the same area using the open bitline architecture, as shown in Figure 8.
As shown in the figure, a 2-D array of DRAM cells is connected to two rows of sense amplifiers: one on the top and one on the bottom of the array. While all the cells in a given row share a common wordline, half the cells in each row are connected to the top row of sense amplifiers and the remaining half of the cells are connected to the bottom row of sense amplifiers. This tightly packed structure is called a DRAM MAT/Tile [rethinking-dram, half-dram, salp]. In a modern DRAM chip, each MAT is typically a $512 \times 512$ or $1024 \times 1024$ array. Multiple MATs are grouped together to form a larger structure called a DRAM bank, which we describe next.
2.2.4 DRAM Bank
In most modern commodity DRAM interfaces [ddr3, ddr4, ramulator], a DRAM bank is the smallest structure visible to the memory controller. All commands related to data access are directed to a specific bank. Logically, each DRAM bank is a large monolithic structure with a 2-D array of DRAM cells connected to a single set of sense amplifiers (also referred to as a row buffer). For example, in a 2Gb DRAM chip with 8 banks, each bank has $2^{15}$ rows and each logical row has 8192 DRAM cells. Figure 9 shows this logical view of a bank.
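The row count in this example follows directly from the chip capacity, the bank count, and the row width. A short sketch of the arithmetic, using only the numbers stated in the example (the variable names are illustrative):

```python
CHIP_CAPACITY_BITS = 2 * 2**30  # 2 Gb chip
NUM_BANKS = 8
CELLS_PER_ROW = 8192            # cells in one logical row

bits_per_bank = CHIP_CAPACITY_BITS // NUM_BANKS
rows_per_bank = bits_per_bank // CELLS_PER_ROW

print(rows_per_bank, rows_per_bank == 2**15)
```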
In addition to the MAT, the array of sense amplifiers, and the wordline driver, each bank also consists of some peripheral structures to decode DRAM commands and addresses, and manage the inputs/outputs to the DRAM bank. Specifically, each bank has a row decoder to decode the row address associated with row-level commands (e.g., ACTIVATE). Each data access command (READ and WRITE) accesses only a part of a DRAM row. Such an individual part is referred to as a column. With each data access command, the address of the column to be accessed is provided. This address is decoded by the column selection logic. Depending on which column is selected, the corresponding piece of data is communicated between the sense amplifiers and the bank I/O logic. The bank I/O logic in turn acts as an interface between the DRAM bank and the chip-level I/O logic.
Although the bank can logically be viewed as a single MAT, building a single MAT of a very large size is practically not feasible as it would require very long bitlines and wordlines (leading to very high latencies). Therefore, each bank is physically implemented as a 2-D array of DRAM MATs. Figure 10 shows a physical implementation of the DRAM bank with 4 MATs arranged in a $2 \times 2$ array. As shown in the figure, the output of the global row decoder is sent to each row of MATs. The bank I/O logic, also known as the global sense amplifiers, is connected to all the MATs through a set of global bitlines. As shown in the figure, each vertical collection of MATs has its own column selection logic (CSL) and global bitlines. In a real DRAM chip, the global bitlines run on top of the MATs in a separate metal layer. Data of each column is split equally across a single row of MATs. With this data organization, each global bitline needs to be connected only to bitlines within one MAT. While prior work [rethinking-dram] has explored routing mechanisms to connect each global bitline to all MATs, such a design incurs high complexity and overhead.
Figure 11 shows the zoomed-in version of a DRAM MAT with the surrounding peripheral logic. Specifically, the figure shows how each column selection logic selects specific sense amplifiers from a MAT and connects them to the global bitlines. We note that the number of global bitlines for each MAT (typically 8/16) is much smaller than the width of the MAT (typically 512/1024). This is because the global bitlines span a much longer distance across the chip, so each global bitline wire has to be physically wider to ensure signal integrity.
Each DRAM chip consists of multiple banks, as shown in Figure 12. All the banks share the chip’s internal command, address, and data buses. As mentioned before, each bank operates mostly independently (except for operations that involve the shared buses). The chip I/O logic manages the transfer of data to and from the chip’s internal bus to the memory channel. The width of the chip output (typically 8 bits) is much smaller than the output width of each bank (typically 64 bits). Any piece of data accessed from a DRAM bank is first buffered at the chip I/O and sent out on the memory bus 8 bits at a time. With the DDR (double data rate) technology, 8 bits are sent out each half cycle. Therefore, it takes 4 cycles to transfer 64 bits of data from a DRAM chip I/O logic on to the memory channel.
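The 4-cycle figure follows from the chip output width and the double data rate. A minimal sketch of that calculation (`transfer_cycles` is a hypothetical helper, and the default arguments are the typical values quoted above):

```python
def transfer_cycles(bits_to_send, chip_io_width_bits=8, transfers_per_cycle=2):
    """Clock cycles needed to move data from the chip I/O onto the channel.

    With DDR, one chip_io_width_bits-wide beat is sent on each half cycle,
    so transfers_per_cycle is 2.
    """
    bits_per_cycle = chip_io_width_bits * transfers_per_cycle
    # Ceiling division: a partially used final cycle still counts.
    return -(-bits_to_send // bits_per_cycle)


print(transfer_cycles(64))
```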
2.2.5 DRAM Commands: Accessing Data from a DRAM Chip
To access a piece of data from a DRAM chip, the memory controller must first identify the location of the data: the bank ID ($B$), the row address ($R$) within the bank, and the column address ($C$) within the row. After identifying these pieces of information, accessing the data involves three steps.
The first step is to issue a PRECHARGE to the bank $B$. This step prepares the bank for a data access by ensuring that all the sense amplifiers are in the precharged state (Figure 6, state ➊). No wordline within the bank is raised in this state.
The second step is to activate the row $R$ that contains the data. This step is triggered by issuing an ACTIVATE to bank $B$ with row address $R$. Upon receiving this command, the corresponding bank feeds its global row decoder with the input $R$. The global row decoder logic then raises the wordline of the DRAM row corresponding to address $R$ and enables the sense amplifiers connected to that row. This triggers the DRAM cell operation described in Section 2.2.2. At the end of the activate operation, the data from the entire row of DRAM cells is copied to the corresponding array of sense amplifiers.
The third and final step is to access the data from the required column. This is done by issuing a READ or WRITE command to the bank with the column address $C$. Upon receiving a READ or WRITE command, the corresponding address is fed to the column selection logic. The column selection logic then raises the column selection lines (Figure 11) corresponding to address $C$, thereby connecting those local sense amplifiers to the global sense amplifiers through the global bitlines. For a read access, the global sense amplifiers sense the data from the MAT’s local sense amplifiers and transfer that data to the chip’s internal bus. For a write access, the global sense amplifiers read the data from the chip’s internal bus and force the MAT’s local sense amplifiers to the appropriate state.
Not all data accesses require all three steps. Specifically, if the row that is to be accessed is already activated in the corresponding bank, then the first two steps can be skipped and the data can be directly accessed by issuing a READ or WRITE to the bank. For this reason, the array of sense amplifiers is also referred to as a row buffer, and such an access that skips the first two steps is called a row buffer hit. Similarly, if the bank is already in the precharged state, then the first step can be skipped. Such an access is referred to as a row buffer miss. Finally, if a different row is activated within the bank, then all three steps have to be performed. Such an access is referred to as a row buffer conflict.
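The three cases can be summarized as a small decision function. This is a sketch of the classification described above, not a model of a real memory controller; `commands_for_access` and its argument names are hypothetical.

```python
def commands_for_access(open_row, target_row, op="READ"):
    """Return (commands, category) for one access to a single bank.

    open_row is the row currently held in the row buffer, or None if the
    bank is in the precharged state.
    """
    if open_row == target_row:
        # The row is already in the sense amplifiers: skip steps 1 and 2.
        return [op], "row buffer hit"
    if open_row is None:
        # The bank is precharged: skip step 1 only.
        return ["ACTIVATE", op], "row buffer miss"
    # A different row is open: all three steps are needed.
    return ["PRECHARGE", "ACTIVATE", op], "row buffer conflict"


for open_row in (7, None, 3):
    print(open_row, commands_for_access(open_row, target_row=7))
```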
2.2.6 DRAM Timing Constraints
Different operations within DRAM take different amounts of time. Therefore, after issuing a command, the memory controller must wait for a sufficient amount of time before it can issue the next command. Such wait times are managed by pre-specified fixed delays, called the timing constraints. Timing constraints essentially dictate the minimum amount of time between two commands issued to the same bank/rank/channel. Table 1 describes some key timing constraints along with their values for the DDR3-1600 interface. The reasons these constraints exist are discussed in prior works [salp, tl-dram, chang2017, smc].
| Name | First command | Second command | Description | Value (ns) |
|---|---|---|---|---|
| tRAS | ACTIVATE | PRECHARGE | Time taken to complete a row activation operation in a bank | 35 |
| tRCD | ACTIVATE | READ/WRITE | Time between an activate command and a column command to a bank | 15 |
| tRP | PRECHARGE | ACTIVATE | Time taken to complete a precharge operation in a bank | 15 |
| tWR | WRITE | PRECHARGE | Time taken to ensure that data is safely written to the DRAM cells after a write operation (called write recovery) | 15 |
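As an illustration of how these constraints shape access latency, the sketch below estimates the minimum delay before a READ or WRITE can be issued in each row buffer case. It is a simplified model under stated assumptions: the table values are taken to be in nanoseconds, the bank is otherwise idle, and the wait that tRAS or tWR may impose before the PRECHARGE is ignored. The function name is hypothetical.

```python
# Values from Table 1 (DDR3-1600), assumed to be in nanoseconds.
T_RCD_NS = 15  # ACTIVATE -> READ/WRITE
T_RP_NS = 15   # PRECHARGE -> ACTIVATE


def delay_before_column_command(category):
    """Minimum time (ns) from the first command to the READ/WRITE."""
    if category == "hit":
        return 0                    # row already in the row buffer
    if category == "miss":
        return T_RCD_NS             # ACTIVATE, then wait tRCD
    if category == "conflict":
        return T_RP_NS + T_RCD_NS   # PRECHARGE, wait tRP, ACTIVATE, wait tRCD
    raise ValueError(f"unknown category: {category}")


for category in ("hit", "miss", "conflict"):
    print(category, delay_before_column_command(category), "ns")
```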
2.3 DRAM Module
As mentioned before, each READ or WRITE command for a single DRAM chip typically involves only 64 bits. In order to achieve high memory bandwidth, commodity DRAM modules group several DRAM chips (typically 4 or 8) together to form a rank of DRAM chips. The idea is to connect all chips of a single rank to the same command and address buses, while providing each chip with an independent data bus. In effect, all the chips within a rank receive the same command with the same address, making the rank a logically large DRAM chip.
Figure 13 shows the logical organization of a DRAM rank. Many commodity DRAM ranks consist of 8 chips with each chip accessing 8 bytes of data in response to each READ or WRITE command. Therefore, in total, each READ or WRITE command accesses 64 bytes of data, the typical cache line size in many processors.
2.4 RowClone: Bulk Copy and Initialization using DRAM
RowClone [rowclone] is a mechanism to perform bulk copy and initialization operations completely inside the DRAM. This approach obviates the need to transfer large quantities of data on the memory channel, thereby significantly improving the efficiency of a bulk copy operation. As bulk data initialization (and specifically bulk zeroing) can be viewed as a special case of a bulk copy operation, RowClone can be easily extended to perform such bulk initialization operations with high efficiency.
RowClone consists of two independent mechanisms that exploit several observations about DRAM organization and operation. The first mechanism, called the Fast Parallel Mode (FPM), efficiently copies data between two rows of DRAM cells that share the same set of sense amplifiers (i.e., two rows within the same subarray). To copy data from a source row to a destination row within the same subarray, RowClone-FPM first issues an ACTIVATE to the source row, immediately followed by an ACTIVATE to the destination row. We show that this sequence of back-to-back row activations inside the same subarray results in a data copy from the source row to the destination row. Figure 14 shows the operation of RowClone-FPM on a single cell.
Without any further optimizations, the latency of RowClone-FPM is equivalent to that of two row activations followed by a precharge, which is an order of magnitude faster than existing systems [rowclone].
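The effect of the two back-to-back activations can be illustrated with a toy model of a subarray. This is a behavioral sketch only: it assumes ideal sense amplifiers, models each row as a Python integer, and uses hypothetical class and function names that are not part of RowClone.

```python
class Subarray:
    """Toy model: rows of bits that share one row of sense amplifiers."""

    def __init__(self, rows):
        self.rows = dict(rows)   # row name -> integer bit pattern
        self.sense_amps = None   # None means the subarray is precharged

    def activate(self, row):
        if self.sense_amps is None:
            # Precharged: the sense amplifiers latch the row's data.
            self.sense_amps = self.rows[row]
        else:
            # The sense amplifiers are already driving the bitlines, so
            # the newly connected cells take on the latched value.
            self.rows[row] = self.sense_amps

    def precharge(self):
        self.sense_amps = None


def rowclone_fpm(subarray, src, dst):
    """ACTIVATE src, ACTIVATE dst, then PRECHARGE."""
    subarray.activate(src)
    subarray.activate(dst)
    subarray.precharge()


sa = Subarray({"src": 0b1011, "dst": 0b0000})
rowclone_fpm(sa, "src", "dst")
print(bin(sa.rows["dst"]))
```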
The second mechanism, called the Pipelined Serial Mode (PSM), efficiently copies cache lines between two banks within a module in a pipelined manner. To copy data from a source row in one bank to a destination row in a different bank, RowClone-PSM first activates both the rows. It then uses a newly-proposed command, TRANSFER, to copy a single cache line from the source row directly into the destination row, without having to send the data outside the chip. Figure 15 compares the data path of READ, WRITE, and TRANSFER. Although not as fast as FPM, PSM has fewer constraints and hence is more generally applicable [rowclone].
We refer the reader to [rowclone] for detailed information on RowClone.
3 Ambit: A Bulk Bitwise Execution Engine
In this section, we describe the design and implementation of Ambit, a new mechanism that converts DRAM into a bulk bitwise execution engine with low cost. As mentioned in the introduction, bulk bitwise operations are triggered by many important applications—e.g., bitmap indices [bmide, bmidc, fastbit, oracle, redis, rlite], databases [bitweaving, widetable], document filtering [bitfunnel], DNA processing [bitwise-alignment, shd, gatekeeper, grim, myers1999, nanopore-sequencing, shouji], encryption algorithms [xor1, xor2, enc1], graph processing [pinatubo], and networking [hacker-delight]. Accelerating bulk bitwise operations can significantly boost the performance of these applications.
The first component of our mechanism, Ambit-AND-OR, uses the analog nature of the charge sharing phase to perform bulk bitwise AND and OR directly inside the DRAM chip. It specifically exploits two facts about DRAM operation:
In a subarray, each sense amplifier is shared by many (typically 512 or 1024) DRAM cells on the same bitline.
The final state of the bitline after sense amplification depends primarily on the voltage deviation on the bitline after the charge sharing phase.
Based on these facts, we observe that simultaneously activating three cells, rather than a single cell, results in a bitwise majority function—i.e., at least two cells have to be fully charged for the final state to be a logical “1”. We refer to simultaneous activation of three cells (or rows) as triple-row activation. We now conceptually describe triple-row activation and how we use it to perform bulk bitwise AND and OR operations.
3.1.1 Triple-Row Activation (TRA)
A triple-row activation (TRA) simultaneously connects a sense amplifier with three DRAM cells on the same bitline. For ease of conceptual understanding, let us assume that the three cells have the same capacitance, the transistors and bitlines behave ideally (no resistance), and the cells start at a fully refreshed state. Then, based on charge sharing principles [dram-cd], the bitline deviation at the end of the charge sharing phase of the TRA is:
$$\delta = \frac{k \cdot C_c \cdot V_{DD} + C_b \cdot \frac{1}{2}V_{DD}}{3C_c + C_b} - \frac{1}{2}V_{DD} = \frac{(2k - 3)\,C_c}{6C_c + 2C_b}\,V_{DD} \qquad (1)$$
where $\delta$ is the bitline deviation, $C_c$ is the cell capacitance, $C_b$ is the bitline capacitance, and $k$ is the number of cells in the fully charged state. It is clear that $\delta > 0$ if and only if $2k - 3 > 0$. In other words, the bitline deviation is positive if $k = 2, 3$ and it is negative if $k = 0, 1$. Therefore, we expect the final state of the bitline to be $V_{DD}$ if at least two of the three cells are initially fully charged, and the final state to be $0$ if at least two of the three cells are initially fully empty.
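The sign behavior of the deviation can be checked numerically. The sketch below applies ideal charge conservation to three cells and a bitline precharged to half the supply voltage, and compares the result with the closed form of Equation 1. The capacitance and voltage values are assumed, illustrative numbers, and `tra_deviation` is a hypothetical helper.

```python
def tra_deviation(k, c_cell, c_bitline, vdd):
    """Bitline deviation after charge sharing with k of 3 cells charged."""
    shared = (k * c_cell * vdd + c_bitline * vdd / 2) / (3 * c_cell + c_bitline)
    return shared - vdd / 2


# Assumed, illustrative values: 1.5 V supply, 22 fF cell, 88 fF bitline.
VDD, C_CELL, C_BITLINE = 1.5, 22e-15, 88e-15

for k in range(4):
    delta = tra_deviation(k, C_CELL, C_BITLINE, VDD)
    closed_form = (2 * k - 3) * C_CELL * VDD / (6 * C_CELL + 2 * C_BITLINE)
    assert abs(delta - closed_form) < 1e-12
    print(f"k={k}: delta={delta:+.4f} V, sensed as {1 if delta > 0 else 0}")
```

Under this ideal model, the magnitude of the deviation for k = 1 or k = 2 is smaller than what a single-cell activation produces with the same capacitances, which is one reason the reliability of TRA needs separate attention.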
Figure 16 shows an example TRA where two of the three cells are initially in the charged state ➊. When the wordlines of all the three cells are raised simultaneously ➋, charge sharing results in a positive deviation on the bitline. Therefore, after sense amplification ➌, the sense amplifier drives the bitline to $V_{DD}$, and as a result, fully charges all the three cells.¹
¹ Modern DRAMs use an open-bitline architecture [diva-dram, lisa, data-retention, dram-cd] (Section 2.2.3), where cells are also connected to $\overline{\text{bitline}}$. The three cells in our example are connected to the bitline. However, based on the duality principle of Boolean algebra [boolean], i.e., not (A and B) $\equiv$ (not A) or (not B), TRA works seamlessly even if all the three cells are connected to $\overline{\text{bitline}}$.
If $A$, $B$, and $C$ represent the logical values of the three cells, then the final state of the bitline is $AB + BC + CA$ (the bitwise majority function). Importantly, we can rewrite this expression as $C(A + B) + \overline{C}(AB)$. In other words, by controlling the value of the cell $C$, we can use TRA to execute a bitwise AND or bitwise OR of the cells $A$ and $B$. Since activation is a row-level operation in DRAM, TRA operates on an entire row of DRAM cells and sense amplifiers, thereby enabling a multi-kilobyte-wide bitwise AND/OR of two rows.
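The relationship between the majority function and AND/OR can be verified in a few lines. This sketch models each row as a Python integer and checks that fixing the third operand to all zeros yields AND, while fixing it to all ones yields OR. The function name `tra` and the 8-bit row width are illustrative choices.

```python
from itertools import product


def tra(a, b, c):
    """Bitwise majority of three equal-width rows, modeled as Python ints."""
    return (a & b) | (b & c) | (c & a)


# Single-bit check: the majority equals C(A + B) + (not C)(AB).
for A, B, C in product((0, 1), repeat=3):
    rewritten = (C & (A | B)) | ((1 - C) & (A & B))
    assert tra(A, B, C) == rewritten

# Row-wide use: an 8-bit "row" stands in for a multi-kilobyte DRAM row.
WIDTH = 8
ALL_ONES = (1 << WIDTH) - 1
row_a, row_b = 0b11001010, 0b10100110

assert tra(row_a, row_b, 0) == row_a & row_b          # third row all 0s: AND
assert tra(row_a, row_b, ALL_ONES) == row_a | row_b   # third row all 1s: OR
```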
3.1.2 Making TRA Work
There are five potential issues with TRA that we need to resolve for it to be implementable in a real DRAM design.
1. When simultaneously activating three cells, the deviation on the bitline may be smaller than when activating only one cell. This may lengthen sense amplification or, worse, the sense amplifier may detect the wrong value.
2. Equation 1 assumes that all cells have the same capacitance, and that the transistors and bitlines behave ideally. However, due to process variation, these assumptions are not true in real designs [al-dram, diva-dram, chang2017]. This can affect the reliability of TRA, and thus the correctness of its results.
3. As shown in Figure 16 (state ➌), TRA overwrites the data of all the three cells with the final result value. In other words, TRA overwrites all source cells, thereby destroying their original values.
4. Equation 1 assumes that the cells involved in a TRA are either fully charged or fully empty. However, DRAM cells leak charge over time [raidr]. If the cells involved have leaked significantly, TRA may not operate as expected.
5. Simultaneously activating three arbitrary rows inside a DRAM subarray requires the memory controller and the row decoder to simultaneously communicate and decode three row addresses. This introduces a large cost on the address bus and the row decoder, potentially tripling these structures, if implemented naïvely.
3.1.3 Implementation of Ambit-AND-OR
To solve issues 3, 4, and 5 described in Section 3.1.2, our implementation of Ambit reserves a set of designated rows in each subarray that are used to perform TRAs. These designated rows are chosen statically at design time. To perform a bulk bitwise AND or OR operation on two arbitrary source rows, our mechanism first copies the data of the source rows into the designated rows and performs the required TRA on the designated rows. As an example, to perform a bitwise AND/OR of two rows A and B, and store the result in row R, our mechanism performs the following steps.
1. Copy data of row A to designated row T0
2. Copy data of row B to designated row T1
3. Initialize designated row T2 to all zeros (for AND) or all ones (for OR)
4. Activate designated rows T0, T1, and T2 simultaneously
5. Copy data of row T0 to row R
Let us understand how this implementation addresses the last three issues described in Section 3.1.2. First, by performing the TRA on the designated rows, and not directly on the source data, our mechanism avoids overwriting the source data (issue 3). Second, each copy operation refreshes the cells of the destination row by accessing the row [raidr]. Also, each copy operation has a latency (100 ns to 1 µs) that is five to six orders of magnitude lower than the refresh interval (64 ms). Since these copy operations (Steps 1 and 2 above) are performed just before the TRA, the rows involved in the TRA are very close to the fully refreshed state just before the TRA operation (issue 4). Finally, since the designated rows are chosen statically at design time, the Ambit controller uses a reserved address to communicate the TRA of a pre-defined set of three designated rows. To this end, Ambit reserves a set of row addresses just to trigger TRAs. For instance, in our implementation, to perform a TRA of designated rows T0, T1, and T2 (Step 4 above), the Ambit controller simply issues an ACTIVATE with the reserved address B12 (see Section 4.1 for a full list of reserved addresses). The row decoder maps B12 to all the three wordlines of the designated rows T0, T1, and T2. This mechanism requires no changes to the address bus and significantly reduces the cost and complexity of the row decoder compared to performing TRA on three arbitrary rows (issue 5).
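The five steps can be emulated in software to see why the source rows survive the operation. The sketch below is a behavioral model only: rows are Python integers held in a dictionary, the TRA is modeled as an ideal bitwise majority, and T2 is assumed to be all zeros for AND and all ones for OR, as follows from the majority function. The function names are hypothetical.

```python
def tra(a, b, c):
    """Ideal triple-row activation: bitwise majority of three rows."""
    return (a & b) | (b & c) | (c & a)


def ambit_and_or(rows, src_a, src_b, dst, op, width):
    """Emulate Steps 1 to 5 on a dict of row name -> integer bit pattern."""
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    all_ones = (1 << width) - 1
    rows["T0"] = rows[src_a]                      # Step 1: copy A to T0
    rows["T1"] = rows[src_b]                      # Step 2: copy B to T1
    rows["T2"] = 0 if op == "and" else all_ones   # Step 3: initialize T2
    result = tra(rows["T0"], rows["T1"], rows["T2"])  # Step 4: TRA
    # The TRA overwrites all three designated rows with the result.
    rows["T0"] = rows["T1"] = rows["T2"] = result
    rows[dst] = rows["T0"]                        # Step 5: copy T0 to R


rows = {"A": 0b1100, "B": 0b1010}
ambit_and_or(rows, "A", "B", "R", "and", width=4)
print(bin(rows["R"]), bin(rows["A"]), bin(rows["B"]))
```

Only the designated rows are overwritten by the TRA, so rows A and B keep their original contents.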
3.1.4 Fast Row Copy and Initialization Using RowClone
Our mechanism needs three row copy operations and one row initialization operation. These operations, if performed naïvely, can nullify the benefits of Ambit, as a row copy or row initialization performed using the memory controller incurs high latency [rowclone, lisa]. Fortunately, a recent work, RowClone [rowclone], proposes two techniques to efficiently copy data between rows directly within DRAM. The first technique, RowClone-FPM (Fast Parallel Mode), copies data within a subarray by issuing two back-to-back ACTIVATEs to the source row and the destination row. This operation takes only 80 ns [rowclone]. The second technique, RowClone-PSM (Pipelined Serial Mode), copies data between two banks by using the internal DRAM bus. Although RowClone-PSM is faster and more efficient than copying data using the memory controller, it is significantly slower than RowClone-FPM.
Ambit relies on using RowClone-FPM for most of the copy operations. To enable this, we propose three ideas. First, to allow Ambit to perform the initialization operation using RowClone-FPM, we reserve two control rows in each subarray, C0 and C1. C0 is initialized to all zeros and C1 is initialized to all ones. Depending on the operation to be performed, bitwise AND or OR, Ambit copies the data from C0 or C1 to the appropriate designated row using RowClone-FPM. Second, we reserve separate designated rows in each subarray. This allows each subarray to perform bulk bitwise AND/OR operations on the rows that belong to that subarray by using RowClone-FPM for all the required copy operations. Third, to ensure that bulk bitwise operations are predominantly performed between rows inside the same subarray, we rely on 1) an accelerator API that allows applications to specify bitvectors that are likely to be involved in bitwise operations, and 2) a driver that maps such bitvectors to the same subarray (described in Section 5.2). With these changes, Ambit can use RowClone-FPM for a significant majority of the bulk copy operations, thereby ensuring high performance for the bulk bitwise operations.
A recent work, Low-cost Interlinked Subarrays (LISA) [lisa], proposes a mechanism to efficiently copy data across subarrays in the same bank. LISA uses a row of isolation transistors next to the sense amplifier to control data transfer across two subarrays. LISA can potentially benefit Ambit by improving the performance of bulk copy operations. However, as we will describe in Section 3.2, Ambit-NOT also adds transistors near the sense amplifier, posing some challenges in integrating LISA and Ambit. Therefore, we leave the exploration of using LISA to speed up Ambit as part of future work.
Ambit-NOT exploits the fact that at the end of the sense amplification process, the voltage level of the $\overline{\text{bitline}}$ represents the negated logical value of the cell. Our key idea to perform bulk bitwise NOT in DRAM is to transfer the data on the $\overline{\text{bitline}}$ to a cell that can also be connected to the bitline. For this purpose, we introduce the dual-contact cell (shown in Figure 17). A dual-contact cell (DCC) is a DRAM cell with two transistors (a 2T-1C cell similar to the one described in [2t-1c-1, migration-cell]). Figure 17 shows a DCC connected to a sense amplifier. In a DCC, one transistor connects the cell capacitor to the bitline and the other transistor connects the cell capacitor to the $\overline{\text{bitline}}$. We refer to the wordline that controls the capacitor-bitline connection as the d-wordline (or data wordline) and the wordline that controls the capacitor-$\overline{\text{bitline}}$ connection as the n-wordline (or negation wordline). The layout of the dual-contact cell is similar to Lu et al.’s migration cell [migration-cell].
Figure 18 shows the steps involved in transferring the negated value of a source cell on to the DCC connected to the same bitline (i.e., sense amplifier) ➊. Our mechanism first activates the source cell ➋. The activation drives the bitline to the data value corresponding to the source cell, $V_{DD}$ in this case, and the $\overline{\text{bitline}}$ to the negated value, i.e., 0 ➌. In this activated state, our mechanism activates the n-wordline. Doing so enables the transistor that connects the DCC to the $\overline{\text{bitline}}$ ➍. Since the $\overline{\text{bitline}}$ is already at a stable voltage level of 0, it overwrites the value in the DCC capacitor with 0, thereby copying the negated value of the source cell into the DCC. After this, our mechanism precharges the bank, and then copies the negated value from the DCC to the destination cell using RowClone.
Implementation of Ambit-NOT. Based on Lu et al.’s [migration-cell] layout, the cost of each row of DCC is the same as two regular DRAM rows. Similar to the designated rows used for Ambit-AND-OR (Section 3.1.3), the Ambit controller uses reserved row addresses to control the d-wordlines and n-wordlines of the DCC rows—e.g., in our implementation, address B5 maps to the n-wordline of the DCC row (Section 4.1). To perform a bitwise NOT of row A and store the result in row R, the Ambit controller performs the following steps.
1. Activate row A.
2. Activate the n-wordline of the DCC (address B5).
3. Precharge the bank.
4. Copy data from the d-wordline of the DCC to row R (RowClone).
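As a rough software analogue of the four steps above, the sketch below models the sense amplifier's two terminals as a pair of complementary values. It is only an illustration under simplifying assumptions: rows are fixed-width Python integers, and `ambit_not`, `WIDTH`, and the row name `"DCC"` are hypothetical names.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def ambit_not(rows: dict, src: str, dst: str) -> None:
    """Toy model of the four steps; rows maps row names to WIDTH-bit integers."""
    bitline = rows[src]              # Step 1: activating A drives the bitlines to A ...
    bitline_bar = ~bitline & MASK    # ... and the complementary bitlines to NOT A
    rows["DCC"] = bitline_bar        # Step 2: the n-wordline connects the DCC to bitline-bar
    bitline = bitline_bar = None     # Step 3: precharge discards the sensed state
    rows[dst] = rows["DCC"]          # Step 4: read the DCC through its d-wordline, copy to R


rows = {"A": 0b10110010}
ambit_not(rows, "A", "R")
```

The source row is left untouched, and the negated value only ever passes through the DCC row.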
4 Ambit: Full Design and Implementation
In this section, we describe our implementation of Ambit by integrating Ambit-AND-OR and Ambit-NOT. First, both Ambit-AND-OR and Ambit-NOT reserve a set of rows in each subarray and a set of addresses that map to these rows. We present the full set of reserved addresses and their mapping in detail (Section 4.1). Second, we introduce a new primitive called AAP (ACTIVATE-ACTIVATE-PRECHARGE) that the Ambit controller uses to execute various bulk bitwise operations (Section 4.2). Third, we describe an optimization that lowers the latency of the AAP primitive, further improving the performance of Ambit (Section 4.3). Fourth, we describe how we integrate Ambit with the system stack (Section 5). Finally, we evaluate the hardware cost of Ambit (Section 5.6).
4.1 Row Address Grouping
Our implementation divides the space of row addresses in each subarray into three distinct groups (Figure 19): 1) Bitwise group, 2) Control group, and 3) Data group.
The B-group (or the bitwise group) corresponds to the designated rows used to perform bulk bitwise AND/OR operations (Section 3.1.3) and the dual-contact rows used to perform bulk bitwise NOT operations (Section 3.2). Minimally, Ambit requires three designated rows (to perform triple row activations) and one row of dual-contact cells in each subarray. However, to reduce the number of copy operations required by certain bitwise operations (like xor and xnor), we design each subarray with four designated rows, namely T0 to T3, and two rows of dual-contact cells, one on each side of the row of sense amplifiers. (Each xor/xnor operation involves multiple and, or, and not operations. We use the additional designated row and the DCC row to store intermediate results computed as part of the xor/xnor operation; see Figure 20c.) We refer to the d-wordlines of the two DCC rows as DCC0 and DCC1, and the corresponding n-wordlines as $\overline{\text{DCC0}}$ and $\overline{\text{DCC1}}$. The B-group contains 16 reserved addresses: B0 to B15. Table 2 lists the mapping between the 16 addresses and the wordlines. The first eight addresses individually activate each of the 8 wordlines in the group. Addresses B12 to B15 activate three wordlines simultaneously. Ambit uses these addresses to trigger triple-row activations. Finally, addresses B8 to B11 activate two wordlines. Ambit uses these addresses to copy the result of an operation simultaneously to two rows. This is useful for xor/xnor operations to simultaneously negate a row of source data and also copy the source row to a designated row. Note that this is just an example implementation of Ambit and a real implementation may use more designated rows in the B-group, thereby enabling more complex bulk bitwise operations with fewer copy operations.
|Address||Wordlines activated|
|B12||T0, T1, T2|
|B13||T1, T2, T3|
|B14||DCC0, T1, T2|
|B15||DCC1, T0, T3|
The C-group (or the control group) contains the two pre-initialized rows for controlling the bitwise AND/OR operations (Section 3.1.4). Specifically, this group contains two addresses: C0 (row with all zeros) and C1 (row with all ones).
The D-group (or the data group) corresponds to the rows that store regular data. This group contains all the addresses that are neither in the B-group nor in the C-group. Specifically, if each subarray contains 1024 rows, then the D-group contains 1006 addresses, labeled D0 to D1005. Ambit exposes only the D-group addresses to the software stack. To ensure that the software stack has a contiguous view of memory, the Ambit controller interleaves the row addresses such that the D-group addresses across all subarrays are mapped contiguously to the processor’s physical address space.
With these groups, the Ambit controller can use the existing DRAM interface to communicate all variants of ACTIVATE to the Ambit chip without requiring new commands. Depending on the address group, the Ambit DRAM chip internally processes each ACTIVATE appropriately. For instance, by just issuing an ACTIVATE to address B12, the Ambit controller triggers a triple-row activation of T0, T1, and T2. We now describe how the Ambit controller uses this row address mapping to perform bulk bitwise operations.
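The address grouping can be pictured as a small lookup performed by the row decoder. The sketch below is illustrative only: it assumes 1024 row addresses per subarray, lists only the B-group mappings that the text states explicitly (B4, B5, and B12 to B15), and uses `"nDCC0"` as a hypothetical label for the n-wordline of the first DCC row. Function names are assumptions as well.

```python
ROWS_PER_SUBARRAY = 1024   # example subarray size (assumed)
B_GROUP_ADDRESSES = 16     # B0 to B15
C_GROUP_ADDRESSES = 2      # C0 and C1

# Partial mapping: only the entries stated in the text are included.
B_GROUP_WORDLINES = {
    "B4": ("DCC0",),
    "B5": ("nDCC0",),            # hypothetical label for the n-wordline of DCC0
    "B12": ("T0", "T1", "T2"),
    "B13": ("T1", "T2", "T3"),
    "B14": ("DCC0", "T1", "T2"),
    "B15": ("DCC1", "T0", "T3"),
}


def address_group(addr: str) -> str:
    """Classify an address by its prefix letter."""
    return {"B": "bitwise", "C": "control", "D": "data"}[addr[0]]


def wordlines_for(addr: str) -> tuple:
    """Wordlines raised by a single ACTIVATE to addr."""
    if address_group(addr) == "bitwise":
        return B_GROUP_WORDLINES[addr]
    return (addr,)  # C- and D-group addresses raise exactly one wordline


d_group_size = ROWS_PER_SUBARRAY - B_GROUP_ADDRESSES - C_GROUP_ADDRESSES
```

With these numbers, `d_group_size` is 1006, which matches the D0 to D1005 labelling.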
4.2 Executing Bitwise Ops: The AAP Primitive
Let us consider the operation, Dk = not Di. To perform this bitwise-NOT operation, the Ambit controller sends the following sequence of commands.
1. ACTIVATE Di; 2. ACTIVATE B5; 3. PRECHARGE;
4. ACTIVATE B4; 5. ACTIVATE Dk; 6. PRECHARGE;
The first three steps are the same as those described in Section 3.2. These three operations copy the negated value of row Di into the DCC0 row (as described in Figure 18). Step 4 activates DCC0, the d-wordline of the first DCC row, transferring the negated source data onto the bitlines. Step 5 activates the destination row, copying the data on the bitlines, i.e., the negated source data, to the destination row. Step 6 prepares the array for the next access by issuing a PRECHARGE.
The bitwise-NOT operation consists of two steps of ACTIVATE-ACTIVATE-PRECHARGE operations. We refer to this sequence as the AAP primitive. Each AAP takes two addresses as input. AAP (addr1, addr2) corresponds to the following sequence of commands:
ACTIVATE addr1; ACTIVATE addr2; PRECHARGE;
Logically, an AAP operation copies the result of the row activation of the first address (addr1) to the row(s) mapped to the second address (addr2).
Most bulk bitwise operations mainly involve a sequence of AAP operations. In a few cases, they require a regular ACTIVATE followed by a PRECHARGE, which we refer to as AP. AP takes one address as input. AP (addr) maps to the following two commands:
ACTIVATE addr; PRECHARGE;
Figure 20 shows the steps taken by the Ambit controller to execute three bulk bitwise operations: and, nand, and xor.
Let us consider the and operation, Dk = Di and Dj, shown in Figure 20a. The four AAP operations directly map to the steps described in Section 3.1.3. The first AAP copies the first source row (Di) into the designated row T0. Similarly, the second AAP copies the second source row Dj to row T1, and the third AAP copies the control row “0” to row T2 (to perform a bulk bitwise AND). Finally, the last AAP 1) issues an ACTIVATE to address B12, which simultaneously activates the rows T0, T1, and T2, resulting in an and operation of the rows T0 and T1, 2) issues an ACTIVATE to Dk, which copies the result of the and operation to the destination row Dk, and 3) precharges the bank to prepare it for the next access.
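The and sequence can be replayed on a toy model of one subarray. This is a sketch under stated assumptions, not the controller's implementation: a triple-row activation is modelled as a bitwise majority that overwrites all three rows, the mapping of B0, B1, and B2 to T0, T1, and T2 is assumed (only B12 is taken from the text), and the names `aap` and `bulk_and` are hypothetical.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

# B12 is taken from the text; B0 to B2 mapping to T0 to T2 is an assumption.
WORDLINES = {"B0": ["T0"], "B1": ["T1"], "B2": ["T2"], "B12": ["T0", "T1", "T2"]}


def aap(rows: dict, addr1: str, addr2: str) -> None:
    """ACTIVATE addr1; ACTIVATE addr2; PRECHARGE on a toy subarray."""
    first = WORDLINES.get(addr1, [addr1])
    if len(first) == 3:                       # triple-row activation: bitwise majority
        a, b, c = (rows[r] for r in first)
        sensed = (a & b) | (b & c) | (c & a)
    else:
        sensed = rows[first[0]]
    for r in first:                           # activated cells are restored to the sensed value
        rows[r] = sensed
    for r in WORDLINES.get(addr2, [addr2]):   # second ACTIVATE overwrites the target row(s)
        rows[r] = sensed
    # PRECHARGE has no visible effect in this model.


def bulk_and(rows: dict, di: str, dj: str, dk: str) -> None:
    aap(rows, di, "B0")     # T0 = Di
    aap(rows, dj, "B1")     # T1 = Dj
    aap(rows, "C0", "B2")   # T2 = all zeros (control row for AND)
    aap(rows, "B12", dk)    # Dk = majority(T0, T1, T2) = T0 and T1


rows = {"C0": 0, "C1": MASK, "T0": 0, "T1": 0, "T2": 0,
        "D0": 0b11001100, "D1": 0b10101010, "D2": 0}
bulk_and(rows, "D0", "D1", "D2")
```

Replacing `"C0"` with `"C1"` in the third AAP would turn the same sequence into a bulk or.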
While each bulk bitwise operation involves multiple copy operations, this copy overhead can be reduced by applying standard compilation techniques. For instance, accumulation-like operations generate intermediate results that are immediately consumed. An optimization like dead-store elimination may prevent these values from being copied unnecessarily. Our evaluations (Section 8) take into account the overhead of the copy operations without such optimizations.
4.3 Accelerating AAP with a Split Row Decoder
The latency of executing any bulk bitwise operation using Ambit depends on the latency of the AAP primitive. The latency of the AAP in turn depends on the latency of ACTIVATE, i.e., $t_{RAS}$, and the latency of PRECHARGE, i.e., $t_{RP}$. The naïve approach to execute an AAP is to perform the three operations serially. Using this approach, the latency of AAP is $2t_{RAS} + t_{RP}$ (80 ns for DDR3-1600 [ddr3-1600]). While even this naïve approach offers better throughput and energy efficiency than existing systems (not shown here), we propose a simple optimization that significantly reduces the latency of AAP.
Our optimization is based on two observations. First, the second ACTIVATE of an AAP is issued to an already activated bank. As a result, this ACTIVATE does not require full sense amplification, which is the dominant portion of $t_{RAS}$ [diva-dram, al-dram, chargecache]. This enables the opportunity to reduce the latency for the second ACTIVATE of each AAP. Second, when we examine all the bitwise operations in Figure 20, with the exception of one AAP in nand, we find that exactly one of the two ACTIVATEs in each AAP is to a B-group address. This enables the opportunity to use a separate decoder for B-group addresses, thereby overlapping the latency of the two row activations in each AAP.
To exploit both of these observations, our mechanism splits the row decoder into two parts. The first part decodes all C/D-group addresses and the second smaller part decodes only B-group addresses. Such a split allows the subarray to simultaneously decode a C/D-group address along with a B-group address. When executing an AAP, the Ambit controller issues the second ACTIVATE of an AAP after the first activation has sufficiently progressed. This forces the sense amplifier to overwrite the data of the second row to the result of the first activation. This mechanism allows the Ambit controller to significantly overlap the latency of the two ACTIVATEs. This approach is similar to the inter-segment copy operation used by Tiered-Latency DRAM [tl-dram]. Based on SPICE simulations, our estimate of the latency of executing the back-to-back ACTIVATEs is only 4 ns larger than $t_{RAS}$. For DDR3-1600 (8-8-8) timing parameters [ddr3-1600], this optimization reduces the latency of AAP from 80 ns to 49 ns.
Since only addresses in the B-group are involved in triple-row activations, the complexity of simultaneously raising three wordlines is restricted to the small B-group decoder. As a result, the split row decoder also reduces the complexity of the changes Ambit introduces to the row decoding logic.
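The 80 ns and 49 ns figures quoted in this section can be reproduced with simple arithmetic. The sketch below assumes the usual DDR3-1600 (8-8-8) values of 35 ns for the ACTIVATE latency and 10 ns for the PRECHARGE latency; those two constants are assumptions chosen to be consistent with the quoted totals, and the function names are hypothetical.

```python
T_RAS_NS = 35.0   # assumed ACTIVATE latency for DDR3-1600 (8-8-8)
T_RP_NS = 10.0    # assumed PRECHARGE latency for DDR3-1600 (8-8-8)
OVERLAP_NS = 4.0  # extra time for the overlapped second ACTIVATE (SPICE estimate in the text)


def naive_aap_latency_ns() -> float:
    """Two full ACTIVATEs followed by a PRECHARGE, issued serially."""
    return 2 * T_RAS_NS + T_RP_NS


def overlapped_aap_latency_ns() -> float:
    """Split row decoder: the two ACTIVATEs overlap almost completely."""
    return T_RAS_NS + OVERLAP_NS + T_RP_NS


def bulk_op_latency_ns(num_aaps: int, aap_latency_ns: float) -> float:
    """Latency of an operation made only of AAPs, per bank and per row."""
    return num_aaps * aap_latency_ns


and_naive_ns = bulk_op_latency_ns(4, naive_aap_latency_ns())
and_fast_ns = bulk_op_latency_ns(4, overlapped_aap_latency_ns())
```

For the four-AAP and operation, this gives 320 ns with serial commands and 196 ns with the split decoder, per row-wide operation in one bank.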
5 Integrating Ambit with the System
Ambit can be plugged in as an I/O (e.g., PCIe) device and interfaced with the CPU using a device model similar to other accelerators (e.g., GPU). While this approach is simple, as described in previous sections, the address and command interface of Ambit is exactly the same as that of commodity DRAM. This enables the opportunity to directly plug Ambit onto the system memory bus and control it using the memory controller. This approach has several benefits. First, applications can directly trigger Ambit operations using CPU instructions rather than going through a device API, which incurs additional overhead. Second, since the CPU can directly access Ambit memory, there is no need to copy data between the CPU memory and the accelerator memory. Third, existing cache coherence protocols can be used to keep Ambit memory and the on-chip cache coherent. To plug Ambit onto the system memory bus, we need additional support from the rest of the system stack, which we describe in this section.
5.1 ISA Support
To enable applications to communicate occurrences of bulk bitwise operations to the processor, we introduce new instructions of the form
bbop dst, src1, [src2], size
where bbop is the bulk bitwise operation, dst is the destination address, src1 and src2 are the source addresses, and size denotes the length of the operation in bytes. Note that size must be a multiple of the DRAM row size. For bitvectors whose size is not a multiple of the DRAM row size, we assume that the application will appropriately pad them with dummy data, or perform the residual (sub-row-sized) operations using the CPU.
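One way an application might satisfy the size constraint is to pad each bitvector up to the next row boundary. The helper below is a hypothetical sketch, not part of any Ambit API; it assumes an 8 KB DRAM row and pads with zero bytes.

```python
ROW_SIZE_BYTES = 8 * 1024  # assumed DRAM row size


def pad_to_row_multiple(bitvector: bytes, row_size: int = ROW_SIZE_BYTES) -> bytes:
    """Return bitvector extended with zero bytes to a multiple of row_size."""
    remainder = len(bitvector) % row_size
    if remainder == 0:
        return bitvector
    return bitvector + bytes(row_size - remainder)


padded = pad_to_row_multiple(b"\xff" * 10_000)
```

The application still has to remember the original length, since operations such as not turn the zero padding into ones and a later bitcount would otherwise include those bits.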
5.2 Ambit API/Driver Support
For Ambit to provide significant performance benefit over existing systems, it is critical to ensure that most of the required copy operations are performed using RowClone-FPM, i.e., the source rows and the destination rows involved in bulk bitwise operations are present in the same DRAM subarray. To this end, we expect the manufacturer of Ambit to provide 1) an API that enables applications to specify bitvectors that are likely to be involved in bitwise operations, and 2) a driver that is aware of the internal mapping of DRAM rows to subarrays and maps the bitvectors involved in bulk bitwise operations to the same DRAM subarray. Note that for a large bitvector, Ambit does not require the entire bitvector to fit inside a single subarray. Rather, each bitvector can be interleaved across multiple subarrays such that the corresponding portions of each bitvector are in the same subarray. Since each subarray contains over 1000 rows to store application data, an application can map hundreds of large bitvectors to Ambit, such that the copy operations required by all the bitwise operations across all these bitvectors can be performed efficiently using RowClone-FPM.
5.3 Implementing the bbop Instructions
Since all Ambit operations are row-wide, Ambit requires the source and destination rows to be row-aligned and the size of the operation to be a multiple of the size of a DRAM row. The microarchitecture implementation of a bbop instruction checks if each instance of the instruction satisfies this constraint. If so, the CPU sends the operation to the memory controller, which completes the operation using Ambit. Otherwise, the CPU executes the operation itself.
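The check described above amounts to a few modulo tests. The following sketch shows one possible form of that decision; the 8 KB row size, the treatment of addresses as plain integers, and the names `can_offload` and `dispatch` are all assumptions made for illustration.

```python
from typing import Optional

ROW_SIZE_BYTES = 8 * 1024  # assumed DRAM row size


def can_offload(dst: int, src1: int, src2: Optional[int], size: int,
                row_size: int = ROW_SIZE_BYTES) -> bool:
    """True if every operand is row-aligned and size is a whole number of rows."""
    addresses = [dst, src1] + ([] if src2 is None else [src2])
    aligned = all(addr % row_size == 0 for addr in addresses)
    return aligned and size > 0 and size % row_size == 0


def dispatch(dst: int, src1: int, src2: Optional[int], size: int) -> str:
    """Decide where a bbop instruction executes."""
    return "ambit" if can_offload(dst, src1, src2, size) else "cpu"


aligned_case = dispatch(0x20000, 0x40000, 0x60000, 4 * ROW_SIZE_BYTES)
unaligned_case = dispatch(0x20040, 0x40000, 0x60000, 4 * ROW_SIZE_BYTES)
unary_case = dispatch(0x20000, 0x40000, None, ROW_SIZE_BYTES)
```

A unary operation such as not simply omits the second source, which is why `src2` is optional here.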
5.4 Maintaining On-chip Cache Coherence
Since both CPU and Ambit can access/modify data in memory, before performing any Ambit operation, the memory controller must 1) flush any dirty cache lines from the source rows, and 2) invalidate any cache lines from destination rows. Such a mechanism is already required by Direct Memory Access (DMA) [linux-dma], which is supported by most modern processors, and also by recently proposed mechanisms [rowclone, pointer3d, lazypim-cal, lazypim-arxiv]. As Ambit operations are always row-wide, we can use structures like the Dirty-Block Index [dbi] to speed up flushing dirty data. Our mechanism invalidates the cache lines of the destination rows in parallel with the Ambit operation. Other recently-proposed techniques like LazyPIM [lazypim-cal] and CoNDA [conda] can also be employed with Ambit.
5.5 Error Correction and Data Scrambling
In DRAM modules that support Error Correction Code (ECC), the memory controller must first read the data and ECC to verify data integrity. Since Ambit reads and modifies data directly in memory, it does not work with the existing ECC schemes (e.g., SECDED [secded]). To support ECC with Ambit, we need an ECC scheme that is homomorphic [homomorphism] over all bitwise operations, i.e., ECC(A and B) = ECC(A) and ECC(B), and similarly for other bitwise operations. The only scheme that we are aware of that has this property is triple modular redundancy (TMR) [tmr], wherein ECC(A) = AA. The design of lower-overhead ECC schemes that are homomorphic over all bitwise operations is an open problem. For the same reason, Ambit does not work with data scrambling mechanisms that pseudo-randomly modify the data written to DRAM [data-scrambling]. Note that these challenges are also present in any mechanism that interprets data directly in memory (e.g., the Automata Processor [automata, pap]). We leave the evaluation of Ambit with TMR and exploration of other ECC and data scrambling schemes to future work.
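The homomorphism requirement can be checked exhaustively for small words. The sketch below verifies that the replication code ECC(A) = AA commutes with and, or, and not on 4-bit words, and contrasts it with a single parity bit, which does not commute with and. The parity code is a contrasting example chosen here, not one discussed in the paper, and all names are hypothetical.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
CODE_MASK = (1 << (2 * WIDTH)) - 1


def tmr_ecc(word: int) -> int:
    """ECC(A) = AA: two redundant copies of the data word."""
    return (word << WIDTH) | word


def parity_ecc(word: int) -> int:
    """A single parity bit over the data word."""
    return bin(word).count("1") & 1


def tmr_is_homomorphic() -> bool:
    for a in range(MASK + 1):
        if tmr_ecc(~a & MASK) != (~tmr_ecc(a) & CODE_MASK):
            return False
        for b in range(MASK + 1):
            if tmr_ecc(a & b) != (tmr_ecc(a) & tmr_ecc(b)):
                return False
            if tmr_ecc(a | b) != (tmr_ecc(a) | tmr_ecc(b)):
                return False
    return True


def parity_is_homomorphic_for_and() -> bool:
    return all(parity_ecc(a & b) == (parity_ecc(a) & parity_ecc(b))
               for a in range(MASK + 1) for b in range(MASK + 1))
```

A counterexample for parity is A = 01 and B = 10: both have odd parity, but A and B is all zeros, whose parity is even.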
5.6 Ambit Hardware Cost
As Ambit largely exploits the structure and operation of existing DRAM design, we estimate its hardware cost in terms of the overhead it imposes on top of today’s DRAM chip and memory controller.
5.6.1 Ambit Chip Cost
In addition to support for RowClone, Ambit has only two changes on top of the existing DRAM chip design. First, it requires the row decoding logic to distinguish between the B-group addresses and the remaining addresses. Within the B-group, it must implement the mapping described in Table 2. As the B-group contains only 16 addresses, the complexity of the changes to the row decoding logic is low. The second source of cost is the implementation of the dual-contact cells (DCCs). In our design, each sense amplifier has only one DCC on each side, and each DCC has two wordlines associated with it. In terms of area, each DCC row costs roughly two DRAM rows, based on estimates from Lu et al. [migration-cell]. We estimate the overall storage cost of Ambit to be roughly 8 DRAM rows per subarray, for the four designated rows and the DCC rows (less than 1% of DRAM chip area).
5.6.2 Ambit Controller Cost
On top of the existing memory controller, the Ambit controller must statically store 1) information about different address groups, 2) the timing of different variants of the ACTIVATE, and 3) the sequence of commands required to complete different bitwise operations. When Ambit is plugged onto the system memory bus, the controller can interleave the various AAP operations in the bitwise operations with other regular memory requests from different applications. For this purpose, the Ambit controller must also track the status of on-going bitwise operations. We expect the overhead of these additional pieces of information to be small compared to the benefits enabled by Ambit.
5.6.3 Ambit Testing Cost
Testing Ambit chips is similar to testing regular DRAM chips. In addition to the regular DRAM rows, the manufacturer must test if the TRA operations and the DCC rows work as expected. In each subarray with 1024 rows, these operations concern only 8 DRAM rows and 16 addresses of the B-group. In addition, all these operations are triggered using the ACTIVATE command. Therefore, we expect the overhead of testing an Ambit chip on top of testing a regular DRAM chip to be low.
When a component is found to be faulty during testing, DRAM manufacturers use a number of techniques to improve the overall yield; the most prominent among them is using spare rows to replace faulty DRAM rows. Similar to some prior works [rowclone, tl-dram, al-dram, salp], Ambit requires faulty rows to be mapped to spare rows within the same subarray. Note that, since Ambit preserves the existing DRAM command interface, an Ambit chip that fails during testing can still be shipped as a regular DRAM chip. This significantly reduces any potential negative impact of Ambit-specific failures on overall DRAM yield.
6 Circuit-level SPICE Simulations
We use SPICE simulations to confirm that Ambit works reliably. Of the two components of Ambit, our SPICE results show that Ambit-NOT always works as expected and is not affected by process variation. This is because the Ambit-NOT operation is very similar to existing DRAM operation (Section 3.2). On the other hand, Ambit-AND-OR requires triple-row activation, which involves charge sharing between three cells on a bitline. As a result, it can be affected by process variation in various circuit components.
To study the effect of process variation on TRA, our SPICE simulations model variation in all the components in the subarray (cell capacitance; transistor length, width, resistance; bitline/wordline capacitance and resistance; voltage levels). We implement the sense amplifier using 55 nm DDR3 model parameters [rambus], and PTM low-power transistor models [ptmweb, zhaoptm]. We use cell/transistor parameters from the Rambus power model [rambus] (cell capacitance = 22 fF; transistor width/height = 55 nm/85 nm). (In DRAM, temperature affects mainly cell leakage [al-dram, data-retention, vrt-1, vrt-2, raidr, reaper, chang2017, softmc, rowhammer, dram-latency-puf, fly-dram, vampire, drange]. As TRA is performed on cells that are almost fully-refreshed, we do not expect temperature to affect TRA.)
We first identify the worst case for TRA, wherein every component has process variation that works toward making TRA fail. Our results show that even in this extremely adversarial scenario, TRA works reliably for up to 6% variation in each component.
In practice, variations across components are not so highly correlated. Therefore, we use Monte-Carlo simulations to understand the practical impact of process variation on TRA. We increase the amount of process variation from 5% to 25% and run 100,000 simulations for each level of process variation. Table 3 shows the percentage of iterations in which TRA operates incorrectly for each level of variation.
Two conclusions are in order. First, as expected, up to 5% variation, there are zero errors in TRA. Second, even with 10% and 15% variation, the percentage of erroneous TRAs across 100,000 iterations each is just 0.29% and 6.01%. These results show that Ambit is reliable even in the presence of significant process variation.
The effect of process variation is expected to get worse with smaller technology nodes [kang2014]. However, as Ambit largely uses the existing DRAM structure and operation, many techniques used to combat process variation in existing chips can be used for Ambit as well (e.g., spare rows or columns). In addition, as described in Section 5.6.3, Ambit chips that fail testing only for TRA can potentially be shipped as regular DRAM chips, thereby alleviating the impact of TRA failures on overall DRAM yield, and thus cost.
Note that if an application can tolerate errors in computation, as observed by prior works [approx-computing, yixin-dsn, rfvp-taco, rfvp-pact], then Ambit can be used as an approximate in-DRAM computation substrate even in the presence of significant process variation. Future work can potentially explore such an approximate Ambit substrate.
7 Analysis of Ambit’s Throughput & Energy
We compare the raw throughput of bulk bitwise operations using Ambit to a multi-core Intel Skylake CPU [intel-skylake], an NVIDIA GeForce GTX 745 GPU [gtx745], and processing in the logic layer of an HMC 2.0 [hmc2] device. The Intel CPU has 4 cores with Advanced Vector eXtensions [intel-avx], and two 64-bit DDR3-2133 channels. The GTX 745 contains 3 streaming multi-processors, each with 128 CUDA cores [tesla], and one 128-bit DDR3-1800 channel. The HMC 2.0 device consists of 32 vaults each with 10 GB/s bandwidth. We use two Ambit configurations: Ambit that integrates our mechanism into a regular DRAM module with 8 banks, and Ambit-3D that extends a 3D-stacked DRAM similar to HMC with support for Ambit. For each bitwise operation, we run a microbenchmark that performs the operation repeatedly for many iterations on large input vectors (32 MB), and measure the throughput of the operation. Figure 21 plots the results of this experiment for the five systems (the y-axis is in log scale).
We draw three conclusions. First, the throughputs of Skylake, GTX 745, and HMC 2.0 are limited by the memory bandwidth available to the respective processors. With an order of magnitude higher available memory bandwidth, HMC 2.0 achieves 18.5X and 13.1X better throughput for bulk bitwise operations compared to Skylake and GTX 745, respectively. Second, Ambit, with its ability to exploit the maximum internal DRAM bandwidth and memory-level parallelism, outperforms all three systems. On average, Ambit (with 8 DRAM banks) outperforms Skylake by 44.9X, GTX 745 by 32.0X, and HMC 2.0 by 2.4X. Third, 3D-stacked DRAM architectures like HMC contain a large number of banks (256 banks in 4GB HMC 2.0). By extending 3D-stacked DRAM with support for Ambit, Ambit-3D improves the throughput of bulk bitwise operations by 9.7X compared to HMC 2.0.
We estimate energy for DDR3-1333 using the Rambus power model [rambus]. Our energy numbers include only the DRAM and channel energy, and not the energy consumed by the processor. For Ambit, some activations have to raise multiple wordlines and hence consume higher energy. Based on our analysis, the activation energy increases by 22% for each additional wordline raised. Table 4 shows the energy consumed per kilobyte for different bitwise operations. Across all bitwise operations, Ambit reduces energy consumption by 25.1X to 59.5X compared to copying data with the memory controller using the DDR3 interface.
8 Effect on Real-World Applications
We evaluate the benefits of Ambit on real-world applications using the Gem5 full-system simulator [gem5]. Table 5 lists the main simulation parameters. Our simulations take into account the cost of maintaining coherence, and the overhead of RowClone to perform copy operations. We assume that application data is mapped such that all bitwise operations happen across rows within a subarray. We quantitatively evaluate three applications: 1) a database bitmap index [oracle, redis, rlite, fastbit], 2) BitWeaving [bitweaving], a mechanism to accelerate database column scan operations, and 3) a bitvector-based implementation of the widely-used set data structure. In Section 8.4, we discuss four other applications that can benefit from Ambit.
|Processor||x86, 8-wide, out-of-order, 4 GHz, 64-entry instruction queue|
|L1 cache||32 KB D-cache, 32 KB I-cache, LRU policy|
|L2 cache||2 MB, LRU policy, 64 B cache line size|
|Memory Controller||8 KB row size, FR-FCFS [frfcfs, frfcfs-patent] scheduling|
|Main memory||DDR4-2400, 1-channel, 1-rank, 16 banks|
8.1 Bitmap Indices
Bitmap indices [bmide] are an alternative to traditional B-tree indices for databases. Compared to B-trees, bitmap indices 1) consume less space, and 2) can perform better for many queries (e.g., joins, scans). Several major databases support bitmap indices (e.g., Oracle [oracle], Redis [redis], Fastbit [fastbit], rlite [rlite]). Several real applications (e.g., [spool, belly, bitmapist, ai]) use bitmap indices for fast analytics. As bitmap indices heavily rely on bulk bitwise operations, Ambit can accelerate bitmap indices, thereby improving overall application performance.
To demonstrate this benefit, we use the following workload from a real application [ai]. The application uses bitmap indices to track users’ characteristics (e.g., gender) and activities (e.g., did the user log in to the website on day ’X’?) for $u$ users. Our workload runs the following query: “How many unique users were active every week for the past $w$ weeks? and How many male users were active each of the past $w$ weeks?” Executing this query requires $6w$ bulk bitwise or, $2w-1$ bulk bitwise and, and $w+1$ bulk bitcount operations. In our mechanism, the bitcount operations are performed by the CPU. Figure 22 shows the end-to-end query execution time of the baseline and Ambit for the above experiment for various values of $u$ and $w$.
We draw two conclusions. First, as each query has $O(w)$ bulk bitwise operations and each bulk bitwise operation takes $O(u)$ time, the query execution time increases with increasing value of $uw$. Second, Ambit significantly reduces the query execution time compared to the baseline, by 6X on average.
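The operation counts for this query can be checked with a small model. The sketch below is one plausible decomposition, assuming the index stores one activity bitmap per day (one bit per user) and writing `w` for the number of weeks. Bitmaps are Python integers, and the function and variable names are hypothetical.

```python
def run_query(daily: list, male: int, w: int):
    """daily holds 7*w per-day bitmaps; male is the bitmap of male users."""
    ops = {"or": 0, "and": 0, "bitcount": 0}

    def bor(x: int, y: int) -> int:
        ops["or"] += 1
        return x | y

    def band(x: int, y: int) -> int:
        ops["and"] += 1
        return x & y

    def bitcount(x: int) -> int:
        ops["bitcount"] += 1
        return bin(x).count("1")

    weekly = []
    for week in range(w):
        days = daily[7 * week: 7 * week + 7]
        active = days[0]
        for day in days[1:]:
            active = bor(active, day)          # 6 ORs per week
        weekly.append(active)

    every_week = weekly[0]
    for active in weekly[1:]:
        every_week = band(every_week, active)  # w - 1 ANDs
    unique_every_week = bitcount(every_week)   # 1 bitcount

    male_per_week = [bitcount(band(male, active)) for active in weekly]  # w ANDs, w bitcounts
    return unique_every_week, male_per_week, ops


daily = [0] * 14
daily[0], daily[3], daily[5] = 0b0001, 0b0010, 0b0100   # week 0
daily[7], daily[13] = 0b0001, 0b0100                    # week 1
unique_every_week, male_per_week, ops = run_query(daily, male=0b0101, w=2)
```

For `w = 2` the model performs 12 or, 3 and, and 3 bitcount operations, matching the 6w, 2w-1, and w+1 pattern.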
While we demonstrate the benefits of Ambit using one query, as all bitmap index queries involve several bulk bitwise operations, we expect Ambit to provide similar performance benefits for any application using bitmap indices.
8.2 BitWeaving: Fast Scans using Bitwise Operations
Column scan operations are a common part of many database queries. They are typically performed as part of evaluating a predicate. For a column with integer values, a predicate is typically of the form, c1 <= val <= c2, for two integer constants c1 and c2. Recent works [bitweaving, vectorizing-column-scans] observe that existing data representations for storing columnar data are inefficient for such predicate evaluation especially when the number of bits used to store each value of the column is less than the processor word size (typically 32 or 64). This is because 1) the values do not align well with word boundaries, and 2) the processor typically does not have comparison instructions at granularities smaller than the word size. To address this problem, BitWeaving [bitweaving] proposes two column representations, called BitWeaving-H and BitWeaving-V. As BitWeaving-V is faster than BitWeaving-H, we focus our attention on BitWeaving-V, and refer to it as just BitWeaving.
BitWeaving stores the values of a column such that the first bit of all the values of the column are stored contiguously, the second bit of all the values of the column are stored contiguously, and so on. Using this representation, the predicate c1 <= val <= c2, can be represented as a series of bitwise operations starting from the most significant bit all the way to the least significant bit (we refer the reader to the BitWeaving paper [bitweaving] for the detailed algorithm). As these bitwise operations can be performed in parallel across multiple values of the column, BitWeaving uses the hardware SIMD support to accelerate these operations. With support for Ambit, these operations can be performed in parallel across a larger set of values compared to 128/256-bit SIMD available in existing CPUs, thereby enabling higher performance.
We show this benefit by comparing the performance of BitWeaving using a baseline CPU with support for 128-bit SIMD to the performance of BitWeaving accelerated by Ambit for the following commonly-used query on a table T:
‘select count(*) from T where c1 <= val <= c2’
Evaluating the predicate involves a series of bulk bitwise operations and the count(*) requires a bitcount operation. The execution time of the query depends on 1) the number of bits (b) used to represent each value val, and 2) the number of rows (r) in the table T. Figure 23 shows the speedup of Ambit over the baseline for various values of b and r.
We draw three conclusions. First, Ambit improves the performance of the query by between 1.8X and 11.8X (7.0X on average) compared to the baseline for various values of b and r. Second, the performance improvement of Ambit increases with increasing number of bits per column (b), because, as b increases, the fraction of time spent in performing the bitcount operation reduces. As a result, a larger fraction of the execution time can be accelerated using Ambit. Third, for b = 4, 8, 12, and 16, we observe large jumps in the speedup of Ambit as we increase the row count. These large jumps occur at points where the working set stops fitting in the on-chip cache. By exploiting the high bank-level parallelism in DRAM, Ambit can outperform the baseline (by up to 4.1X) even when the working set fits in the cache.
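To make the vertical layout and the predicate evaluation concrete, the following is a minimal sketch of a generic bit-sliced range comparison. It is not the exact algorithm from the BitWeaving paper; the function names are hypothetical, and Python integers stand in for the per-bit-position bitvectors. Each loop iteration issues a small, fixed number of bitwise operations that apply to all rows at once, which is the part that SIMD or Ambit would accelerate.

```python
def to_bit_slices(values, num_bits):
    """Vertical layout: slices[0] holds the most significant bit of every value,
    slices[1] the next bit, and so on. Bit j of a slice belongs to values[j]."""
    slices = []
    for bit in range(num_bits - 1, -1, -1):
        s = 0
        for row, v in enumerate(values):
            s |= ((v >> bit) & 1) << row
        slices.append(s)
    return slices

def compare_to_constant(slices, constant, num_bits, num_rows):
    """Return (lt, eq): bitvectors of rows with val < constant and val == constant."""
    all_rows = (1 << num_rows) - 1
    lt, eq = 0, all_rows
    for i, s in enumerate(slices):  # most significant bit first
        c_bit = (constant >> (num_bits - 1 - i)) & 1
        if c_bit:
            # Rows equal so far that have a 0 where the constant has a 1 are smaller.
            lt |= eq & ~s & all_rows
            eq &= s
        else:
            # Rows with a 1 where the constant has a 0 are no longer equal.
            eq &= ~s & all_rows
    return lt, eq

def count_between(values, c1, c2, num_bits):
    """Evaluate: select count(*) where c1 <= val <= c2."""
    num_rows = len(values)
    all_rows = (1 << num_rows) - 1
    slices = to_bit_slices(values, num_bits)
    lt1, _ = compare_to_constant(slices, c1, num_bits, num_rows)
    lt2, eq2 = compare_to_constant(slices, c2, num_bits, num_rows)
    ge_c1 = ~lt1 & all_rows           # val >= c1
    le_c2 = lt2 | eq2                 # val <= c2
    return bin(ge_c1 & le_c2).count("1")  # final bitcount

values = [3, 7, 0, 12, 9, 15, 5]
matches = count_between(values, c1=3, c2=9, num_bits=4)
```

The number of loop iterations grows with the number of bits per value, while the final bitcount is performed once, which is in line with the observation above that the bitcount fraction shrinks as b increases.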
8.3 Bitvectors vs. Red-Black Trees
Many algorithms heavily use the set data structure. Red-black trees [red-black-tree] (RB-trees) are typically used to implement a set [stl]. However, a set with a limited domain can be implemented using a bitvector: a set that contains only elements from to , can be represented using an -bit bitvector (e.g., Bitset [stl]). Each bit indicates whether the corresponding element is present in the set. Bitvectors provide constant-time insert and lookup operations, compared to the logarithmic time taken by RB-trees. However, set operations like union, intersection, and difference have to scan the entire bitvector regardless of the number of elements actually present in the set. As a result, for these three operations, depending on the number of elements in the set, bitvectors may outperform or perform worse than RB-trees. With support for fast bulk bitwise operations, we show that Ambit significantly shifts the trade-off in favor of bitvectors for these three operations.
To demonstrate this, we compare the performance of set union, intersection, and difference using: RB-tree, bitvectors with 128-bit SIMD support (Bitset), and bitvectors with Ambit. We run a benchmark that performs each operation on input sets and stores the result in an output set. We restrict the domain of the elements to be from to . Therefore, each set can be represented using an -bit bitvector. For each of the three operations, we run multiple experiments varying the number of elements (e) actually present in each input set. Figure 24 shows the execution time of RB-tree, Bitset, and Ambit normalized to RB-tree for the three operations for , and .
We draw three conclusions. First, by enabling much higher throughput for bulk bitwise operations, Ambit outperforms the baseline Bitset on all the experiments. Second, as expected, when the number of elements in each set is very small (16 out of ), RB-Tree performs better than Bitset and Ambit (with the exception of union). Third, even when each set contains only 64 or more elements out of , Ambit significantly outperforms RB-Tree, 3X on average. We conclude that Ambit makes the bitvector-based implementation of a set more attractive than the commonly-used red-black-tree-based implementation.
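The mapping from set operations to bulk bitwise operations can be sketched as follows. This is an illustration only: Python integers stand in for the bitvectors, and the helper names `make_bitvector` and `members` are hypothetical.

```python
def make_bitvector(elements):
    """Bit i is set if and only if element i is present in the set."""
    bv = 0
    for e in elements:
        bv |= 1 << e  # constant-time insert
    return bv

def members(bv):
    """Recover the elements from a bitvector (for checking results)."""
    return {i for i in range(bv.bit_length()) if (bv >> i) & 1}

a = make_bitvector({1, 5, 9, 200})
b = make_bitvector({5, 9, 77})

union = a | b          # bulk bitwise OR
intersection = a & b   # bulk bitwise AND
difference = a & ~b    # bulk bitwise AND with NOT
```

Each of the three operations touches the whole bitvector no matter how few bits are set, which is why their cost depends on the domain size rather than on the number of elements.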
8.4 Other Applications
We describe five other examples of applications that can significantly benefit from Ambit in terms of both performance and energy efficiency. We leave the evaluation of these applications with Ambit to future work.
8.4.1 BitFunnel: Web Search
Web search is an important workload in modern systems. From the time a query is issued to the time the results are sent back, web search involves several steps. Document filtering is one of the most time consuming steps. It identifies all documents that contain all the words in the input query. Microsoft recently open-sourced BitFunnel [bitfunnel], a technology that improves the efficiency of document filtering. BitFunnel represents both documents and queries as a bag of words using Bloom filters [bloomfilter]. It then uses bitwise AND operations on specific locations of the Bloom filters to efficiently identify documents that contain all the query words. With Ambit, this operation can be significantly accelerated by simultaneously performing the filtering for thousands of documents.
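The filtering step can be illustrated with a simplified sketch. It is not BitFunnel's actual implementation: the filter width, the number of hash functions, the use of SHA-256, and all function names are assumptions made for illustration. Each Bloom filter position is stored as one bitvector across documents, so a query reduces to ANDing the rows selected by the query words.

```python
import hashlib

NUM_BITS = 64    # hypothetical Bloom filter width
NUM_HASHES = 3   # hypothetical number of hash functions

def bit_positions(word):
    """Bloom filter positions for a word, derived from SHA-256 (illustrative choice)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return {
        int.from_bytes(digest[4 * k:4 * k + 4], "big") % NUM_BITS
        for k in range(NUM_HASHES)
    }

def build_rows(documents):
    """rows[p] is a bitvector over documents: bit d is set if document d's
    Bloom filter has position p set."""
    rows = [0] * NUM_BITS
    for d, text in enumerate(documents):
        for word in text.split():
            for p in bit_positions(word):
                rows[p] |= 1 << d
    return rows

def candidate_documents(rows, query, num_docs):
    """AND the rows selected by the query words. The result may contain
    false positives (a Bloom filter property) but no false negatives."""
    result = (1 << num_docs) - 1
    for word in query.split():
        for p in bit_positions(word):
            result &= rows[p]  # one bulk AND covers every document at once
    return [d for d in range(num_docs) if (result >> d) & 1]

documents = [
    "dram bulk bitwise and",
    "web search query",
    "bulk bitwise operations in dram",
    "bitmap index",
]
rows = build_rows(documents)
candidates = candidate_documents(rows, "bulk dram", len(documents))
```

Because each row spans all documents, the width of a single AND determines how many documents are filtered simultaneously.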
8.4.2 Masked Initialization
Masked initializations [intel-mmx] are very useful in applications like graphics (e.g., for clearing a specific color in an image). Masked operations can be represented using bitwise AND or OR operations. For example, the bit of an integer can be set using . Similarly, the bit can be reset using . These masks can be preloaded into DRAM rows and can be used to perform masked operations on a large amount of data. This operation can be easily accelerated using Ambit.
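As an illustration, the standard idioms for setting and resetting a bit, and for applying a preloaded mask to many words, can be sketched as follows. The function names, the 24-bit RGB pixel encoding, and the mask value are assumptions chosen for the example.

```python
def set_bit(x, i):
    """Set bit i of x by ORing with a mask that has only bit i set."""
    return x | (1 << i)

def reset_bit(x, i):
    """Clear bit i of x by ANDing with a mask that has every bit set except bit i."""
    return x & ~(1 << i)

# Masked initialization over many values: clear the red channel of
# 24-bit RGB pixels (assumed encoding 0xRRGGBB) with a single AND mask.
CLEAR_RED = 0x00FFFF
pixels = [0xFF8040, 0x123456, 0xABCDEF]
cleared = [p & CLEAR_RED for p in pixels]
```

In the bulk setting, the mask would be replicated across a row once and then reused for every data row it is applied to.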
8.4.3 Encryption
Many encryption algorithms heavily use bitwise operations (e.g., XOR) [xor1, xor2, enc1]. Ambit’s support for fast bulk bitwise operations can i) boost the performance of existing encryption algorithms, and ii) enable new encryption algorithms with high throughput and efficiency.
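As a small illustration, XOR can be composed from AND, OR, and NOT alone, using the identity a XOR b = (a AND NOT b) OR (NOT a AND b). The sketch below applies it to a whole buffer at once. The function name `bulk_xor` is hypothetical, and the random keystream is only a stand-in; this is not a secure cipher and not a description of how any particular algorithm or Ambit itself implements XOR.

```python
import os

def bulk_xor(a, b):
    """XOR two equal-length byte strings using only AND, OR, and NOT."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    mask = (1 << (len(a) * 8)) - 1      # keeps NOT within the buffer width
    x = int.from_bytes(a, "big")
    y = int.from_bytes(b, "big")
    not_x = ~x & mask
    not_y = ~y & mask
    result = (x & not_y) | (not_x & y)  # a XOR b
    return result.to_bytes(len(a), "big")

plaintext = b"bulk bitwise operations"
keystream = os.urandom(len(plaintext))  # stand-in keystream, illustration only
ciphertext = bulk_xor(plaintext, keystream)
recovered = bulk_xor(ciphertext, keystream)
```

Applying the same keystream twice recovers the input, which is the property that XOR-based schemes rely on.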
8.4.4 DNA Sequence Mapping
In DNA sequence mapping, prior works [dna-algo1, dna-algo2, dna-algo3, dna-algo4, shd, dna-our-algo, bitwise-alignment, gatekeeper, grim, lee2015, nanopore-sequencing, shouji, magnet] propose algorithms to map sequenced reads to the reference genome. Some works [shd, dna-our-algo, bitwise-alignment, gatekeeper, grim, myers1999, shouji] heavily use bulk bitwise operations. Ambit can significantly improve the performance of such DNA sequence mapping algorithms [grim].
8.4.5 Machine Learning
Recent works have shown that deep neural networks (DNNs) outperform other machine learning techniques in many problems such as image classification and speech recognition. Given the high computational requirements of DNNs, there has been a focus on neural networks with binary values [xnor-net, bnn]. A recent work also exploits bit-serial computation to execute DNN inference algorithms using SIMD bitwise operations in the on-chip SRAM cache [neural-cache]. In conjunction with such techniques that increase the fraction of bitwise operations in DNNs, Ambit can significantly improve the performance of DNN algorithms.
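To illustrate why binary-valued networks map well to bitwise operations, the sketch below computes the dot product of two vectors with entries in {+1, -1} using XNOR followed by a bitcount, a formulation commonly associated with binarized networks. The bit encoding (1 for +1, 0 for -1) and the function names are assumptions made for this example.

```python
def pack(signs):
    """Pack a list of +1/-1 values into an integer: bit i is 1 for +1, 0 for -1."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.
    XNOR marks positions where the signs agree; each agreement contributes +1
    and each disagreement -1, so the result is 2 * matches - n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
dot = binary_dot(pack(a), pack(b), len(a))
```

The XNOR step is a bulk bitwise operation over the packed vectors, while the bitcount corresponds to the count operation discussed as future work.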
9 Future Work
There are several avenues for future work that can build on top of the Ambit proposal. We briefly describe these avenues in this section.
9.1 Extending Ambit to Other Operations
We envision two major classes of operations that can significantly increase the domain of applications that can benefit from Ambit.
The first operation is count. Counting the number of non-zero bits can be a very useful operation in many applications, including the ones evaluated in this paper (Section 8). Extending this operation to count the number of non-zero integers (of different widths) can further expand the scope of this operation.
The second operation is shift. Most arithmetic operations require some kind of bitwise shift. Shifting is also heavily used in many encryption algorithms [xor1, xor2, enc1] and bioinformatics algorithms [bitwise-alignment, shd, gatekeeper, grim, myers1999, shouji]. Extending Ambit to support bit shifting at different granularities can allow Ambit to significantly accelerate these applications. While recent works [drisa, dracc] explore a possible implementation of shifting inside DRAM, several challenges remain unaddressed (e.g., data mapping, handling column failures in DRAM).
9.2 Evaluation of New Applications with Ambit
In Section 8.4, we discussed five other applications that can benefit significantly from Ambit. A concrete evaluation of these applications can 1) further strengthen the case for Ambit, and 2) reveal other potential operations that may require acceleration in DRAM in order to obtain good end-to-end application speedups (e.g., bitcount).
9.3 Redesigning Applications to Exploit Ambit
In Section 8.2, we show how the BitWeaving [bitweaving] technique can be accelerated using Ambit. BitWeaving is a technique that redesigns the data layout of tables in a database and uses an appropriate algorithm to execute large scan queries efficiently. It is the modified data layout that makes BitWeaving amenable to SIMD/Ambit acceleration. We believe similar careful redesign techniques can be applied to other important workloads, such as graph processing, machine learning, and bioinformatics algorithms, to make them amenable to acceleration using Ambit.
9.4 Taking Advantage of Approximate Ambit
Ambit requires a costly ECC mechanism to avoid errors during computation. In addition, the error rate may increase with increasing process variation. However, if an application can tolerate errors in computation, as observed by prior works [approx-computing, yixin-dsn, rfvp-taco, rfvp-pact], then Ambit can be used as an approximate in-DRAM computation substrate.
10 Conclusion
In this paper, we focused on bulk bitwise operations, a class of operations heavily used by some important applications. Existing systems are inefficient at performing such operations as they have to transfer a large amount of data on the memory channel, resulting in high latency, high memory bandwidth consumption, and high energy consumption.
We described Ambit, which employs the notion of Processing using Memory introduced in a recent work [pum-bookchapter]. Ambit converts DRAM-based main memory into a bulk bitwise operation execution engine that performs bulk bitwise operations completely inside the memory [ambit]. Ambit has two component mechanisms. The first mechanism exploits the fact that many DRAM cells share the same sense amplifier and uses the idea of simultaneous activation of three rows of DRAM cells to perform bitwise MAJORITY/AND/OR operations efficiently. The second mechanism uses the inverters already present inside DRAM sense amplifiers to efficiently perform bitwise NOT operations. Since Ambit heavily exploits the internal organization and operation of DRAM, it incurs very low cost on top of the commodity DRAM architecture ( chip area overhead).
Our evaluations show that Ambit enables one to two orders of magnitude improvement in the raw throughput and energy consumption of bulk bitwise operations compared to existing DDR interfaces. We describe many real-world applications that can take advantage of Ambit. Our evaluations with three such applications show that Ambit can improve the average performance of these applications compared to the baseline by between 3.0X and 11.8X. Given its low cost and large performance improvements, we believe Ambit is a promising execution substrate that can accelerate applications that heavily use bulk bitwise operations. We hope future work will uncover even more potential in Ambit-like execution.
We thank the reviewers of ISCA 2016/2017, MICRO 2016/2017, and HPCA 2017 for their valuable comments on various drafts of shorter versions of this work. We thank the members of the SAFARI group and PDL for their feedback. We acknowledge the generous support of our industrial partners, over the years, especially AliBaba, Google, Huawei, Intel, Microsoft, Nvidia, Samsung, Seagate, and VMWare. This work was supported in part by NSF, SRC, and the Intel Science and Technology Center for Cloud Computing. Some components of this work appeared in IEEE CAL [bitwise-cal], MICRO [ambit], ADCOM [pum-bookchapter], and Seshadri’s Ph.D. thesis [vivek-thesis].