The Bitlet Model: Defining a Litmus Test for the Bitwise Processing-in-Memory Paradigm

Kunal Korgaonkar et al. · October 22, 2019

This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to understand the affinity of workloads to processing-in-memory (PIM) as opposed to traditional computing. The tool uncovers interesting trade-offs between operation complexity (cycles required to perform an operation through PIM) and other key parameters, such as system memory bandwidth, data transfer size, the extent of data alignment, and effective memory capacity involved in PIM computations. Despite its simplicity, the model has already proven useful. In the future, we intend to extend and refine Bitlet to further increase its utility.


1 Introduction

Processing huge amounts of data on traditional von Neumann architectures involves many data transfers between the CPU and the memory. These transfers degrade performance and consume energy [18, 20, 21, 6, 4]. Enabled by emerging memory technologies, recent processing-in-memory (PIM) solutions show great potential for reducing costly data transfers by performing computations using individual memory cells [19, 2, 24, 17, 13]. This line of research has led to better circuits and micro-architectures [13, 14, 1], as well as to applications that use this paradigm [12, 8].

Despite the recent resurgence of PIM, it is still very challenging to analyze and quantify the advantages or disadvantages of PIM solutions over other computing paradigms. We believe a useful analytical modeling tool for PIM can play a crucial role. An analytical tool in this context has many potential uses, such as in (i) evaluation of applications mapped to PIM, (ii) comparison of PIM versus traditional architectures, and (iii) analysis of the implications of new memory technology trends on PIM.

Our Bitlet model is an analytical modeling tool that addresses the challenge of better understanding PIM relative to traditional CPU/GPU computing. The name Bitlet reflects PIM’s unique bit-by-bit data element processing approach. The model is inspired by past successful analytical models for computing [7, 10, 23, 5, 11] and provides a simple operational view of PIM computations.

The main contributions of this work are:

  • Presentation of the Bitlet model, an analytical modeling tool that abstracts algorithmic, technological, and architectural machine parameters for PIM.

  • Definition of a litmus test that assesses the affinity of workloads to PIM as compared to the CPU.

  • Delineation of the strengths and weaknesses of the new PIM paradigm as observed in a sensitivity study evaluating PIM performance and efficiency over various Bitlet model parameters.

2 The Bitlet Model

We derive a parameterized throughput metric for PIM, followed by one for the CPU. The focus on throughput aligns with the parallelism that the PIM approach offers. We first describe the model and then explain how to apply it. Throughout the paper, we refer to ‘PIM’ as a framework for processing inside memories.

2.1 Deriving PIM Throughput

We base the PIM side of the Bitlet model on the principle of performing computations within memristive memory arrays, wherein processing occurs inside the arrays using a stateful in-memory logic family (e.g., IMPLY [2] or MAGIC [13]). The execution does not necessitate moving data out of the memory arrays if the data is already present there. The other key principle of the proposed PIM model is its reliance on a series of simple operations to compute any complex operation inside the memories (e.g., MAGIC uses the simple NOR as its basic operation).

These principles are the foundation of what are currently known as true PIM solutions, which offer advantages such as simplified peripheral circuitry, less reliance on additional external arithmetic units, and lower energy consumption. We base the Bitlet model on true PIM solutions, given their wide applicability and advantages. Although we use MAGIC [13] as an example of a stateful in-memory logic family to illustrate true PIM, our model is also easily extendable to other stateful in-memory logic families. The supporting circuitry and micro-architecture for our PIM model resemble, but are not limited to, those described by Haj-Ali et al. [9].

We derive PIM throughput by considering operation complexity, data placement and alignment issues, and energy efficiency. We start by discussing operation complexity.

Operation Complexity. In the Bitlet model, PIM computations are carried out as a series of NOR operations applied to the memory cells of a row inside a memristive memory array. Each row of the memory array stores the input data required for processing. A two-input, single-bit NOR gate processes two data bits within the row and stores the output bit in the same row. Any intermediate data is processed similarly. Processing proceeds sequentially in this fashion to produce the final output, which is also stored within the same row. Data processing in the Bitlet model is thus best viewed as row-wise and bit-by-bit within each row of a memory array. We use the two-input, single-bit NOR gate as the default basic logic operation [13], permitting a maximum of two input bits to be processed per memory cycle.
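For illustration, the following Python sketch (our own construction, not the MAGIC circuit itself) shows how a 1-bit AND decomposes into three NOR operations, the basic primitive assumed by the model; an N-bit AND therefore costs 3N cycles, as quoted below.

```python
# Illustrative sketch: composing row-local logic from two-input NOR,
# the basic PIM primitive assumed by the Bitlet model (names are ours).

def nor(a: int, b: int) -> int:
    """Two-input NOR on single bits; one PIM memory cycle in the model."""
    return 1 - (a | b)

def and_bit(a: int, b: int) -> int:
    """1-bit AND built from 3 NOR operations: AND(a,b) = NOR(NOT a, NOT b)."""
    return nor(nor(a, a), nor(b, b))  # NOT x = NOR(x, x)

# An N-bit bitwise AND therefore costs 3N cycles per row, matching Fig. 1.
assert all(and_bit(a, b) == (a & b) for a in (0, 1) for b in (0, 1))
```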

While each row is processed bit-by-bit, the effective throughput of PIM is increased by the inherent parallelism achieved by simultaneously processing multiple rows inside a memory array and multiple memory arrays in the system memory. We assume the same computations (i.e., individual operations) applied to a row are also applied in parallel, in every cycle, across all the rows (ROW) of a memory array. This parallelism is made possible by the 2D structure of the memory arrays and by reuse of the voltage signals that operate an individual row across all the rows. Although the choice to only process row-wise may seem restrictive, it naturally maximizes data-level parallelism and hence PIM throughput. Moreover, the multiple memory arrays (MAT) further multiply this parallelism. Finally, the cycle time, CT, of a single basic PIM operation also impacts overall PIM performance: the shorter it is, the faster the processing.

Fig. 1: PIM operation complexity in cycles for different types of operations and data sizes. MPY refers to a multiplication operation. Other arithmetic and logic operations are also shown.

Fig. 1 shows how the bit length (N) of the input data affects the number of computing cycles required for PIM-based processing. The figure shows that this number is affected both by the data size and by the operation type (different operations follow different curves on the graph). With this model, for example, an N-bit AND requires 3N cycles (e.g., for N = 16, AND takes 16×3 = 48 cycles), ADD requires 9N cycles (ADD can be further improved through an algorithmic optimization that uses four-input NOR instead of two-input NOR), and multiply (MPY) requires 13N²−14N cycles [8]. We define the operation complexity parameter (OC), for a given operation type and data size, as the number of cycles required to process the corresponding data.
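These cycle counts can be written as simple closed forms in the bit length N; the sketch below encodes the counts quoted in the text and Fig. 1 (the OR count, 2N, is taken from the worked example in Section 3.1).

```python
# Operation complexity OC (cycles) as a function of input bit length N,
# using the counts quoted in the text and Fig. 1.

def oc_and(n): return 3 * n                # N-bit AND: 3N cycles
def oc_or(n):  return 2 * n                # N-bit OR: 2N cycles (Section 3.1)
def oc_add(n): return 9 * n                # N-bit ADD: 9N cycles
def oc_mpy(n): return 13 * n * n - 14 * n  # N-bit MPY: 13N^2 - 14N cycles [8]

assert oc_and(16) == 48 and oc_add(16) == 144 and oc_mpy(16) == 3104
```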

The throughput of PIM is thus captured by four parameters: OC, CT, ROW, and MAT (see Table I). The throughput of the system, in operations per second, can be expressed as:

Throughput-PIM = (ROW × MAT) / (OC × CT)    (1)
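As a sanity check, Eq. 1 can be evaluated directly; the minimal sketch below (parameter names follow Table I) reproduces the 728 GOPS figure derived for 16-bit ADD in Section 3.1.

```python
# Eq. 1 as code: raw PIM throughput in operations/second (a sketch;
# parameter names follow Table I).

def pim_throughput(row: int, mat: int, oc: int, ct: float) -> float:
    """row: rows per array, mat: number of arrays,
    oc: operation complexity in cycles, ct: cycle time in seconds."""
    return (row * mat) / (oc * ct)

# 16-bit ADD (OC = 144) on 1024 MATs of 1024 rows at CT = 10 ns:
print(pim_throughput(1024, 1024, 144, 10e-9))  # ~7.28e11 ops/s (728 GOPS)
```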

Placement and Alignment Complexity. PIM imposes certain constraints on data alignment and placement [22]. To align the data for subsequent row-parallel operations, a series of data alignment and placement steps may be needed. The number of cycles needed to perform these additional steps is captured by the placement and alignment complexity parameter, denoted as PAC. Currently, for simplicity, we focus on modeling the cost of intra-array data movements and assume that the multiple memory arrays continue to operate in parallel and independently. We have already observed that PIM performance is quite sensitive to intra-array data movements (Section 3). In the future, we plan to refine the model to cover inter-array data movements.

The following expression extends Eq. 1 to account for the presence of unaligned and misplaced data elements:

Throughput-PIM = (ROW × MAT) / ((OC + PAC) × CT)    (2)

The PAC cycles can, in turn, be broken down into a series of vertical, column-parallel moves and horizontal, row-parallel moves that bring the data in a memory array to the desired locations. While the vertical moves correct data element misplacement, the horizontal moves take care of unaligned data elements. Given H and V as the total numbers of horizontal and vertical moves needed, respectively, PAC = H + V. The horizontal moves are performed bit-by-bit for a given data element, and hence their count is typically proportional to the size of the data element involved. In most cases, the same alignment is applied to all data elements; thus, the same bit moves in parallel in all rows. When the involved data elements across different rows are not aligned, separate horizontal moves must be made individually for each data element (increasing the cost). A vertical move for a given data element, on the other hand, is parallelizable. However, to cover the many data elements distributed across the rows, many such vertical moves need to be performed serially.

Parameter name                         | Notation  | Value(s)        | Type
PIM operation complexity               | OC        | 1 - 32K cycles  | Algo.
PIM placement and alignment complexity | PAC       | 0 - 1024×1024   | Algo.
PIM cycle time                         | CT        | 10 ns [15]      | Tech.
PIM array dimensions                   | ROW × COL | 1024 × 1024     | Tech.
PIM array count                        | MAT       | 1K - 16K        | Arch.
PIM energy for op (OC = 1)             | EPC       | 0.1 pJ [15]     | Tech.
CPU memory bandwidth                   | BW        | 1 - 16 Tbps     | Arch.
CPU data in-out bits                   | DIO       | 24, 48          | Algo.
CPU energy per bit transfer            | EPB       | 15 pJ [3]       | Tech.
TABLE I: Bitlet model parameters.

As an example, suppose a(i), b(i), and c(i) are three data element vectors inside a memory array and the computation requires performing ‘a(i) = b(i+1) + c(i)’. Then b(i+1) is both unaligned and misplaced. For this scenario, each b(i+1) is relocated to a temporary location t(i) through multiple horizontal moves and a single vertical move. Only after the relocations is the actual computation, which in this case is a(i) = t(i) + c(i), performed. To relocate b(i+1) to t(i), first N horizontal moves occur, ensuring alignment of all b(i+1), each of size N bits; these are followed by as many as ROW vertical moves (ROW−1 moves occur within the MAT and one out of it), which take care of the misplacement of each individual b(i+1) inside each row. Therefore, in this scenario, PAC = N + ROW cycles.
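To ground the arithmetic, here is a minimal Python sketch of Eq. 2 applied to this example, assuming the Table I values (N = 16, ROW = MAT = 1024, CT = 10 ns); the function names are ours.

```python
# Eq. 2 applied to the a(i) = b(i+1) + c(i) example (a sketch; names ours).

def pim_throughput_pac(row, mat, oc, pac, ct):
    """Eq. 2: throughput with placement/alignment overhead, in ops/s."""
    return (row * mat) / ((oc + pac) * ct)

N, ROW, MAT, CT = 16, 1024, 1024, 10e-9
pac = N + ROW                 # 16 horizontal + 1024 vertical moves = 1040
oc_add = 9 * N                # 144 cycles for a 16-bit ADD
print(pim_throughput_pac(ROW, MAT, oc_add, pac, CT))  # ~8.9e10 (~88 GOPS)
```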

Energy Efficiency. The maximum throughput of PIM or the CPU is limited by the thermal design power (TDP). For PIM, the throughput depends on the energy per unit of computation, i.e., the energy spent on a single computation cycle (EPC, the energy for OC = 1). Building on Eq. 2, we quantify the power-limited (PL) throughput as follows:

PL-Throughput-PIM = min((ROW × MAT) / ((OC + PAC) × CT), TDP / ((OC + PAC) × EPC))    (3)
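The following sketch encodes Eq. 3 as we reconstruct it, with the power cap expressed as TDP divided by the per-operation energy; it reproduces the roughly 1950 MATs sustainable at 20 W reported in Section 3.2.

```python
# Power-limited PIM throughput (our reading of Eq. 3): the raw throughput
# of Eq. 2 capped by what the power budget TDP can sustain, given an
# energy cost of (OC + PAC) * EPC joules per operation.

def pl_throughput_pim(row, mat, oc, pac, ct, epc, tdp):
    raw = (row * mat) / ((oc + pac) * ct)  # Eq. 2 throughput
    cap = tdp / ((oc + pac) * epc)         # ops/s the power budget allows
    return min(raw, cap)

def max_active_mats(row, ct, epc, tdp):
    """MAT count at which the power cap binds: each MAT draws ROW*EPC/CT watts."""
    return int(tdp * ct / (row * epc))

print(max_active_mats(1024, 10e-9, 0.1e-12, 20))  # ~1953 (text: ~1950 at 20 W)
```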

Table I summarizes the PIM-related parameters of the Bitlet model. For conceptual clarity and to aid our analysis, we designate three parameter types: technological, architectural, and algorithmic. Typical values, or the ranges for the different parameters, are also listed in the table.

2.2 Deriving CPU Throughput

Given the objective of the Bitlet model to assess the affinity of workloads or workload phases to PIM versus CPU, the model focuses on workloads (or workload phases) with high memory intensity and relies on a relatively simple CPU model. The overall distinction in modeling between PIM and CPU is described below.

For the workload phases considered in the Bitlet litmus test, PIM-based computations (as outlined in Section 2.1) occur inside the memory arrays, without any data transfers outside them; that is, they are limited only by operation complexity and by data placement and alignment costs. The CPU, on the other hand, is assumed to be limited primarily by its use of external memory bandwidth, i.e., by the cost of data transfers between the CPU and the memory, ignoring the cost of computations and data movements performed within the CPU itself.

Data Transfer. The Bitlet model therefore derives the CPU throughput assuming the memory bandwidth between the CPU and the memories, together with the amount of data transferred per operation, to be the primary limiting factors. Large amounts of data transferred between the CPU and the memory result in lower CPU throughput, while smaller volumes produce the opposite effect. The extent of data transfer between the CPU and the memory is captured by the data in-out (DIO) model parameter. DIO is the average amount of data transferred per operation and must account for all the data transfers (in bits) between the CPU and the memory resulting from inputs, outputs, and any temporary results. Along with DIO, the external memory bandwidth (denoted as BW) between the CPU and the memory determines the final throughput (memory bandwidth may depend on the number of channels). The CPU throughput, in operations per second, is defined as:

Throughput-CPU = BW / DIO    (4)
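Eq. 4 in code form (a minimal sketch; the example treats 1 Tbps as 1024 Gbps, which matches the figures quoted in Section 3.1):

```python
# Eq. 4 as code: CPU throughput limited purely by memory traffic (sketch).

def cpu_throughput(bw_bits_per_s: float, dio_bits: float) -> float:
    return bw_bits_per_s / dio_bits

print(cpu_throughput(4 * 1024e9, 48))  # ~8.5e10 ops/s (85 GOPS at 4 Tbps, DIO = 48)
```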

To support a broader analysis across all types of workloads, including phases with high CPU arithmetic intensity, a more accurate CPU model would be useful. One possibility is to include a maximum arithmetic throughput term in Eq. 4, similar to the arithmetic intensity limit of the Roofline model [23]. We leave extending the Bitlet model with more detailed CPU-side modeling for future work.

Energy Efficiency. On the CPU front, the energy per bit transferred between the CPU and the memory (denoted as EPB and also listed in Table I) determines the efficiency of CPU computations. We assume that the CPU compute energy is significantly lower than the data transfer energy. This aligns with our focus on identifying the strengths of PIM rather than those of the CPU. The power-limited performance for CPU computation is expressed as:

PL-Throughput-CPU = min(BW / DIO, TDP / (DIO × EPB))    (5)
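A sketch of Eq. 5 as we reconstruct it, reproducing the 55 GOPS figure of Section 3.2 (BW = 16 Tbps, DIO = 24, EPB = 15 pJ, TDP = 20 W):

```python
# Power-limited CPU throughput (our reading of Eq. 5): Eq. 4 capped by the
# power budget, with DIO * EPB joules of transfer energy per operation.

def pl_throughput_cpu(bw, dio, epb, tdp):
    raw = bw / dio               # Eq. 4
    cap = tdp / (dio * epb)      # ops/s sustainable at the power limit
    return min(raw, cap)

# BW = 16 Tbps, DIO = 24, EPB = 15 pJ, TDP = 20 W -> ~5.6e10 (~55 GOPS)
print(pl_throughput_cpu(16 * 1024e9, 24, 15e-12, 20))
```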

Table I summarizes the CPU-related parameters, including the typical values or ranges they are set to. We vary the memory bandwidth parameter from 1 to 16 Tbps to show the sensitivity of the model to memory bandwidth.

Fig. 2: Throughput comparison of CPU vs. PIM. A crossover point where the CPU starts performing better than PIM is shown.

3 Applying the Bitlet Model

In this section, we apply the Bitlet model. We start by comparing the throughput of basic operations for PIM versus the CPU, and then compare PIM to the CPU over a wider parameter design space.

3.1 PIM vs. CPU - Basic Operations

Below we discuss a few examples to illustrate the use of the Bitlet model. Note that although we only compare PIM to the CPU, the model assumptions and the comparisons can be easily extended to GPUs as well.

PIM (16-bit) ADD, OR and MPY. Consider an ADD operation that adds two 16-bit inputs and produces a 16-bit output, and assume all data elements are perfectly aligned. This operation on a data element takes 144 cycles (OC = 144, i.e., 9N for N = 16). Assuming there are 1024 MATs and each MAT holds 1024 data elements (rows = # data elements), the achieved throughput is (1024×1024)/(144×10 ns) = 728 GOPS. Now consider a 16-bit OR operation, which has two 16-bit inputs and produces a 16-bit output. In this case OC = 32 (2N for N = 16) and the throughput is (1024×1024)/(32×10 ns) = 3276 GOPS. Finally, consider a 16-bit MPY (multiplication) producing a 32-bit result. In this case, OC = 3104 (13N²−14N for N = 16). Here, the throughput is (1024×1024)/(3104×10 ns) = 33 GOPS. For low-precision multiplication that produces only a 16-bit output, OC = 1544 and the throughput is (1024×1024)/(1544×10 ns) = 67 GOPS.

CPU (16-bit) ANY. We consider ‘any’ binary operation that operates on two 16-bit inputs and produces a 16-bit output (e.g., 16-bit ADD, 16-bit OR, and low-precision 16-bit MPY). The DIO is thus (16×2+16) = 48 bits (DIO = 24 for two 8-bit inputs and one 8-bit output). For any of these operations, the effective throughput of the CPU is 4 Tbps/48 = 85 GOPS. For an OR operation, the CPU is inferior to PIM, which benefits, in this scenario, from lower operation complexity, high data parallelism, and obliviousness to external memory bandwidth. For MPY, on the other hand, PIM is inferior to the CPU due to the higher operation complexity. If the memory bandwidth is reduced to 1024 Gbps, the CPU throughput becomes 1 Tbps/48 = 21 GOPS for any 16-bit binary operation with a 16-bit output. Since memory bandwidth is the main limiter here, CPU throughput becomes worse than PIM even for MPY.
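The litmus test itself thus reduces to a one-line comparison of Eq. 1 against Eq. 4; the sketch below (names ours) reproduces the OR and MPY verdicts above.

```python
# Putting Eqs. 1 and 4 together as the litmus test (a sketch): given an
# operation's OC and the CPU's DIO, which side wins on raw throughput?

def prefer_pim(row, mat, oc, ct, bw, dio):
    return (row * mat) / (oc * ct) > bw / dio

ROW = MAT = 1024; CT = 10e-9; BW = 4 * 1024e9   # 4 Tbps
print(prefer_pim(ROW, MAT, 32, CT, BW, 48))     # OR:  True  (PIM wins)
print(prefer_pim(ROW, MAT, 3104, CT, BW, 48))   # MPY: False (CPU wins)
```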

3.2 PIM vs. CPU - Impact of Model Parameters

PIM throughput is sensitive to various Bitlet model parameters. In this section, a sensitivity study over these parameters highlights some of the strengths and weaknesses of the new PIM paradigm.

Operational Complexity Impact. Fig. 2 shows the throughput of PIM versus that of the CPU, assuming PAC = 0. Diagonal lines represent PIM with varying MAT counts (set to 1/16/256/1024/4096/16384). A single 1024×1024 memory array has a 128 KB capacity. Horizontal lines are for CPUs with varying DIO bits (set to 24/48) and BW = 1 Tbps/4 Tbps/16 Tbps.

Using Eq. 1, we observe that PIM throughput increases with MAT availability, peaking when all available memory arrays are used (MAT = 16K) and the operation complexity is the lowest possible (OC = 1). Conversely, PIM throughput decreases with increasing operation complexity. CPU throughput, in turn, decreases with higher DIO. For instance, consider the lines shown for DIO = 24 and DIO = 48 at the same BW = 1 Tbps: the CPU’s performance for DIO = 48 is lower than for DIO = 24.

For a configuration of MAT = 1024, DIO = 24, and BW = 4 Tbps, the CPU performs better than PIM at OC = 612 or higher. This marks the crossover point and sets the boundary of the favorable region for PIM in this configuration. Note the placement of the OR, ADD, and MPY operations along the x-axis in Fig. 2. Clearly, OR (OC = 32) and ADD (OC = 144) are located to the left of the crossover point, and MPY (OC = 3104) is to the right. The left region is where PIM is superior; the right region is where the CPU is superior.

The crossover point shifts to the right for different parameter values. For instance, for MAT = 1024 and ROW = 1024 (with BW = 1 Tbps), the crossover point shifts roughly from OC = 2500 to OC = 5000 as DIO grows from 24 to 48. Thus, it is the algorithmic interplay of OC and DIO (along with other technological and architectural factors) that determines the throughput of PIM relative to that of CPU computing.
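Equating Eq. 1 with Eq. 4 gives a closed form for the crossover complexity, OC* = ROW × MAT × DIO / (BW × CT); the sketch below reproduces the approximate 2500 and 5000 figures (again treating 1 Tbps as 1024 Gbps).

```python
# Crossover operation complexity (sketch): equating Eq. 1 and Eq. 4 gives
# OC* = ROW * MAT * DIO / (BW * CT); PIM wins for OC below OC*.

def crossover_oc(row, mat, dio, bw, ct):
    return row * mat * dio / (bw * ct)

print(crossover_oc(1024, 1024, 24, 1024e9, 10e-9))  # ~2458 (text: ~2500)
print(crossover_oc(1024, 1024, 48, 1024e9, 10e-9))  # ~4915 (text: ~5000)
```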

Fig. 3: Throughput comparison of CPU vs. PIM under power limits. The number of PIM MATs permissible under a given power limit is also shown.

Placement and Alignment Complexity Impact. Under a perfect data layout, PAC = 0, and there is no PIM performance loss due to placement and alignment issues. Sometimes, however, horizontal moves, vertical moves, or both become necessary for placement and alignment reasons, resulting in some performance loss. Below we discuss the nature and magnitude of these losses, captured through the PAC parameter and Eq. 2.

Consider the previous example ‘a(i) = b(i+1) + c(i)’ (assuming 16-bit addition), which necessitates both horizontal and vertical moves. In this example, PAC = 16 + 1024 = 1040 and, following Eq. 2, the throughput is (1024×1024)/((144+1040)×10 ns) = 88 GOPS. Effectively, the throughput shrinks to 12% of its original value of 728 GOPS (as per Eq. 1). Now consider a scenario requiring just the horizontal moves for alignment. In this case, prior to the additions, 16 horizontal moves are needed, i.e., PAC = 16. The throughput is then 655 GOPS, just a 10% performance loss relative to 728 GOPS. Finally, if k subsets of the data elements must each be aligned separately, the horizontal cost multiplies, i.e., PAC = k×N + ROW in this example; a higher k implies higher losses in throughput. With increasing PAC costs, there is less impetus for processing using PIM (instead of the CPU) unless the cost is amortized over time. Additionally, the trade-offs may shift with the number of rows in a memory array and with any additional hardware support for fast relocation (e.g., parallel vertical moves). Finally, while Fig. 2 showed CPU vs. PIM trade-offs assuming PAC is zero, a higher PAC, just like a higher OC, adversely impacts PIM performance.

Energy Efficiency Impact. As shown in Fig. 3, and based on Eq. 3, a maximum of about 1950 MATs can be accommodated for PIM at a power envelope of 20 W. Increasing the number of MATs beyond this point does not increase throughput, since the system power budget is the main limiter. At 40 W, up to 3900 memory arrays (MATs) can be active at any given time.

For the CPU, the energy cost of data transfer limits PL-Throughput-CPU. Here, we assume BW = 16 Tbps. With a power limit of 20 W, the CPU delivers 55 GOPS at DIO = 24. At a power budget of 40 W, 111 GOPS are possible, and 444 GOPS at 160 W. Compare this against the raw (no power limitation) CPU throughput, which is 682 GOPS at DIO = 24.

The values of the energy parameters EPC and EPB affect the relative energy efficiency of PIM versus the CPU. For example, consider the case of a single-bit NOR operation, where OC = 1 (a single MAGIC operation) and DIO = 3 (two input bits and one output bit). In this case, PIM consumes 1×EPC = 0.1 pJ while the CPU consumes 3×EPB = 45 pJ; CPU energy consumption is thus approximately 450× higher than that of PIM. However, as OC increases, the relative efficiency of PIM decreases. For the limiting case of OC = 7200 or higher (for DIO = 48, the CPU consumes 48×15 pJ = 720 pJ per operation, and 720 pJ/0.1 pJ = 7200), PIM becomes less attractive than the CPU with respect to energy efficiency. Note, however, that different energy parameter values would shift the relative merits of PIM and the CPU.
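The breakeven arithmetic can be checked directly; a minimal sketch, assuming the Table I energy values:

```python
# Per-operation energy comparison (sketch): PIM spends OC * EPC, the CPU
# spends DIO * EPB; their ratio sets the energy-efficiency breakeven.

EPC, EPB = 0.1e-12, 15e-12           # joules (Table I)

def pim_energy(oc):  return oc * EPC
def cpu_energy(dio): return dio * EPB

print(cpu_energy(3) / pim_energy(1))   # 1-bit NOR: ~450x in PIM's favor
print(cpu_energy(48) / pim_energy(1))  # breakeven OC: 720 pJ / 0.1 pJ = 7200
```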

The above examples assume PAC = 0 for simplicity, but in reality a non-zero PAC will lead to higher energy consumption per operation, effectively reducing PIM’s energy efficiency advantage over the CPU. We leave further analysis of these and other model parameters for future work.

4 Conclusions

This paper motivates and describes Bitlet, an analytical model for PIM. We show how to use the model to find the cases in which PIM is beneficial and to understand the related trade-offs and limits, in a parameterized fashion. We hope the model will shed more light on the new bitwise-PIM paradigm.

5 Acknowledgement

This work was supported by the European Research Council through the European Union’s Horizon 2020 Research and Innovation Programme under Grant 757259 and by the Israel Science Foundation under Grant 1514/17.

References

  • [1] D. Bhattacharjee et al., ReVAMP: ReRAM based VLIW arch. for in-mem. comp., DATE 2017.
  • [2] J. Borghetti et al., Memristive switches enable ‘stateful’ logic operations via material implication, Nature, vol. 464, pp. 873–876, 2010.
  • [3] M. O’Connor et al., Fine-grained DRAM: energy-eff. DRAM for extr. band. sys., MICRO 2017.
  • [4] C. Eckert et al., Neural cache: Bit-serial in-cache acc. of deep neural netw., ISCA 2018.
  • [5] H. Esmaeilzadeh et al., Dark silicon and the end of multicore scaling, in IEEE Micro, vol. 32, no. 3, 2012.
  • [6] D. Fujiki et al., In-memory data parallel processor, ASPLOS 2018.
  • [7] J. L. Gustafson, Reevaluating Amdahl’s law, Commun. ACM, vol. 31, no. 5, 1988.
  • [8] A. Haj-Ali et al., Imaging: In-mem. algo. for image proc., IEEE TCAS I, vol. 65-12, 2018.
  • [9] A. Haj-Ali et al., Not in name alone: A memristive mem. proc. unit for real in-mem. processing, IEEE Micro, vol. 38, no. 5, 2018.
  • [10] M. Hill et al., Amdahl’s law in the multicore era, IEEE Computer, vol. 41, no. 7, 2008.
  • [11] M. Hill et al., Gables: A roofline model for mobile SOCs, HPCA 2019.
  • [12] M. Imani et al., Ultra-efficient processing in-memory for data inten. app., DAC 2017.
  • [13] S. Kvatinsky et al., Magic: memristor-aided logic, IEEE TCAS II 2014.
  • [14] S. Kvatinsky et al., Memristor-based mater. impl. (imply) logic, TVLSI, vol. 22-10, 2014.
  • [15] M. Lanza et al., Recom. methods to study resistive switching dev., Adv. Elec. Materials, vol. 5, no. 1, 2019.
  • [16] Y. Levy et al., Logic ops. in mem. using a memris. Akers arr., Microelec. J., vol. 45, no. 11, 2014.
  • [17] E. Linn et al., Beyond von Neumann logic operations in passive crossbar arrays along side memory operations, Nanotechnology, vol. 23, no. 30, 2012.
  • [18] P. Ranganathan, From micro. to nanostores: Rethinking data-centric sys., IEEE Comp., vol. 44, no. 1, 2011.
  • [19] S. Raoux et al., Phase-change random access mem.: A scalable tech., IBM J. Res. Dev., vol. 52, no. 4, 2008.
  • [20] S. Seshadri et al., Willow: A user-programmable SSD, OSDI 2014.
  • [21] V. Seshadri et al., Ambit: In-mem. acc. for bulk bitwise ops. using commodity dram tech., MICRO 2017.
  • [22] N. Talati et al., Practical chal. in delivering the promises of real PIM machines, DATE 2018.
  • [23] S. Williams et al., Roofline: An insightful vis. perf. model for multicore arch., CACM, vol. 52, no. 4, 2009.
  • [24] H. S. P. Wong et al., Metal oxide RRAM, Proc. IEEE, vol. 100, no. 6, 2012.