1 Introduction
Processing huge amounts of data on traditional von Neumann architectures involves many data transfers between the CPU and the memory. These transfers degrade performance and consume energy [18, 20, 21, 6, 4]. Enabled by emerging memory technologies, recent processing-in-memory (PIM) solutions show great potential for reducing costly data transfers by performing computations using individual memory cells [19, 2, 24, 17, 13]. This line of research has led to better circuits and microarchitectures [13, 14, 1], as well as applications using this paradigm [12, 8].
Despite the recent resurgence of PIM, it is still very challenging to analyze and quantify the advantages or disadvantages of PIM solutions over other computing paradigms. We believe a useful analytical modeling tool for PIM can play a crucial role. An analytical tool in this context has many potential uses, such as in (i) evaluation of applications mapped to PIM, (ii) comparison of PIM versus traditional architectures, and (iii) analysis of the implications of new memory technology trends on PIM.
Our Bitlet model is an analytical modeling tool that addresses the challenge of better understanding PIM relative to traditional CPU/GPU computing. The name Bitlet reflects PIM’s unique bit-by-bit data element processing approach. The model is inspired by past successful analytical models for computing [7, 10, 23, 5, 11] and provides a simple operational view of PIM computations.
The main contributions of this work are:

Presentation of the Bitlet model, an analytical modeling tool that abstracts algorithmic, technological, and architectural machine parameters for PIM.

Definition of a litmus test that assesses the affinity of workloads to PIM as compared to the CPU.

Delineation of the strengths and weaknesses of the new PIM paradigm as observed in a sensitivity study evaluating PIM performance and efficiency over various Bitlet model parameters.
2 The Bitlet Model
We derive a parameterized throughput metric for PIM followed by one for the CPU. The throughput focus is in alignment with the parallelism that the PIM approach offers. We first describe the model and then proceed to explain how to apply it. Throughout the paper, we refer to ‘PIM’ as a framework for processing inside memories.
2.1 Deriving PIM Throughput
We base the PIM side of the Bitlet model on the principle of performing computations using memristive memory arrays, wherein processing occurs inside the memory arrays using a stateful in-memory logic family (e.g., IMPLY [2] and MAGIC [13]). The execution does not necessitate moving data out of the memory arrays if the data is already present there. The other key principle of the proposed PIM model is its reliance on a series of simple operations to compute any complex operation inside the memories (e.g., MAGIC uses simple NOR as the basic operation).
These principles are the foundation of what are currently known as true PIM solutions, which offer advantages such as simplified peripheral circuitry, less reliance on additional external arithmetic units, and lower energy consumption. We base the Bitlet model on true PIM solutions, given their wide applicability and advantages. Although we use MAGIC [13] as an example of a stateful in-memory logic family to illustrate true PIM, our model is easily extendable to other stateful in-memory logic families. The supporting circuitry and microarchitecture for our PIM model resemble, but are not limited to, those described by Haj-Ali et al. [9].
We derive PIM throughput by considering operation complexity, data placement and alignment issues, and energy efficiency. We start by discussing operation complexity.
Operation Complexity. In the Bitlet model, PIM computations are carried out as a series of NOR operations applied to the memory cells of a row inside a memristive memory array. Each row of the memory array stores the input data required for processing. A two-input, single-bit NOR gate processes two data bits within the row and stores the output bit in the same row. Any intermediate data are processed similarly. Processing proceeds sequentially in this fashion to produce the final output, which is also stored within the same row. Data processing as per the Bitlet model is thus best viewed as row-wise and bit-by-bit within the row of a memory array. We use a default two-input NOR gate as the basic logic operation [13], permitting a maximum of two input bits to be processed per memory cycle.
While each row is processed bit-by-bit, the effective throughput of PIM is increased by the inherent parallelism achieved by simultaneously processing multiple rows inside a memory array and multiple memory arrays in the system memory. We assume the same computations (i.e., individual operations) applied to a row are also applied in parallel, in every cycle, across all the rows (ROW) of a memory array. This parallelism is made possible by the 2D structure of the memory arrays and by reuse of the voltage signals used to operate an individual row for all the rows. Although the choice to only process row-wise may seem restrictive, it naturally maximizes the data-level parallelism and hence PIM throughput. Moreover, the multiple memory arrays (MAT) further maximize this parallelism. Finally, the cycle time, CT, of a single basic PIM operation also impacts overall PIM performance: the shorter it is, the faster the processing.
Fig. 1 shows how the bit length (n) of the input data affects the number of computing cycles required for PIM-based processing. This number depends on both the data size and the operation type (different operations follow different curves on the graph). With this model, for example, n-bit AND requires 3n cycles (e.g., for n=16 bits, AND takes 16x3 = 48 cycles), ADD requires 9n cycles (an algorithmic optimization that uses four-input NOR instead of two-input NOR can reduce this count), and multiply (MPY) requires 13n^2 - 14n cycles [8]. We define the operation complexity parameter (OC), for a given operation type and data size, as the number of cycles required to process the corresponding data.
The throughput of PIM is captured by four parameters: ROW, MAT, OC, and CT (see Table I). The throughput of the system, in operations per second, can be expressed as:
Throughput_PIM = (ROW x MAT) / (OC x CT)    (1)
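For concreteness, Eq. 1 can be read as the following minimal Python sketch (the helper name is ours, not part of the paper's tooling; parameter values follow Table I):

```python
def throughput_pim(row, mat, oc, ct_ns):
    """Operations/second per Eq. 1: Throughput_PIM = (ROW x MAT) / (OC x CT)."""
    return (row * mat) / (oc * ct_ns * 1e-9)

# Example (values from Table I): 16-bit ADD with OC = 144 on 1024 MATs of
# 1024 rows each, at CT = 10 ns, yields roughly 728 GOPS.
```

Doubling MAT doubles throughput, reflecting the array-level parallelism described above.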
Placement and Alignment Complexity. PIM imposes certain constraints on data alignment and placement [22]. To align the data for subsequent row-parallel operations, a series of data alignment and placement steps may be needed. The number of cycles needed to perform these additional steps is captured by the placement and alignment complexity parameter, denoted as PAC. Currently, for simplicity, we focus on modeling the cost of intra-array data movements and assume that multiple memory arrays continue to operate in parallel and independently. We have observed that PIM performance is quite sensitive to intra-array data movements (Section 3). In the future, we plan to refine the model to cover inter-array data movements.
The following expression extends Eq. 1 to account for the presence of unaligned and misplaced data elements:
Throughput_PIM = (ROW x MAT) / ((OC + PAC) x CT)    (2)
The PAC cycles can, in turn, be broken down into a series of vertical, column-parallel moves and horizontal, row-parallel moves that bring the data in a memory array to the desired locations. While the vertical moves serve to correct data element misplacement, the horizontal moves take care of unaligned data elements. Given HM and VM as the total number of horizontal and vertical moves needed, respectively, PAC equals ‘HM + VM’. The horizontal moves are performed bit-by-bit for a given data element, and hence, their count is typically proportional to the size of the data element involved. In most cases, the same alignment is applied to all data elements; thus, the same bit moves in parallel in all rows. When the involved data elements across different rows are not aligned, separate horizontal moves must be made individually for each data element (increasing the cost). A vertical move for a given data element, on the other hand, is parallelizable. However, to cover the many data elements distributed across the rows, many such vertical moves need to be performed serially.
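Assuming PAC decomposes into horizontal plus vertical move cycles (HM + VM), Eq. 2 can be sketched as follows (helper name ours):

```python
def throughput_pim_pac(row, mat, oc, hm, vm, ct_ns):
    """Operations/second per Eq. 2, where PAC = HM + VM move cycles."""
    pac = hm + vm  # horizontal (alignment) + vertical (placement) moves
    return (row * mat) / ((oc + pac) * ct_ns * 1e-9)

# With a perfect data layout (hm = vm = 0), this reduces to Eq. 1.
```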
Table I: Bitlet model parameters.

Parameter name                          | Notation  | Value(s)             | Type
----------------------------------------|-----------|----------------------|------
PIM operation complexity                | OC        | 1 - 32k cycles       | Algo.
PIM placement and alignment complexity  | PAC       | 0 - 1024x1024 cycles | Algo.
PIM cycle time                          | CT        | 10 ns [15]           | Tech.
PIM array dimensions                    | ROW x COL | 1024 x 1024          | Tech.
PIM array count                         | MAT       | 1k - 16k             | Arch.
PIM energy for op (OC = 1)              | E_PIM     | 0.1 pJ [15]          | Tech.
CPU memory bandwidth                    | BW        | 1 - 16 Tbps          | Arch.
CPU data in-out bits                    | DIO       | 24, 48               | Algo.
CPU energy per bit transfer             | E_CPU     | 15 pJ [3]            | Tech.
As an example, if a(i), b(i), and c(i) are three data element vectors inside a memory array and the computation requires performing ‘a(i) = b(i+1) + c(i)’, then b in this case is unaligned and also misplaced. For this scenario, each b(i+1) is relocated to a temporary t(i) through multiple horizontal moves and a single vertical move. Only after the relocations is the actual computation, which in this case is a(i) = t(i) + c(i), performed. To relocate b(i+1) to t(i), firstly, n horizontal moves occur, ensuring the alignment of all b(i+1), each of size n; these are followed by as many vertical moves as the row count (ROW - 1 moves occur within the MAT and 1 out of it), which takes care of the misplacement of each individual b(i+1) inside each row. Therefore, in this scenario, the PAC is n + ROW cycles.

Energy Efficiency. The maximum throughput for PIM or the CPU is limited by the thermal design power (TDP). For PIM, the throughput depends on the energy per unit of computation, i.e., the energy spent on a single computation cycle (E_PIM, defined for OC = 1). Building on Eq. 2, we quantify the power-limited (PL) throughput as follows:
PL-Throughput_PIM = min(Throughput_PIM, TDP / (E_PIM x (OC + PAC)))    (3)
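A minimal sketch of Eq. 3, under the assumption that one operation costs E_PIM x (OC + PAC), i.e., the single-cycle energy times the cycle count (helper name ours):

```python
def pl_throughput_pim(row, mat, oc, pac, ct_ns, e_pim_pj, tdp_w):
    """Power-limited operations/second per Eq. 3: raw throughput (Eq. 2)
    capped by TDP over the energy of one (OC + PAC)-cycle operation."""
    raw = (row * mat) / ((oc + pac) * ct_ns * 1e-9)
    cap = tdp_w / ((oc + pac) * e_pim_pj * 1e-12)
    return min(raw, cap)

# At 20 W and OC = 1, the cap is 20 W / 0.1 pJ = 2e14 ops/s; the raw
# throughput reaches this cap at roughly 1950 MATs of 1024 rows each.
```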
Table I summarizes the PIMrelated parameters of the Bitlet model. For conceptual clarity and to aid our analysis, we designate three parameter types: technological, architectural, and algorithmic. Typical values, or the ranges for the different parameters, are also listed in the table.
2.2 Deriving CPU Throughput
Given the objective of the Bitlet model to assess the affinity of workloads or workload phases to PIM versus CPU, the model focuses on workloads (or workload phases) with high memory intensity and relies on a relatively simple CPU model. The overall distinction in modeling between PIM and CPU is described below.
For the workload phases considered in the Bitlet litmus test, PIM-based computations (as outlined in Section 2.1) occur inside the memory arrays, without any data transfers outside the memory arrays; that is, they are limited by operation complexity and by the data placement and alignment costs. On the other hand, we assume that the CPU throughput is primarily limited by its usage of external memory bandwidth, i.e., by the cost of data transfers between the CPU and memory, ignoring the cost of computations and data movements performed within the CPU itself.
Data Transfer. The Bitlet model, therefore, derives the CPU throughput assuming the memory bandwidth between the CPU and the memories, together with the amount of data transferred per operation, as the primary limiting factors. Large amounts of data transferred between the CPU and the memory result in lower CPU throughput, while smaller volumes produce the opposite effect. The extent of data transfer between the CPU and the memory is captured by the data in-out (DIO) model parameter. DIO is the average amount of data transferred per operation and must account for all the data transfers (in bits) between the CPU and the memory resulting from inputs, outputs, as well as any temporary results. Along with DIO, the external memory bandwidth (denoted as BW, and possibly dependent on the number of memory channels) between the CPU and the memory determines the final throughput. The CPU throughput, in operations per second, is defined as:
Throughput_CPU = BW / DIO    (4)
To support a broader analysis across all types of workloads, including phases with high CPU arithmetic intensity, a more accurate CPU model would be useful. One possibility is the inclusion of a maximum arithmetic throughput term in Eq. 4, similar to the arithmetic intensity limit of Roofline [23]. We leave extending the Bitlet model with more detailed CPU-side modeling for future work.
Energy Efficiency. On the CPU front, the energy per bit transferred between the CPU and the memory (denoted as E_CPU and also listed in Table I) determines the efficiency of CPU computations. We assume that the CPU compute energy is significantly lower than the data transfer energy. This aligns with our focus on identifying the strengths of PIM rather than those of the CPU. The power-limited throughput for CPU computation is expressed as:
PL-Throughput_CPU = min(BW / DIO, TDP / (E_CPU x DIO))    (5)
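Eqs. 4 and 5 admit an equally small sketch (helper names ours; we treat 1 Tbps as 1024 Gbps, which matches the arithmetic used in Section 3, e.g., 16 Tbps / 24 = 682 GOPS):

```python
TBPS = 1024e9  # bits/second; binary-prefixed to match the paper's numbers

def throughput_cpu(bw_bps, dio_bits):
    """Operations/second per Eq. 4: memory bandwidth over bits moved per op."""
    return bw_bps / dio_bits

def pl_throughput_cpu(bw_bps, dio_bits, e_cpu_pj, tdp_w):
    """Power-limited operations/second per Eq. 5: bandwidth-limited rate
    capped by TDP over the energy to transfer DIO bits."""
    return min(bw_bps / dio_bits, tdp_w / (dio_bits * e_cpu_pj * 1e-12))
```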
Table I summarizes the CPU-related parameters, including the typical values or value ranges they are set to. We vary the memory bandwidth parameter from 1 to 16 Tbps to show the sensitivity of the model to memory bandwidth.
3 Applying the Bitlet Model
In this section, we apply the Bitlet model. We start by comparing the throughput of basic operations for PIM versus the CPU and then proceed to compare PIM to the CPU under a wider parameter design space.
3.1 PIM vs. CPU  Basic Operations
Below we discuss a few examples to illustrate the use of the Bitlet model. Note that although we only compare PIM to the CPU, the model assumptions and the comparisons can be easily extended to GPUs as well.
PIM (16-bit) ADD, OR and MPY. Consider an ADD operation that adds two 16-bit inputs and produces a 16-bit output, assuming all data elements are perfectly aligned. This operation on a data element takes 144 cycles (OC = 144, i.e., 9n where n = 16). Assuming there are 1024 MATs and each MAT supports 1024 data elements (rows = # data elements), the achieved throughput = (1024x1024)/(144x10ns) = 728 GOPS. Now consider a 16-bit OR operation that has two 16-bit inputs and produces a 16-bit output. In this case, OC = 32 (2n, where n = 16) and the throughput = (1024x1024)/(32x10ns) = 3276 GOPS. Finally, consider a 16-bit MPY (multiplication) producing a 32-bit result. In this case, OC = 3104 (13n^2 - 14n, where n = 16). Here, the throughput is (1024x1024)/(3104x10ns) = 33 GOPS. For low-precision multiplication that produces only a 16-bit output, OC = 1544 and the throughput is (1024x1024)/(1544x10ns) = 67 GOPS.
CPU (16-bit) ANY. We consider ‘any’ binary operation that operates on two 16-bit inputs and produces a 16-bit output (e.g., 16-bit ADD, 16-bit OR, and 16-bit MPY with low precision). The DIO is thus (16x2+16) = 48 bits (for two 8-bit inputs and one 8-bit output, DIO = 24). For any of these operations, the effective throughput of the CPU is 4Tbps/48 = 85 GOPS. For an OR operation, the CPU is inferior to PIM, which benefits, in this scenario, from lower operation complexity, high data parallelism, and obliviousness to external memory bandwidth. For MPY, on the other hand, PIM is inferior to the CPU due to the higher operation complexity. If the memory bandwidth is reduced to 1024 Gbps, the CPU throughput becomes 1Tbps/48 = 21 GOPS for any 16-bit binary operation with a 16-bit output. Since memory bandwidth is the main limiter here, CPU throughput becomes worse than PIM even for MPY.
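The litmus test above can be reproduced with a short Python sketch (function names are ours; the cycle counts are the 2n, 9n, and 13n^2 - 14n formulas quoted in Section 2.1, and 1 Tbps is taken as 1024 Gbps to match the paper's arithmetic):

```python
# Cycle counts (OC) for two-input NOR (MAGIC) based PIM; names are ours.
def oc_or(n):  return 2 * n                 # n-bit OR
def oc_add(n): return 9 * n                 # n-bit ADD
def oc_mpy(n): return 13 * n * n - 14 * n   # n-bit MPY, full-width product

def pim_gops(oc, row=1024, mat=1024, ct_ns=10):
    """PIM throughput in GOPS per Eq. 1 (ops per nanosecond = GOPS)."""
    return (row * mat) / (oc * ct_ns)

def cpu_gops(bw_gbps, dio):
    """CPU throughput in GOPS per Eq. 4."""
    return bw_gbps / dio

# 16-bit litmus test at BW = 4 Tbps (4096 Gbps), DIO = 48:
# OR:  PIM ~3276 GOPS vs. CPU ~85 GOPS -> PIM favorable
# MPY: PIM ~33 GOPS   vs. CPU ~85 GOPS -> CPU favorable
```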
3.2 PIM vs. CPU  Impact of Model Parameters
PIM’s throughput is sensitive to various Bitlet model parameters. In this section, a sensitivity study performed over these model parameters highlights some of the strengths and weaknesses of the new PIM paradigm.
Operation Complexity Impact. Fig. 2 shows the throughput of PIM versus that of the CPU, assuming PAC = 0. Diagonal lines represent PIM with varying MAT counts (set to 1/16/256/1024/4096/16384). A single 1024x1024 memory array has a 128 KB capacity. Horizontal lines are for CPUs with varying DIO bits (set to 24/48) along with BW = 1Tbps/4Tbps/16Tbps.
Using Eq. 1, we observe that PIM throughput increases with MAT availability, peaking when all available memory arrays are used (MAT = 16k) and the operation complexity is the lowest possible (OC = 1). Conversely, PIM throughput decreases with increasing operation complexity. We also see that the CPU throughput decreases with higher DIO. For instance, consider the lines shown for DIO = 24 and DIO = 48 at the same BW = 1Tbps. The CPU’s performance for DIO = 48 is lower than for DIO = 24.
For a configuration of MAT = 1024, DIO = 24 and BW = 4Tbps, the CPU performs better than PIM at OC = 612 or higher. This marks the crossover point and sets the boundaries of a favorable region for PIM for this configuration. Note the placement of the OR, ADD and MPY operations shown in Fig. 2 along the x-axis. Clearly, OR (OC = 32) and ADD (OC = 144) are located to the left of the crossover point and MPY (OC = 3104) is to the right. The left region is where PIM is superior, and the right region is where the CPU is superior.
The crossover point shifts to the right for different DIO values. For instance, for MAT = 1024 and BW = 1Tbps, the crossover point shifts roughly from OC = 2500 to OC = 5000 as DIO grows from 24 to 48. Thus, it is the algorithmic interplay of OC and DIO (along with other technological and architectural factors) that determines the throughput of PIM relative to that of CPU computing.
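The crossover OC follows directly from equating Eq. 1 with Eq. 4; a sketch under the same PAC = 0 assumption (helper name ours; 1 Tbps taken as 1024 Gbps):

```python
def crossover_oc(row, mat, ct_ns, bw_bps, dio_bits):
    """OC at which PIM (Eq. 1) and CPU (Eq. 4) throughputs match:
    solve ROW*MAT/(OC*CT) = BW/DIO for OC."""
    return (row * mat * dio_bits) / (bw_bps * ct_ns * 1e-9)

# MAT = 1024, DIO = 24, BW = 4 Tbps -> OC ~ 614, consistent with the
# crossover of ~612 read off Fig. 2.
```

Doubling DIO doubles the crossover OC, which is exactly the shift from ~2500 to ~5000 noted above.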
Placement and Alignment Complexity Impact. Under perfect data layout, PAC = 0, and therefore there is no PIM performance loss due to placement and alignment issues. Sometimes, however, either horizontal or vertical moves, or both, become necessary for placement and alignment reasons, resulting in some performance loss. Below we discuss the nature and the magnitude of these losses, captured through the PAC parameter and Eq. 2.
Consider the previous example ‘a(i) = b(i+1) + c(i)’ (assuming 16-bit addition), which necessitates both horizontal and vertical moves. In this example, PAC = 16 + 1024 = 1040 and, following Eq. 2, the throughput is (1024x1024)/((144+1040)x10ns) = 88 GOPS. Effectively, the throughput shrinks to 12% of its original value of 728 GOPS (as per Eq. 1). Now consider a scenario requiring just the horizontal moves for alignment. In this case, prior to the additions, 16 horizontal moves are needed, i.e., PAC = 16. The throughput is then 655 GOPS, just a 10% performance loss relative to 728 GOPS. Finally, if S subsets of data elements have to be aligned separately, the horizontal move count grows proportionally to S; a higher S implies higher losses in throughput. With increasing PAC costs, there is a lower impetus for processing using PIM (instead of processing on the CPU) unless the cost is amortized over time. Additionally, the tradeoffs may shift with the number of rows in a memory array and with any additional hardware support available for fast relocation (e.g., parallel vertical moves). Finally, while Fig. 2 showed CPU vs. PIM tradeoffs assuming PAC is zero, a higher PAC, just like a higher OC, will adversely impact PIM’s performance.
Energy Efficiency Impact. As shown in Fig. 3, and based on Eq. 3, a maximum of 1950 MATs can be accommodated for PIM within a power envelope of 20W. Increasing the number of MATs beyond that does not increase the throughput, since the power budget of the system is the main limiter. At 40W, for example, up to 3900 memory arrays (MATs) can be active at any given time.
For the CPU, the energy cost of data transfer limits PL-Throughput_CPU. Here, we assume BW = 16Tbps. With a power limitation of 20W, the CPU delivers 55 GOPS at DIO = 24. At a power budget of 40W, 111 GOPS are possible, and 444 GOPS at 160W. Compare this against the raw (no power limitation) CPU throughput, which is 682 GOPS at DIO = 24.
The values of the energy parameters E_PIM and E_CPU affect the relative energy efficiency of PIM versus the CPU. For example, consider the case of a single-bit NOR operation, where OC = 1 (a single MAGIC operation) and DIO = 3 (2 input bits and 1 output bit). In this case, PIM consumes 1 x E_PIM = 0.1pJ while the CPU consumes 3 x E_CPU = 45pJ, i.e., approximately 450x more than PIM. However, as OC increases, the relative efficiency of PIM decreases. At DIO = 48, the CPU spends 48 x 15pJ = 720pJ per operation, so in the limiting case of OC = 7200 or higher (720pJ/0.1pJ = 7200), PIM becomes less attractive than the CPU with respect to energy efficiency. Note that different energy parameter values will shift the relative merits of PIM and the CPU.
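The per-operation energy comparison above can be sketched as follows (helper names ours; we assume PIM energy scales linearly with cycle count and CPU energy with bits transferred, as the model states):

```python
def pim_energy_pj(oc, pac=0, e_pim_pj=0.1):
    """Energy per PIM operation: (OC + PAC) cycles x single-cycle energy."""
    return (oc + pac) * e_pim_pj

def cpu_energy_pj(dio_bits, e_cpu_pj=15):
    """Energy per CPU operation: bits transferred x energy per bit."""
    return dio_bits * e_cpu_pj

# Single-bit NOR: PIM 0.1 pJ vs. CPU 45 pJ (~450x); at DIO = 48 the
# break-even complexity is 720 pJ / 0.1 pJ = 7200 cycles.
```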
The above examples assume PAC = 0 for simplicity, but in reality a nonzero PAC will lead to higher energy consumption per operation, effectively reducing PIM’s energy efficiency advantage over the CPU. We leave further analysis of these and other model parameters for future work.
4 Conclusions
This paper motivates and describes Bitlet, an analytical model for PIM. We show how to use the model to find the cases in which PIM is beneficial and to understand the related tradeoffs and limits, in a parameterized fashion. We hope the model will shed more light on the new bitwise PIM paradigm.
5 Acknowledgement
This work was supported by the European Research Council through the European Union’s Horizon 2020 Research and Innovation Programme under Grant 757259 and by the Israel Science Foundation under Grant 1514/17.
References
 [1] D. Bhattacharjee et al., ReVAMP: ReRAM based VLIW arch. for in-mem. comp., DATE 2017.
 [2] J. Borghetti et al., Memristive switches enable ‘stateful’ logic operations via material implication, Nature, vol. 464, pp. 873–876, 2010.
 [3] M. O’Connor et al., Fine-grained DRAM: energy-eff. DRAM for extr. band. sys., MICRO 2017.
 [4] C. Eckert et al., Neural cache: Bit-serial in-cache acc. of deep neural netw., ISCA 2018.
 [5] H. Esmaeilzadeh et al., Dark silicon and the end of multicore scaling, IEEE Micro, vol. 32, no. 3, 2012.
 [6] D. Fujiki et al., In-memory data parallel processor, ASPLOS 2018.
 [7] J. L. Gustafson, Reevaluating Amdahl’s law, Commun. ACM, vol. 31, no. 5, 1988.
 [8] A. Haj-Ali et al., IMAGING: In-mem. algo. for image proc., IEEE TCAS I, vol. 65, no. 12, 2018.
 [9] A. Haj-Ali et al., Not in name alone: A memristive mem. proc. unit for real in-mem. processing, IEEE Micro, vol. 38, no. 5, 2018.
 [10] M. Hill et al., Amdahl’s law in the multicore era, IEEE Computer, vol. 41, no. 7, 2008.
 [11] M. Hill et al., Gables: A roofline model for mobile SoCs, HPCA 2019.
 [12] M. Imani et al., Ultra-Efficient Processing In-Memory for Data Inten. App., DAC 2017.
 [13] S. Kvatinsky et al., MAGIC: memristor-aided logic, IEEE TCAS II, 2014.
 [14] S. Kvatinsky et al., Memristor-based mater. impl. (IMPLY) logic, TVLSI, vol. 22, no. 10, 2014.
 [15] M. Lanza et al., Recom. methods to study resistive switching dev., Adv. Elec. Materials, vol. 5, no. 1, 2019.
 [16] Y. Levy et al., Logic ops. in mem. using a memris. Akers arr., Microelec. J., vol. 45, no. 11, 2014.
 [17] E. Linn et al., Beyond von Neumann logic operations in passive crossbar arrays alongside memory operations, Nanotechnology, vol. 23, no. 30, 2012.
 [18] P. Ranganathan, From micro. to nanostores: Rethinking data-centric sys., IEEE Comp., vol. 44, no. 1, 2011.
 [19] S. Raoux et al., Phase-change random access mem.: A scalable tech., IBM J. Res. Dev., vol. 52, no. 4, 2008.
 [20] S. Seshadri et al., Willow: A user-programmable SSD, OSDI 2014.
 [21] V. Seshadri et al., Ambit: In-mem. acc. for bulk bitwise ops. using commodity DRAM tech., MICRO 2017.
 [22] N. Talati et al., Practical chal. in delivering the promises of real PIM machines, DATE 2018.
 [23] S. Williams et al., Roofline: An insightful vis. perf. model for multicore arch., CACM, vol. 52, no. 4, 2009.
 [24] H. S. P. Wong et al., Metal oxide RRAM, Proc. IEEE, vol. 100, no. 6, 2012.