Eva-CiM: A System-Level Performance and Energy Evaluation Framework for Computing-in-Memory Architectures

01/27/2019 · Di Gao, et al. · Zhejiang University

Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory, thereby relieving the pressure of the memory wall. However, determining whether a given workload can really benefit from CiM, and which memory hierarchy level and device technology a CiM architecture should adopt, requires in-depth study that is not only time-consuming but also demands significant expertise in architectures and compilers. This paper presents an energy evaluation framework, Eva-CiM, for systems based on CiM architectures. Eva-CiM encompasses a comprehensive multi-level (from device to architecture) tool chain by leveraging existing modeling and simulation tools such as GEM5, McPAT [39] and DESTINY [33]. To support high-confidence prediction, rapid design space exploration and ease of use, Eva-CiM introduces several novel modeling/analysis approaches, including models for capturing memory access and dependency-aware ISA traces and for quantifying interactions between the host CPU and CiM modules. Eva-CiM can readily produce energy estimates of the entire system for a given program, a processor architecture, and the CiM array and technology specifications. Eva-CiM is validated by comparing with DESTINY [33] and [23], and enables findings including the practical contribution of CiM-supported accesses, CiM-sensitive benchmarking, and the pros and cons of increased memory size for CiM. Eva-CiM also enables exploration over different configurations and device technologies, showing 1.3-6.0X energy improvement for SRAM and 2.0-7.9X for FeFET-RAM, respectively.


I Introduction

With the rapid growth of the Internet of Things (IoT), the era of "Big Data" is upon us, featuring massive data transfers between processor and memory [29]. The efficiency of the conventional Von Neumann architecture is severely restricted by its limited bandwidth and increasingly complex interconnects, which result in significant energy and latency overheads for data movement. For instance, the energy spent on transferring 256 bits from main memory to the processor is estimated to be 200× higher than the energy for one floating point operation [12].

Researchers have long been aware of the inefficiency of conventional architectures for data movement [11, 30], and have spent significant effort on bringing computation closer to (or even inside) memory (e.g., [14, 13, 16]). Such architectural designs, which place logic and memory close to each other, are often referred to as near-memory computing (NMC), processing in memory (PIM), or computing in memory (CiM). NMC reduces the energy and latency associated with memory accesses by placing processing units close to the memory. PIM and CiM, terms often used interchangeably in the literature, refer to architectures that integrate certain logic and arithmetic operations directly in either the memory cells or the memory peripherals, in order to lower the number of memory references made by the processor. For conciseness, from this point on we simply use CiM to refer to the PIM/CiM architectures defined above.

Recent works (e.g., [23, 19, 17, 32, 36, 35, 18, 31, 20, 21, 24, 37]) in both CMOS SRAM and emerging non-volatile memories (NVMs) have demonstrated various CiM designs at different levels of the memory hierarchy. These designs allow computation to occur exactly where the data resides, thereby reducing the energy and performance overheads associated with data movement. For example, the cache-based CiM in [20] achieves 2.4-9× energy saving on text processing scenarios. Meanwhile, NVM-based designs such as [23, 31, 24] can improve energy saving by up to two orders of magnitude when functioning as co-processors on neural network benchmarks.

While CiM is found to be a powerful and promising alternative with various design options, such variety also complicates the design process. Designers are confronted with several important questions when designing CiM:

  • How much can an application program benefit from a CiM based system?

  • At which level of memory hierarchy should one place the CiM?

  • Which technology should be used for CiM?

Some prior efforts attempt to address the above questions. However, they suffer from limitations in several aspects.

Overall system evaluation: Most CiM works ([24, 4, 3]) focus on the CiM module without considering the host CPU, or use an emulation platform consisting of a simple host CPU [23]. Interactions between the host and the CiM module, as well as the complete memory system, can be rather complex, and their impact on the energy/performance of the overall system can be quite significant [44].

Offloading candidate identification: An offloading candidate here refers to a code snippet, a function or an instruction that can be offloaded from the host CPU to the CiM module for execution. Most prior solutions provide neither instruction set architecture (ISA) nor compiler support to automatically determine offloading candidates. Designers have to either manually identify the code snippets for CiM from the entire benchmark (e.g., [35, 31, 20]), or select specific instruction groups for offloading to the CiM unit for execution (e.g., [23, 24]). In the latter two works, memory accesses triggered by a CiM module with custom instructions are identified at compile time, and two operands fetched from the same level of memory are replaced by one CiM instruction. The method cannot be generalized to systems with multi-level caches, as it assumes ideal locality and dependence.

Multi-level modeling: CiMs based on different devices, circuits and micro-architectures have been proposed. However, there is no uniform framework to compare design options across these levels. Though some existing work such as [24] has compared CiMs implemented with different technologies, the comparisons are hand-crafted and cannot be easily adapted to different memory hierarchies.

In this paper, we present an architectural evaluation framework, Eva-CiM, that overcomes the above limitations and is able to reliably predict the energy consumption and performance of a system containing a CiM module. The major contributions of this work are summarized below.

We propose a novel trace-driven analysis method to extract data dependencies and identify offloading candidates. The method is built on an instruction dependency graph model augmented with memory access information. The analyzer is integrated into GEM5 [38] and hence can readily work with different architectures, compilers and development options.

We leverage a comprehensive tool chain from device to architecture to build a multi-level CiM model. We employ GEM5 as the backbone of the framework to fully capture the effects of both the host CPU and the complete memory hierarchy. We further design and embed a probe-based simulation inside GEM5 to collect the necessary information for offloading candidate selection at the application layer. We extend McPAT [39] by including a CiM module obtained from SPICE [42] and DESTINY [33] to provide architecture-level energy profiling capability. Eva-CiM is validated by comparing with DESTINY [33] and [23].

We employ Eva-CiM to investigate the three questions raised earlier. Unlike prior works, which typically assume ideal data locality and regular memory accesses, Eva-CiM is able to find operations that are offloadable to CiM under realistic architecture and compiler settings, and thus avoids being overly optimistic. Furthermore, we use Eva-CiM to quantitatively estimate the system speedup and to evaluate the energy saving of CiM, which stems not only from the reduced memory accesses but also from the lower computational load on the host. Last but not least, we conduct design space explorations across different technologies and system configurations to illustrate the design options that maximize the CiM benefits for a set of benchmarks.

Eva-CiM enables us to make the following findings that either differ from or have not appeared in the conclusions of prior works: (i) In a general purpose CiM based system with a complete memory hierarchy, the number of possible CiM-supported accesses is similar to that of regular accesses; (ii) Data-intensive benchmarks are not necessarily always CiM-sensitive; the sensitivity depends on both benchmark characteristics and system architecture; (iii) Energy-wise, a larger memory size is not necessarily helpful for CiM due to the increased energy per CiM operation.

II Background and Related Works

Below, we review the background on CiM and existing works on CiM modeling and estimation.

II-A Computing in Memory

To address the performance gap between processing and memory access, there have been significant efforts aiming at bringing computation closer to memory. Earlier works, e.g., [14, 13], focused on devising architectures that combine processing cores with dynamic random-access memory (DRAM) modules. These architectures generally belong to the category of NMC [43]. However, practical concerns regarding the successful integration of DRAM and processing units into the same chip hindered the advancement of such NMC systems for many years. Recently, the advent of 3D-stacked memories using massive through-silicon vias (TSVs) provides larger bandwidth while allowing the integration of logic and memory into a stacked chip ([16, 19, 17, 18, 15]).

To bring processing and memory even closer, CiM, where processing is done in the memory array itself, has been gaining a lot of attention recently in both academia and industry. This growing interest is largely attributable to the needs of data-intensive IoT applications and to advances in circuits and device technologies [46, 47]. Many design alternatives exist for CiM, varying in circuit style, supported operations, device technology, location in the memory hierarchy, application targets, etc. The most extreme design of CiM is to embed logic operations within each memory cell [9, 10]. Though such designs eliminate the memory access overhead, their negative impact on memory density prevents them from being widely employed. Another CiM design style modifies the peripheral circuitry of the memory array (either SRAM or DRAM) to realize logic and arithmetic operations. This design style offers a good balance between memory density and processing efficiency. For example, some works propose to modify the peripheral circuitry, e.g., the sense amplifiers (SAs) of caches, to enable CiM [20, 21], while others accomplish CiM by supporting bulk bit-wise operations using the features of DRAM [34].

Progress in emerging non-volatile devices is further fueling the development of CiM. Specifically, non-volatile resistive RAMs (ReRAMs), phase change memory (PCM), spin-transfer-torque magnetic RAMs (STT-MRAMs), and ferroelectric field effect transistor-based RAMs (FeFET-RAMs) offer high density, good scalability, and low power, making them natural candidates for realizing CiM memory architectures. For instance, there have been a number of recent efforts investigating CiM-capable NVRAMs (employed as either cache or main memory) for various applications. References [23, 24, 22] study the use of NVM with a re-designed SA to perform a subset of logic and arithmetic operations. In [25, 26], NVMs are used in content addressable memory (CAM) to support parallel search while reducing data transmission in data-intensive IoT applications. Many recent works also employ NVM-based circuits for neural network acceleration by directly executing matrix-vector multiplication [45] within the memory array [31, 27, 28]. Reference [1] further improves energy efficiency for neural network training and testing through the implementation of a fully-digital scalable CiM architecture.

Fig. 1: Overview of the Eva-CiM framework: data flow, tool chain and architecture.

This paper focuses on systems that contain a host CPU and a CiM module that can be placed at any level of the cache hierarchy. Furthermore, the CiM module can be implemented in different technologies/styles and can support different instruction sets.

II-B Related Work

Many system-level simulators, such as GEM5 [38], zsim [41], Sniper [40], etc., only cover architectural details for general purpose processor simulation. On the other hand, some existing CiM efforts have attempted to compare different CiM design options by evaluating the energy/performance of the CiM module or accelerator alone [23, 17, 24, 4, 3]. In all cases, the focus is on estimating energy savings due to (i) a lower number of memory accesses in CiM-enabled systems, and (ii) the inherently high internal bandwidth of the memory architecture. Though these comparisons are important in understanding the pros and cons of different CiM designs, they cannot predict the overall benefit offered by CiM based systems, since they consider neither how many instructions can actually be offloaded to the CiM module nor the effect such offloading has on the host CPU.

Recent works [23, 35] have tried to estimate the benefits of CiM for data-intensive applications by using custom CiM instructions. The work in [23] extends the original ISA of the Intel Nios II processor [8] with custom CiM instructions. The memory module is assumed to be a small scratchpad memory (SPM). At compile time, the memory accesses of given application benchmarks are categorized into (i) writes, (ii) non-convertible reads, and (iii) CiM-convertible reads, i.e., reads triggered by CiM instructions. The evaluation assumes that every two reads can be effectively replaced by one CiM instruction. Although the approach provides good insight into the benefits of CiM for systems with a single-level non-cacheable memory, issues like memory hierarchy and locality of data are not taken into consideration. Furthermore, the impact of CiM instructions on the host processor is not studied. Therefore, the method may not generalize to estimating the overall system-level CiM benefit for most real-world systems that leverage multi-level caches.

As an alternative, the work in [35] implements a set of custom instructions in an x86-64 architecture. Different from [23], multi-level caches are considered in the evaluation. Furthermore, the work proposes taking data locality into consideration in order to determine whether it is worthwhile to offload potential CiM instructions to the memory unit. Note that the in-memory operations are invoked by atomic instructions specific to the hybrid memory cube (HMC) model integrated into the system. However, instead of examining the memory access breakdown to define the instructions offloaded to the CiM module, the method assumes that the system designer has enough knowledge about the application and can manually insert CiM-enabled macros into the appropriate code snippets. An obvious limitation of this approach is that it does not offer a systematic way to locate all the possible places where CiM-enabled macros could be inserted, which inevitably underestimates the benefits of CiM. Different from most current works, [20] explores CiM in three levels of an SRAM cache hierarchy and completes the control flow inside the cache in the absence of data locality. However, its limitation is the same as that of [35]: it requires customized benchmarks for data locality.

Our work aims to address the need for a framework that helps designers predict how the choice of CiM design options affects the overall system (including both the host and the CiM module) energy and performance for a given application. We leverage several existing memory and micro-architecture modeling tools. (Note that these modeling tools focus on either one particular layer of memory or general microprocessors, and thus cannot be easily extended for system-level evaluation of CiM based systems.) Specifically, we use DESTINY [33] to estimate energy at the array level for the L1/L2 cache levels, modifying it to support the particularities of specific CiM designs, e.g., customized sense amplifiers [20, 24] and memory cells [24]. DESTINY is an open-source tool for simulating 2D and 3D memory arrays, which utilizes the 2D modeling framework of NVSim [7] for SRAM and NVMs, and the 3D framework of CACTI-3DD [2]. Besides, McPAT [39], an integrated power, area, and timing modeling tool, is modified and used to evaluate different components (e.g., CiM, core, caches) at the architecture level.

III Overview of Eva-CiM

Our proposed Eva-CiM framework adopts a combined simulation and analysis approach to accomplish performance and energy estimation for an entire CiM based system. Besides leveraging several existing architecture and circuit simulators, Eva-CiM builds its own models at different design levels. Fig. 1 depicts the overall flow, structure and tool chain of Eva-CiM, which consists of three stages: modeling, analysis and profiling. Eva-CiM takes as input (represented by the orange boxes) the binary of a given application and the device and CiM array parameters for the CiM module, and outputs the overall evaluation results. The simulation and analysis tools used in Eva-CiM are shown as grey boxes. The current version of Eva-CiM supports two technologies, SRAM and FeFET, while more technologies can be readily added later.

Fig. 2: GEM5 simulation and specialized probes to extract information.

The modeling stage in Eva-CiM aims to construct the models used in the analysis stage. Specifically, Eva-CiM uses two models: the application model and the device/CiM array model. The application model captures when and where instructions are executed and memory accesses occur. This information is used in the analysis stage to determine offloading candidates. At the application level, Eva-CiM can take any binary compiled by GEM5-compatible general-purpose or customized compilers [38]. The benchmark binaries are fed to a modified GEM5 and go through fetching, decoding and commit, with specialized probes extracting the pipeline and memory access information, as illustrated in Fig. 2 (details in Section V-A).

The device/array model describes the energy consumed by individual CiM operations, such as CiM-OR (details in Section V-B). This can be obtained by SPICE simulation [42] if the netlist is available; alternatively, users can break down an atomic CiM operation into its micro-operations (cell access, SA sensing, etc.) and then use DESTINY [33] with pre-calibrated energy data to compute the energy of the atomic operation. Unlike the application-level simulations, the device/array-level simulation is conducted once per technology to extract the models.
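To make this composition concrete, the following minimal Python sketch assembles a per-operation energy from pre-calibrated micro-operation energies. The numbers and the names MICRO_OP_ENERGY_PJ and CIM_OP_BREAKDOWN are hypothetical placeholders for illustration, not Eva-CiM's actual calibration data.

# Hypothetical per-micro-operation energies (pJ); real values would come
# from HSPICE (cells, SAs) and DESTINY (array) characterization.
MICRO_OP_ENERGY_PJ = {
    "cell_access": 28.0,  # bit-line/word-line activation per operand row
    "sa_sense": 5.0,      # customized sense-amplifier evaluation
    "adder": 7.0,         # full-adder stage appended to the SA
}

# An atomic CiM operation modeled as a bag of micro-operations.
CIM_OP_BREAKDOWN = {
    "CiM-OR": {"cell_access": 2, "sa_sense": 1},
    "CiM-ADDW32": {"cell_access": 2, "sa_sense": 1, "adder": 1},
}

def cim_op_energy(op):
    # Energy (pJ) of one atomic CiM operation: sum of its micro-operations.
    return sum(n * MICRO_OP_ENERGY_PJ[m] for m, n in CIM_OP_BREAKDOWN[op].items())

print(cim_op_energy("CiM-ADDW32"))  # 68.0 pJ under the assumed calibration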

The analysis stage investigates data dependence and locality and decides the offloading candidates (details in Section IV); it is the cornerstone of Eva-CiM. Several new ideas are introduced here, e.g., the instruction dependency graph used to organize and interpret the instruction execution and memory access information obtained from the modeling stage. The key analyses conducted by Eva-CiM include: (i) committed instruction queue and dependency analysis to automatically detect the underlying offloading patterns; (ii) memory access and request packet content analysis to determine the particular cache level for CiM; (iii) instruction trace reshaping to enable system-level profiling.

The profiling stage estimates performance and energy consumption based on the results from the analysis stage. For profiling, Eva-CiM employs a modified McPAT [39] and uses the reshaped instruction queue statistics to analytically compute the energy overheads of the different components in the system (details in Section V-C). The tool chain of Eva-CiM shown in Fig. 1 leverages four existing tools. HSPICE [42] and DESTINY [33] are used for memory cell and array modeling. GEM5 [38] is modified to include specialized probes to model applications, while McPAT [39] is enhanced to provide CiM system profiling.

Eva-CiM is not limited to a particular technology, architecture, development environment, or compiler. In other words, it provides a unified and easy-to-use interface for designers to evaluate the pros and cons of CiM based systems. Hence, Eva-CiM can support design space exploration (details in Section VI) including (but not limited to):

  • Study of the ratio of CiM-supported instructions over regular memory accesses to decide if an application is CiM-friendly or not;

  • Comparison of various device technologies;

  • Determination of the best memory hierarchy level for the CiM module given the concerned applications.

In the following sections, we detail the design of Eva-CiM by first discussing the analysis stage, as it is the core of Eva-CiM, and then the modeling and profiling stages. We finally present interesting findings obtained by applying Eva-CiM, which show its capability and the necessity of such a framework for CiM based systems.

IV Analysis

A key task in evaluating the benefit of adding a CiM module to the overall system is to determine what can be offloaded to the CiM module. As discussed in Section II, most prior works either rely on manual analysis or limit the system to a specialized, simple architecture that satisfies the conditions for offloading candidates. Clearly, this requires non-trivial effort for design space exploration, especially when considering complex programming and architecture development environments. Given prior progress demonstrating the viability of the CiM paradigm from the circuit level up, we focus on identifying suitable instructions to offload in real-world applications and then evaluating their effects. As discussed in Section III, Eva-CiM presents a unified interface and development environment that hides the aforementioned complexities inside the framework, thereby allowing convenient and efficient design exploration. Specifically, Eva-CiM embeds a trace-driven analyzer in GEM5 [38] to analyze the committed instructions, and then identifies the proper candidates as well as data locality for CiM. In other words, programmers can rely on the framework without manually identifying the critical functions in the code for CiM or dealing with a complex development environment. To enable the proposed trace-driven analyzer, we need to answer the following key questions:

  • What instruction patterns can be offloaded to memory to maximize the benefit of CiM?

  • How to analyze the program dependencies to identify and select the proper patterns?

  • How to reshape the instruction queue after offloading to facilitate system level profiling?

IV-A Offloading Candidate Selection

We first examine which instruction patterns are CiM-suitable and can be offloaded to the CiM module. In general, an instruction that is suitable for CiM features source operands fetched from memory and a destination operand stored to memory. One common pattern that prior works [23, 24] rely on is a sequence of Load-Load-OP-Store instructions, as on the left of Fig. 3, in which two load operations obtain the source operands, one “OP” instruction (“add” in the figure) conducts a particular operation, and one store operation saves the result. This sequence can then be replaced by a single CiM instruction, e.g., an in-cache operation as on the right of Fig. 3 [23].

Fig. 3: An example of Load-Load-OP-Store pattern.

However, due to compiler optimizations and the usage of intermediate resources (e.g., integer and floating-point registers), such an exact Load-Load-OP-Store pattern rarely occurs during instruction execution. Instead, the Load-Load-OP-Store pattern adapts into multiple variants, as shown in Fig. 4(b) and (c), which differ from the original but are all suitable for CiM. Unlike the regular pattern in Fig. 4(a), Fig. 4(b) replaces one source operand with an immediate value, while Fig. 4(c) continues using the output before it is stored back to memory. Moreover, it is not uncommon for two or more such patterns to combine into a larger CiM-suitable pattern.

In order to capture the complex dependencies among instructions and help identify CiM-suitable patterns, we resort to a graph model called the Instruction Dependency Graph (IDG). In an IDG, a “node” is an instruction and a directed “edge” indicates the execution order of two instructions with a data dependency. Fig. 4 shows three example IDGs. If one straightforwardly built an IDG for all the instructions being fetched, the IDG would be overwhelmingly complicated. In Section IV-B, we present an approach to construct a more manageable IDG for a given program.

Besides the instruction execution patterns captured in an IDG, memory access information is also crucial for offloading candidate selection. For example, the operands of a candidate CiM operation should come from the same memory bank. Thus, for a leaf-node (load) instruction, we check whether its request address falls within the address range of the accessed memory objects and then obtain the corresponding Miss-Status Handling Register (MSHR) state [38]. We repeat this procedure until we find the memory hierarchy level that stores the data. Depending on the operations that the CiM unit supports, one or multiple sub-trees can be identified as offloading candidates within one IDG tree. Fig. 5 presents a simple example of the procedure for selecting offloading candidates, where the IDG tree contains three sub-trees that are identified as proper offloading patterns for CiM.

Fig. 4: IDGs for the original Load-Load-OP-Store pattern and its variants.
Fig. 5: An example of extracting offloading candidates from the committed instruction queue: (a) instruction snippet, (b) the corresponding IDG and its partition, (c) the resulting CiM offloading candidates where each triangle represents one instruction and the number inside the triangle represents its sequence index in the queue.

Procedure: Offloading candidate selection
Input: I-state for all instructions
Output: CiM operations
1. Build the register usage table (RUT) and the index hash table (IHT);
2. Build IDG trees for the committed instruction queue (CIQ);
3. Partition the IDG trees in terms of CiM-supported instructions, and extract the groups that conform to the offloading patterns;

Algorithm 1 Algorithm for offloading pattern selection.

To support IDG construction and data locality identification, we collect the set of data given in Table I for all the instructions in the committed instruction queue (CIQ), as only committed instructions matter for program execution. We refer to these data as the instruction state (I-state), which can be collected from both the CPU and memory, as shown in Fig. 2 (more details in Section V). The first three terms of the I-state describe when and where an instruction is committed and executed, while the last three terms detail the memory level as well as the execution status of a memory access/request instruction. Algorithm 1 summarizes the process of selecting offloading candidates once the I-state information is ready. Details about the construction of the various tables and the IDG are given in the next subsection.

I-state element Definition
Sequence index Location of the instruction in the committed instruction queue (CIQ)
Mnemonic code Assembly code for each instruction
Execution logic Triggered functional unit that executes the instruction
Request from master Request address range of a load instruction and its issuing time
Memory access Address range of accessed memory objects (cache and main memory)
Response from slave Hit/miss status of each memory access
TABLE I: I-state specification.
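For illustration, each committed instruction carries one I-state record. The following minimal Python sketch renders Table I as a data structure; the field names and types are illustrative assumptions, since the actual probes emit GEM5-internal objects.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class IState:
    # One record per committed instruction (cf. Table I).
    seq_index: int                       # location in the committed instruction queue
    mnemonic: str                        # assembly code of the instruction
    exec_logic: str                      # functional unit that executed it
    request: Optional[Tuple[int, int, int]] = None   # (addr_lo, addr_hi, issue_tick) for loads
    mem_object: Optional[str] = None     # cache/main-memory object whose range matched
    response_hit: Optional[bool] = None  # hit/miss status of the access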
Fig. 6: Procedure for IDG tree construction: (a) Instruction queue; (b) RUT and IHT; (c) IDG tree.

Procedure: IDG_tree_construction
Input: instruction queue CIQ, CiM-supported instruction set S, RUT, IHT
Output: IDG trees Trees
1. for each instruction inst in CIQ do
2.  if the operation type of inst is in S
3.   initialize a tree T with inst as its root node root
4.   AddChildNodes(root)
5.   append T to Trees
6.  endif
7. endfor
8. return Trees
9. SubProcedure: AddChildNodes(node)
10. if the left source operand of node is a register and node is not a leaf node
11.   (reg_l, loc_l) ← lookup IHT by the sequence index of node
12.   left ← the instruction at RUT[reg_l][loc_l]; add left as the left child of node
13.  if the operation type of left is Load
14.   mark left as a leaf node
15.  endif
16. endif
17. if the right source operand of node is a register and node is not a leaf node
18.   (reg_r, loc_r) ← lookup IHT by the sequence index of node
19.   right ← the instruction at RUT[reg_r][loc_r]; add right as the right child of node
20.  if the operation type of right is Load
21.   mark right as a leaf node
22.  endif
23. endif
24. if left exists and is not a leaf node
25.  AddChildNodes(left)
26. endif
27. if right exists and is not a leaf node
28.  AddChildNodes(right)
29. endif

Algorithm 2 Algorithm for IDG tree construction.

IV-B IDG Construction

Here we present a method to reduce the effort and complexity of constructing the IDG for a given program. It is noted that, with the “store” nodes in Fig. 4 removed, the IDG simply consists of many flipped trees. Thus, we introduce a compact tree structure with the following restrictions to reduce the redundancy in the IDG:

  • With the “store” node removed, the “OP” instruction is the root of the tree and must be an operation that CiM supports.

  • The left and right children of a node in the tree represent the instructions that feed source data to the node.

  • The leaf node needs to be either a load instruction or an immediate value.

  • An offloading candidate can include one or more connected nodes in the same tree.

  • The data of an offloading candidate need to be in the same memory bank.

Fig. 6 demonstrates the procedure for tree construction. The instruction queue on the left of Fig. 6 lists the instructions as well as their indices in the CIQ. In order to avoid the complexity of a recursive search during IDG tree construction, we introduce the concept of a Register Usage Table (RUT), as shown in the middle of Fig. 6. The RUT keeps track of the commit time (i.e., the sequence index defined in Table I) at which a register is used as the destination operand. This exploits the fact that two connected nodes in an IDG tree must share at least one register. Each row in the RUT corresponds to one register and maintains a list of sequence indices of the instructions that write the register. An auxiliary index hash table (IHT) is also used to keep track of the source operand information for an instruction, with each entry corresponding to an instruction in the CIQ. The IHT records the registers used as source operands by an instruction and the positions of those registers' most recent entries in the RUT at the time the instruction is recorded. When a CiM-supported instruction is added as a node to an IDG tree, we can use its sequence index and the IHT to find its source registers. Then, with the RUT, we can locate the instructions that committed the last use of those registers as destinations, which are exactly the child nodes to be added to the tree. Algorithm 2 summarizes the complete algorithm for tree construction. By repeating this procedure, we can build the IDG trees with O(N) complexity, where N is the number of nodes in the trees. As shown on the right of Fig. 6, each node in the tree contains the information of operator, operands, and its sequence index.

For the example in Fig. 6, when the instruction indexed at 3268 is added to the tree, we first find its source registers through the IHT, which also tells us the position at which each source register's latest definition appears in the RUT at the time the instruction is committed. Then, in the RUT, that entry in the register's list identifies the last instruction that used the register as a destination. In other words, the instruction indexed at 3266 is exactly the left child node to be added to the tree. The same procedure is repeated for the right child. Since the two child nodes happen to be “LOAD” operations, the tree terminates at those two leaf nodes, as shown on the right of Fig. 6.
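The table-driven construction can be sketched in a few lines of Python, assuming simplified instruction records and a CIQ list indexed by sequence number; the record fields and the CIM_OPS set below are illustrative assumptions, not Eva-CiM's actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Inst:
    seq: int            # sequence index in the CIQ
    op: str             # e.g., "add", "ldr"
    dst: Optional[str]  # destination register, if any
    srcs: List[str]     # source registers (immediates omitted)

@dataclass
class Node:
    inst: Inst
    children: List["Node"] = field(default_factory=list)

CIM_OPS = {"add", "and", "orr", "eor"}  # assumed CiM-supported operations

def build_idg_trees(ciq: List[Inst]) -> List[Node]:
    rut = {}    # RUT: register -> seq indices of instructions writing it
    iht = {}    # IHT: seq index -> [(src_reg, position in rut[src_reg])]
    trees = []
    for inst in ciq:                  # ciq assumed ordered/indexed by seq
        # record, at commit time, where each source was last defined
        iht[inst.seq] = [(r, len(rut.get(r, [])) - 1) for r in inst.srcs]
        if inst.dst is not None:
            rut.setdefault(inst.dst, []).append(inst.seq)
        if inst.op in CIM_OPS:
            root = Node(inst)
            add_child_nodes(root, ciq, rut, iht)
            trees.append(root)
    return trees

def add_child_nodes(node, ciq, rut, iht):
    for reg, pos in iht[node.inst.seq]:
        if pos < 0:                   # no prior writer of this register
            continue
        producer = ciq[rut[reg][pos]] # last writer of reg before this use
        child = Node(producer)
        node.children.append(child)
        if producer.op != "ldr":      # loads terminate the tree as leaves
            add_child_nodes(child, ciq, rut, iht)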

Then, for any application of interest, Eva-CiM can track the virtual addresses with the proposed IDG, thereby identifying data locality for CiM. Beyond that, the framework may also adopt prior circuit- and architecture-level efforts [18, 20] that use address translation techniques or the memory controller to satisfy the offloading condition, i.e., that the data accessed by CiM instructions are located on the same bit-line. For example, [18] uses a translation mechanism to allocate specific data structures into contiguous regions within the virtual memory space and then maps them to the physical memory space, ensuring that the original data reside in the same cache array. [20] further improves the cache organization and address translation for operand locality by modifying the cache controller design to deal with offloading address constraints. Since Eva-CiM aims to provide a system level evaluation framework that discovers how much a technology and architecture may benefit from CiM, we do not place our focus on circuit or architectural innovations; these can be enabled in Eva-CiM with the corresponding architectural models to provide more detailed simulation and higher evaluation accuracy.

IV-C Trace Reshaping for System Profiling

After the offloading candidates are determined, the last task of the analysis stage is to reshape the instruction trace to meet the demands of the profiling stage (to be discussed in detail in Section V-C). The instruction trace reflects the actual execution flow of a program. First, we need to reallocate the execution of the selected instructions to the level of memory where the source data reside. Second, we need to remove those selected offloading instructions from the pipeline, re-organize data locality in the memory, and replace them with the corresponding CiM instructions. The reshaped trace then contains both regular and CiM-supported operations. Through reshaping the instruction trace, all instructions are explicitly allocated to either the functional units on the CPU or the CiM module at the appropriate level of the memory hierarchy. The profiler (discussed in the next section) then tracks the activities of both the CPU and the CiM module (including, e.g., instruction types, ALU accesses, and L1 hits/misses) to estimate the energy of each module as well as the overall system.

The remaining issue for reshaping is managing data locality and dependency. Note that only when all the operands are available in the same cache level can we issue the operation to the cache sub-array. Otherwise, we need to write the operand at the higher-level cache back to the lower-level cache and forward the other operand to the same level [20]. Fig. 5(c) shows an example of data dependency in which the output of one tree is the input to another. In Eva-CiM, with a regular compiler, we introduce a post-processing step to approximately mimic the CiM behavior. Eva-CiM first traverses all the trees in post order to ensure the right execution sequence. Then, if two sub-trees are extracted from the same IDG tree, Eva-CiM combines them into one in-cache operation to move data and manage data locality within the bank.
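Continuing the sketch above, reshaping can be pictured as deleting the offloaded instructions from the host trace and emitting one CiM operation per selected sub-tree in post order. Here node.level, the cache level holding the operands, is an assumed attribute resolved during the locality analysis.

def post_order(node):
    for c in node.children:
        yield from post_order(c)
    yield node

def reshape(trees, trace):
    cim_ops, offloaded = [], set()
    for root in trees:                 # post order preserves data dependencies
        for node in post_order(root):
            offloaded.add(node.inst.seq)
        # one CiM op replaces the whole sub-tree, issued at the cache level
        # that holds the operands (root.level assumed from locality analysis)
        cim_ops.append(("CiM-" + root.inst.op.upper(), root.level))
    host_trace = [i for i in trace if i.seq not in offloaded]
    return host_trace, cim_ops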

V Modeling and Profiling

In this section, we present the details of the modeling and profiling stages. The modeling stage provides the instruction execution and memory access information for a given program; it also provides the CiM model data to the profiling stage. The profiling stage then uses the output of the analysis stage as well as the CiM module data to obtain the overall system energy consumption.

Probe name Monitored object
InstProbe Time and execution in terms of pipeline status for each instruction
PipeProbe Statistics of triggered function units for completing one instruction in CPU
RequestProbe Track of request packet transmitted from LSQ including its issue time and address
AccessProbe Record of memory access including time, access object, and hit/miss status
TABLE II: Probes attached to CPU and memory.

V-A Application Modeling

Fig. 7: InstProbe and PipeProbe attached to an Out-of-Order CPU model.

As stated earlier, application modeling aims to extract information about when and where instructions are executed and memory accesses occur. More precisely, application modeling produces the I-state information (see Table I) needed by the analysis stage. We propose to leverage GEM5, augmented with carefully placed probes, to obtain the I-state information.

Specifically, Table II summarizes the four probes as well as their monitored objects. InstProbe and PipeProbe monitor the execution status and triggered functions in the CPU, while RequestProbe and AccessProbe monitor memory behaviors. Below we discuss these two sets of probes in more detail.

InstProbe collects the time and execution status of each instruction in terms of pipeline stages, and PipeProbe collects which functional units are triggered by each instruction and when. There are two complications when collecting these data. First, when resources are available for execution, multiple instructions are issued from the Issue Queue (IQ) to several functional units. Second, because of branch mis-prediction, only committed instructions are included in the CIQ used for our offloading candidate analysis. Thus, these probes must be carefully placed to ensure that correct information is collected.

To illustrate how these probes can be placed, we use the example of the ARM ISA on a physical-register-file architecture with an out-of-order pipeline. Seven pipeline stages are executed in this architecture, as shown in Fig. 7. For each committed instruction, the InstProbe records the tick numbers of the different pipeline stages according to the Program Counter (PC) value. Meanwhile, the PipeProbe keeps track of the instruction index in the CIQ as well as the statistics of the triggered functional units (e.g., IQ reads/writes, ROB reads/writes). The information collected by the two probes is processed to extract the sequence index, assembly code and execution logic included in the I-state. We then utilize the I-state to obtain the lifetime of an instruction and evaluate the overhead when an instruction is moved from the CPU to the CiM module.

For RequestProbe and AccessProbe, Fig. 8 describes where they are inserted and what information they collect. It is noted that the range of accessible addresses varies with the memory hierarchy level. Thus, a RequestProbe records not only the tick of instruction execution and its master port, but also the address range of the “Load” instruction. Similarly, an AccessProbe collects the tick information, the master port, the hit/miss statistics of an address range, and the Miss-Status Handling Register (MSHR) status.

The two probes effectively capture the packets between the LSQ units and the memory objects, so we can accurately obtain each access instruction and its request address. Once a packet is transported to the memory, we can track the packets among the different levels of the memory hierarchy using the response statistics and the cache protocol. Note that the probed information depends on the application and the architecture, but is independent of the memory technology.
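The level-resolution step can be sketched as a walk down the hierarchy that matches the RequestProbe address against the AccessProbe records of each memory object; the data layout below is an assumption for illustration.

def locate_level(req_addr, memory_objects):
    # memory_objects: ordered list of (level_name, accesses), L1 first,
    # where accesses maps (addr_lo, addr_hi) ranges to hit/miss status
    # as collected by the AccessProbes (MSHR state omitted here).
    for level, accesses in memory_objects:
        for (lo, hi), hit in accesses.items():
            if lo <= req_addr <= hi and hit:
                return level          # the level where the data resides
    return "DRAM"                     # missed everywhere: served by main memory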

Fig. 8: RequestProbe and AccessProbe for request packet monitoring.
Technology  Level  Config       Non-CiM read  CiM-OR  CiM-AND  CiM-XOR  CiM-ADDW32
SRAM        L1     4-way/64kB   61            71      72       79       79
SRAM        L2     8-way/256kB  314           341     344      365      365
FeFET       L1     4-way/64kB   34            35      88       105      105
FeFET       L2     8-way/256kB  70            72      146      205      205
TABLE III: Cache energy (pJ) per operation in different configurations for SRAM and FeFET-based CiM architectures.

V-B CiM Module Modeling

Besides the aforementioned application-related behaviors, the system-level benefits offered by CiM depend on the CiM construction and technology. A CiM module typically consists of a memory array and additional circuitry, often at the sense amplifier (SA) level, responsible for generating output(s) that correspond to selected logical/arithmetic operations. SRAM-based caches that can perform bitwise AND, NOR, and XOR operations, among other computations, are proposed in [20]. Alternatively, emerging NVMs have attractive features such as high density, low leakage power, low dynamic energy and fast access times, making them good candidates for the design of CiM main memories or caches. As pointed out in Section II, STT-RAM, ReRAM, and FeFET-RAM are among the alternatives studied for the design of CiM architectures. Several CiM architectures proposed for NVMs also make use of a customized SA, in a similar way to the SRAM-based CiM approach [23, 24, 22]. Among the CiM architectures devised for NVMs, the FeFET-based one is probably the most suitable for cache implementations due to its low write energy and latency, as reported in [24]. Thus, we pick SRAM- and FeFET-based CiMs as case studies for the proposed Eva-CiM framework, to be presented in Section VI.

Fig. 9: Flow for CiM module modeling for different operations.

Fig. 9 illustrates our CiM module evaluation flow. We employ the CMOS and FeFET SPICE models from [6, 5] to evaluate the delay and energy of individual 6T-SRAM and 2T+1FeFET memory cells, as well as the customized SAs proposed in [20, 24]. To ensure a fair comparison between the two designs, we (i) adopt the same 45nm technology node in both designs, and (ii) port the full-adder part of the SA described in [24] to the SRAM-based CiM [20]. Thus, the SRAM-based CiM and the FeFET-based CiM can perform similar operations. We then feed the SPICE-level results into a version of DESTINY [33] that has been modified to support the evaluation of FeFET-based memories [24]. DESTINY, a microarchitecture-level tool for modeling 3D (and 2D) cache designs using SRAM, embedded DRAM, and NVMs (e.g., STT-RAM, ReRAM, FeFET), has been widely validated against multiple industrial prototypes [33]. Table III gives the energy per operation (non-CiM read, CiM-OR, AND, XOR, ADD, etc.) for different cache configurations obtained by the proposed models for both SRAM and FeFET-RAM, where non-CiM refers to regular operations in this paper. Although the focus of this work is placed upon SRAM and FeFET-RAM, other technologies (and designs), such as ReRAM, can be readily supported as long as the latency and energy of each in-memory operation are specified.

V-C Profiling

V-C1 Energy Evaluation

Given the models and the analyzer from the previous sections, we still need a system-wide profiler to combine the models at the different design levels and report the overall system energy profile. Instead of building an energy model from scratch, we modify McPAT [39] to evaluate the energy of both the CiM module and the other functional units in the processor. Fig. 10 shows the structure of our system-level profiler, which relies on the application model (IDG), the CiM model, the architecture parameters, and the modified McPAT. The original McPAT only computes energy and area for regular functional units using performance counter information (a set of statistics) extracted from an architectural simulator, GEM5 in our work. In order to support the new CiM instructions, we employ the CiM model for CiM operations, as discussed in the previous subsection. Moreover, since some instructions are moved to the CiM module, we also need to re-evaluate the energy of the host CPU.

Fig. 10: Architecture for CiM-enabled system profiler.

We therefore modify and embed the following performance counters and models in McPAT: (i) instruction type in the pipeline and its count; (ii) access time of the functional units in the pipeline; (iii) count of cache/DRAM reads/writes and hits/misses; (iv) CiM operation type and its count. Additional performance counters are added for CiM operations to ensure a unified energy model in the profiler. We can then safely invoke McPAT with the modified performance counters and memory array parameters to estimate the energy consumption of the entire system.
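In effect, the energy accounting reduces to summing event counts times per-event energies. The sketch below illustrates this with the SRAM L1 values from Table III plus an assumed host-side entry; the real flow delegates the host side to the modified McPAT rather than a flat lookup table.

# Per-event energies (pJ): SRAM L1 values from Table III; the host-side
# "alu" entry is an assumed placeholder for illustration only.
ENERGY_PJ = {
    ("host", "alu"): 2.0,
    ("SRAM-L1", "read"): 61.0,         # non-CiM read
    ("SRAM-L1", "CiM-ADDW32"): 79.0,   # CiM add
}

def system_energy(counters):
    # Total energy as event count times per-event energy, mirroring how
    # the modified McPAT consumes the reshaped performance counters.
    return sum(n * ENERGY_PJ[ev] for ev, n in counters.items())

print(system_energy({("host", "alu"): 500,
                     ("SRAM-L1", "read"): 100,
                     ("SRAM-L1", "CiM-ADDW32"): 20}))  # 8680.0 pJ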

V-C2 Performance Evaluation

Fig. 11: Access latency (cycles) of Non-CiM and CiM operations for SRAM and FeFET technologies.

Although CiM instructions offload work from the CPU, they incur additional memory access time. As reported in [24], CiM operations (e.g., AND, ADD) may consume more time than regular data accesses due to internal data migration and logic operations. Thus, to estimate the system performance, we need to understand the impact of instruction offloading on both the CiM module and the host CPU. In particular, we need to extract the offloading ratio, the cycles per instruction (CPI) and the access latency of CiM operations to calculate the total time.

Eva-CiM records the execution profile of an instruction stream, which includes the fraction of instructions that would have been committed on the CPU but are now transferred to the CiM module. In a typical pipelined processor, there are two types of stalls that may impact CPI: memory accesses and pipeline bubbles. While the number of stalls due to memory accesses changes when CiM instructions replace regular instructions, the number of pipeline stalls may actually be reduced for CiM, since some instructions are transferred to the CiM module. However, this difference is rather small and is averaged out over the entire program trace. We therefore assume that, while some instructions are removed from the CPU and added as CiM instructions, the system keeps a constant CPI, i.e., a constant execution efficiency.

The remaining parameters to extract are the access latencies of the CiM operations. In Eva-CiM, we employ HSPICE and DESTINY to estimate the latency of CiM and non-CiM operations. With the system clock frequency set at 1 GHz and the cache configuration the same as in Table III, Fig. 11 shows the resulting access cycles of SRAM and FeFET for the particular operations and configurations. Note that the memory access time (or access cycles for a given clock frequency) in Fig. 11 is the time until the operand is read or computed. For the SRAM cache, the difference in access latency between non-CiM reads (the first operation in the figure) and CiM logic operations (the second to fourth operations in the figure, e.g., OR) is almost negligible. Since the actual time to complete an access instruction includes the data transfer time in the memory hierarchy, the access time, and even the re-access time if a hit fails, we may safely ignore the subtle difference between those operations and treat them with the same latency in Eva-CiM. On the other hand, a CiM ADD operation in a cache bank takes almost four more cycles to complete than a non-CiM read, which may result in severe pipeline stalls. Eva-CiM counts this actual access time for CiM ADD instructions when profiling the system.
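Under these assumptions, the performance estimate reduces to the following sketch, where cpi is the measured host CPI and extra_cycles lists only the operations that are slower than a regular access (about four extra cycles for CiM ADD in the Table III SRAM configurations); the names are illustrative.

def exec_cycles(n_host_insts, cpi, cim_op_counts, extra_cycles):
    # Host instructions retire at the measured (constant) CPI; CiM logic
    # ops are treated like regular accesses, while slower ops add stalls.
    cycles = n_host_insts * cpi
    for op, n in cim_op_counts.items():
        cycles += n * extra_cycles.get(op, 0)
    return cycles

# e.g., exec_cycles(1_000_000, 1.2, {"CiM-ADDW32": 5_000}, {"CiM-ADDW32": 4})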

Category Application
Machine learning Naive Bayes (NB), decision tree (DT), support vector machine (SVM), linear regression (LiR), K-means (KM)
String processing Longest common subsequence (LCS)
Multimedia app. MPEG-2 decode (M2D)
Graph processing Breadth first search (BFS), depth first search (DFS), betweenness centrality (BC), single-source shortest path (SSSP), connected component (CCOMP), page rank (PRANK)
SPEC 2006 Astar, H264ref, Hmmer, Mcf
TABLE IV: Benchmark applications.

VI Experimental Results

We begin this section by comparing the results obtained from Eva-CiM with those obtained by DESTINY [33] and by [23], to help validate Eva-CiM. We then present a number of simulation studies conducted with Eva-CiM to demonstrate its capabilities and to provide insights on how various factors influence the benefits a program can obtain from a CiM module. With Eva-CiM, designers can explore CiM based architectures with different design options and then answer the three key design questions raised in Section I. Note that our goal is not to highlight the benefits of CiM, which have already been shown in prior works. Instead, we aim to investigate the pros and cons from a system perspective regarding performance and energy consumption, thereby gaining insights into the design trade-offs for CiM based systems.

All the experiments are based on an ARM Cortex-A9 out-of-order core with a 1.0 GHz system clock and 512 MB of main memory. The cache is configured with different capacities and associativities in our experiments. Here we use the CMOS SRAM design of [20] for the CiM implementation, in which all levels of the cache hierarchy are capable of conducting CiM operations. Our experiments employ 17 benchmarks from a wide range of applications based on prior works [23, 32, 20, 37, 22], as summarized in Table IV. They are representative of many typical accelerator workloads and stress Eva-CiM's modeling and profiling capabilities across various dimensions.

Fig. 12: Comparisons on the CiM-supported memory accesses between Eva-CiM and [23].

VI-A Comparison and Validation

Model         CiM energy (nJ)  Non-CiM energy (nJ)
DESTINY [33]  455.49           124.43
Eva-CiM       565.18           154.40
Deviation     24.0%            24.0%
TABLE V: Energy model comparison with DESTINY [33].
Benchmark           NB    DT    SVM   LiR   KM    LCS   M2D   BFS   DFS   BC    SSSP  CCOMP PR    astar h264ref hmmer mcf
Speedup             1.51  1.52  1.42  1.24  1.30  1.31  1.34  1.40  1.55  0.99  1.34  1.52  1.42  1.28  1.17    1.36  1.27
Energy improvement  3.28  5.12  2.83  2.68  3.21  4.31  4.85  2.33  1.98  1.30  2.33  3.46  4.54  5.26  2.05    2.87  3.58
Ratio: processor    1.01  0.92  0.92  1.16  0.91  0.91  1.01  0.98  1.53  0.90  1.12  1.01  0.91  0.97  0.86    0.93  0.93
Ratio: caches       -0.01 0.08  0.08  -0.16 0.09  0.09  -0.01 0.02  -0.53 0.10  -0.12 -0.01 0.09  0.03  0.14    0.07  0.07
TABLE VI: Speedup, energy improvement and improvement breakdown (between processor and caches) for CiM based vs. non-CiM systems.
Fig. 13: MACR for different benchmarking programs (top); Breakdown of MACR into L1 accesses and other accesses for different benchmarking programs (bottom).

The behavior of CiM depends not only on the benchmarks, but also on the entire procedure of compiling, decoding and execution. Thus, deviations in the compiler, core architecture, or memory hierarchy may all impact the results. It is noted that much of the existing literature on CiM accelerators focuses only on the design and optimization of the computational components and internal memory of the CiM module, while assuming that all the needed data already reside in CiM. Instead, Eva-CiM is designed to assess how data movement and interactions affect a CiM based system.

We conduct the validation by comparing the two major parts of Eva-CiM using one application program, LCS: the energy estimation against DESTINY, and the CiM operation count against [23]. We adopt the same experimental setup for all the tools for a fair comparison, in which all levels of the cache hierarchy (consisting of a 32KB/4-way L1 and a 256KB/8-way L2) are capable of conducting CiM operations.

We first use Eva-CiM to obtain the energy consumption of an LCS trace with around 3000 instructions. Then we use DESTINY and the modified McPAT to estimate the energy of these instructions, as in Sections V-B and V-C. As shown in Table V, the energy estimates of the two approaches differ by around 24% for both the CiM and non-CiM instructions. This difference is reasonable since, though Eva-CiM employs DESTINY for per-operation energy estimation, it also accounts for the impact of the multi-level cache hierarchy, such as cache misses. For the performance comparison, since [23] uses an emulation platform with a simplified in-order processor and a 1 MB SPM, we modify the evaluation architecture accordingly with a cache size of 1 MB. Note that the comparison focuses on the count of instructions that are offloaded to the CiM module. We execute the LCS code 20 times with randomly generated inputs and break down the memory accesses using an approach similar to [23]. As illustrated in the histogram on the right of Fig. 12, Eva-CiM selects around 65% of the memory accesses for offloading to CiM, while [23] reports 58%. This discrepancy is mainly due to the differences between the two underlying ISAs and the higher complexity of the memory hierarchy used in Eva-CiM compared with the SPM structure used in [23].

In summary, both the energy and performance comparisons with existing works show that Eva-CiM's results are close enough to give us confidence in its effectiveness. We next use Eva-CiM to investigate the impact of the CiM module on energy and performance, as well as various design options.

Fig. 14: Energy improvements for CiM with different cache configurations.
Fig. 15: Energy improvement comparison among CiM supported by only L1, CiM supported by only L2 and CiM supported by both L1 and L2.

VI-B Performance and Energy Evaluation for Systems with and without a CiM Module

Eva-CiM provides a flexible, modular simulation framework that makes it possible to explore CiM architectures by offering a diverse set of evaluation capabilities. This subsection aims to show that the tool can support in-depth investigations of whether it is beneficial to include a CiM module in the system, a key question many designers are interested in.

We start our investigations with the performance comparison. The speedup of the CiM over the non-CiM system is shown in the second row of Table VI, and ranges from 1.0-1.5× across the benchmarks. Meanwhile, it is noted that the performance of some benchmarks actually degrades, e.g., BC in Table VI.

We then evaluate the total energy, including both the host CPU and the caches, for the aforementioned application benchmarks, and report the energy improvements. We focus on the energy effect caused by CiM in the cache, without considering the subtle influence on the host CPU, and compute the ratio between the energy variation and the baseline energy. The third row of Table VI shows the total energy improvements, defined as the ratio of the baseline system energy (without a CiM module) to the energy with a CiM module. The energy improvements range from 1.3-6.0× across the applications. The energy improvements are contributed by both the CiM module (caches) and the host CPU, and their breakdown is shown in the last two rows of Table VI. Note that the energy improvement is mainly contributed by the host side, which is expected due to the reduced number of memory accesses. Across the set of benchmarks we observe mixed results: some programs show positive energy improvement contributions from the CiM module, while others show negative contributions.
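One plausible reading of the breakdown rows, consistent with the processor and cache entries summing to about 1.0 in every column, is that each row gives that component's share of the total energy reduction; the short sketch below works this out for the NB column under that assumption.

# Assumed interpretation of Table VI (NB column), for illustration only.
e_base = 1.0                            # normalized baseline (non-CiM) energy
improvement = 3.28                      # total energy improvement, NB
saving = e_base - e_base / improvement  # total reduction: ~0.695
proc_share, cache_share = 1.01, -0.01   # breakdown rows, NB
print(proc_share * saving)              # host-side reduction: ~0.702
print(cache_share * saving)             # the cache side slightly costs energy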

To help understand the experimental results, we first introduce two terms: CiM-favorable and CiM-unfavorable. CiM-favorable programs tend to achieve greater energy improvement from CiM. For example, DT, LCS, PR, and astar can be classified as CiM-favorable programs and hence are more suitable for CiM-based systems. On the other hand, CiM-unfavorable programs (e.g., LiR, DFS) receive much more limited improvements.

VI-C Impact of Benchmarks

Eva-CiM can be used to study the program characteristics that influence whether a program is CiM-favorable or not. In many prior works with non-cacheable memory, e.g., [17, 22], any program with significant memory accesses exhibiting good data locality is considered CiM-favorable. However, very few provide a detailed breakdown of memory accesses, especially when the CiM module functions as a general purpose computing block. Due to system complexity, the multi-level memory hierarchy and the lack of CiM-centric compiler support, it is possible that not all data locality can be exploited by CiM, as assumed by prior work. We have conducted experiments on the given application programs to investigate the percentage of instructions that have the “proper” data locality to allow the associated operations to be migrated to the CiM module.

To capture this notion of “proper” data locality, we introduce a metric called the memory access conversion ratio (MACR), which is the ratio between the accesses with appropriate locality that can be replaced by CiM operations and the regular memory accesses. Fig. 13 presents the breakdown of memory accesses according to MACR. The results clearly show that, for a given system architecture with a specific CiM design, MACR can be smaller than one even for programs that are commonly considered data-intensive, e.g., M2D. For such cases, CiM actually provides relatively low energy improvements, as shown in Table VI. Based on the data shown in Fig. 13 and Table VI, a high MACR (e.g., 50% or more) is an indicator that a program is CiM-favorable.
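As a minimal sketch, MACR can be computed directly from the analyzer's access counts; the names here are illustrative.

def macr(cim_convertible, regular_accesses):
    # Accesses with the proper locality that CiM operations can replace,
    # relative to the regular memory accesses.
    return cim_convertible / regular_accesses

# e.g., macr(650, 1000) = 0.65; values of roughly 0.5 or more mark a
# program as CiM-favorable (cf. Fig. 13 and Table VI).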

VI-D Impact of System Configuration and Architecture

Eva-CiM helps designers study the impact of the system configuration and architecture on a CiM system. Fig. 14 illustrates the results for different cache configurations. Here we have three configurations: (i) 32KB/4-way L1 and 256KB/8-way L2, (ii) 64KB/4-way L1 and 256KB/8-way L2, (iii) 64KB/4-way L1 and 2MB/8-way L2. It is clear that most applications (e.g., NB, LCS, SSSP) experience higher benefits for larger cache sizes. However, it is also noted that while a larger cache size helps CiM, the energy per operation also increases (as shown in Table III), which in turn reduces the benefit from CiM.

In addition, we investigate the impact of which cache levels support CiM. Fig. 15 depicts the energy improvements when CiM instructions are supported by L1 only, by L2 only, and by both. We use a 32KB/4-way L1 and a 256KB/8-way L2 for the CiM implementation. In general, applications exhibit lower energy improvements when CiM is only supported by L2, which is due to the more frequent L1 accesses in a system with a complete memory hierarchy, as well as the smaller energy overhead of CiM operations in L1.

Fig. 16: Benefits for CMOS SRAM vs. FeFET-RAM: Energy improvement (top); Performance improvement (bottom).

VI-E Impact of Technology

We utilize Eva-CiM to explore the performance and energy benefits of using different device technologies for CiM. Owing to the flexibility of the profiler, Eva-CiM supports multiple memory technologies, given device parameters obtained from circuit modeling as in Section V-C. Here we present the performance and energy comparison between CMOS SRAM and FeFET-RAM in Fig. 16. The energy improvements are normalized to the non-CiM baseline system using CMOS SRAM. We observe that the energy benefits of FeFET-based CiM are about 50-70% higher, consistently across all the benchmarks. Additionally, FeFET-RAM outperforms CMOS SRAM in terms of performance due to the lower latency of its CiM operations. Thus, Eva-CiM provides researchers with the capability for design space exploration to make better trade-offs among different technologies.
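The normalization used in Fig. 16 can be expressed in a few lines; the sketch below uses placeholder energy numbers and benchmark names, not measured values, and simply divides each CiM system's energy into the shared non-CiM CMOS-SRAM baseline so the two technologies land on one axis.

```python
# Normalize each technology's CiM energy to the non-CiM CMOS SRAM baseline.
baseline_sram = {"NB": 10.0, "LCS": 12.0}          # non-CiM, CMOS SRAM
cim_energy = {
    "SRAM":      {"NB": 4.0, "LCS": 3.0},          # CiM, CMOS SRAM
    "FeFET-RAM": {"NB": 2.6, "LCS": 1.9},          # CiM, FeFET-RAM
}

for tech, per_bench in cim_energy.items():
    for bench, e in per_bench.items():
        print(f"{tech:10s} {bench}: {baseline_sram[bench] / e:.2f}x")
```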

VII Conclusions

This paper presents a system-level evaluation framework, Eva-CiM, to predict the performance and energy consumption of CiM-based systems for different architectures/configurations, technologies, and benchmarks. Unlike prior work, Eva-CiM relies on a novel IDG-based analyzer to automatically detect offloading candidates for CiM and uses multi-level modeling to provide comprehensive evaluations of a CiM-based system. Eva-CiM is capable of conducting quantitative investigations and rapid design exploration, thereby further establishing the feasibility of wide CiM adoption in the near future.

We validate Eva-CiM against two existing works, [23] and [33], with respect to both access count and energy consumption. We then investigate various data-sensitive benchmarks to explore the number of CiM-supported memory accesses, the system speedup, and the energy improvement over a non-CiM design. We find that, for a system with a multi-level memory hierarchy, a data-sensitive benchmark is not necessarily CiM-sensitive. Moreover, a larger memory size is not necessarily beneficial to CiM, due to the increased energy per CiM operation in the CiM module itself. Finally, Eva-CiM evaluates the impact of different architecture configurations and technologies; the results show that CiM can provide a 1.0-1.5X system speedup, with energy improvements of 1.3-6.0X for SRAM and 2.0-7.9X for FeFET-RAM, respectively.

References

  • [1] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “FloatPIM: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA ’19.   New York, NY, USA: ACM, 2019, pp. 802–815. [Online]. Available: http://doi.acm.org/10.1145/3307650.3322237
  • [2] K. Chen, S. Li, N. Muralimanohar, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory,” in IEEE/ACM Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 33–38.
  • [3] M. Xie, S. Li, A. O. Glova, J. Hu, Y. Wang, and Y. Xie, “AIM: Fast and energy-efficient AES in-memory implementation for emerging non-volatile main memory,” in IEEE/ACM Design, Automation Test in Europe Conference Exhibition (DATE), March 2018, pp. 625–628.
  • [4] Z. Chowdhury, J. D. Harms, S. K. Khatamifard, M. Zabihi, Y. Lv, A. P. Lyle, S. S. Sapatnekar, U. R. Karpuzcu, and J. Wang, “Efficient In-Memory Processing Using Spintronics,” IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 42–46, Jan 2018.
  • [5] A. Aziz, S. Ghosh, S. Datta, and S. K. Gupta, “Physics-based circuit-compatible spice model for ferroelectric transistors,” IEEE Electron Device Letters, vol. 37, no. 6, pp. 805–808, 2016.
  • [6] R. Vattikonda, W. Wang, and Y. Cao, “Modeling and minimization of pmos nbti effect for robust nanometer design,” in IEEE/ACM Design Automation Conference (DAC), 2006, pp. 1047–1052.
  • [7] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994–1007, July 2012.
  • [8] Intel Corp., “Nios II Processor,” Mountain View, CA, USA, 2017.
  • [9] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “MAGIC-Memristor-Aided Logic,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895–899, Nov 2014.
  • [10] M. Zabihi, Z. Chowdhury, Z. Zhao, U. R. Karpuzcu, J. Wang, and S. Sapatnekar, “In-memory processing on the spintronic CRAM: From hardware design to application mapping,” IEEE Transactions on Computers, pp. 1–1, 2018.
  • [11] M. Gokhale, B. Holmes, and K. Iobst, “Processing in Memory: The Terasys Massively Parallel PIM Array,” Computer, vol. 28, no. 4, pp. 23–31, April 1995.
  • [12] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, “Gpus and the future of parallel computing,” IEEE Micro, vol. 31, no. 5, pp. 7–17, 2011.
  • [13] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, “Smart memories: a modular reconfigurable architecture,” in IEEE/ACM Symposium on Computer Architecture (ISCA), June 2000, pp. 161–171.
  • [14] M. Oskin, F. T. Chong, and T. Sherwood, “Active pages: a computation model for intelligent memory,” in IEEE/ACM Symposium on Computer Architecture (ISCA), July 1998, pp. 192–203.
  • [15] JEDEC Solid State Technology Association, “High Bandwidth Memory (HBM) DRAM,” Standard JESD235, 2013.
  • [16] Hybrid Memory Cube Consortium, “Hybrid Memory Cube Specification Rev. 2.0,” 2013.
  • [17] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,” in ACM International Symposium on High-performance Parallel and Distributed Computing (HPDC).   New York, NY, USA: ACM, 2014, pp. 85–98.
  • [18] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation,” in IEEE International Conference on Computer Design (ICCD), Oct 2016, pp. 25–32.
  • [19] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2015, pp. 283–295.
  • [20] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 481–492.
  • [21] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 12, pp. 4219–4232, Dec 2018.
  • [22] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in IEEE/ACM Design Automation Conference.   New York, NY, USA: ACM, 2016, pp. 173:1–173:6. [Online]. Available: http://doi.acm.org/10.1145/2897937.2898064
  • [23] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in Memory With Spin-Transfer Torque Magnetic RAM,” IEEE Trans. Very Large Scale Integr. Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018.
  • [24] D. Reis, M. Niemier, and X. S. Hu, “Computing in memory with fefets,” in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).   New York, NY, USA: ACM, 2018, pp. 24:1–24:6.
  • [25] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 445–456.
  • [26] X. Yin, M. Niemier, and X. S. Hu, “Design and benchmarking of ferroelectric fet based tcam,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2017, March 2017, pp. 1444–1449.
  • [27] M. Prezioso, I. Kataeva, F. Merrikh-Bayat, B. Hoskins, G. Adam, T. Sota, K. Likharev, and D. Strukov, “Modeling and implementation of firing-rate neuromorphic-network classifiers with bilayer Pt/Al2O3/TiO2-x/Pt Memristors,” in IEEE International Electron Devices Meeting (IEDM), Dec 2015, pp. 17.4.1–17.4.4.
  • [28] M. Jerry, P. Chen, J. Zhang, P. Sharma, K. Ni, S. Yu, and S. Datta, “Ferroelectric fet analog synapse for acceleration of deep neural network training,” in IEEE International Electron Devices Meeting (IEDM), Dec 2017, pp. 6.2.1–6.2.4.
  • [29] S. Paul and S. Bhunia, Computing with Memory for Energy-Efficient Robust Systems, 1st ed.   Springer New York, 2014.
  • [30] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “Intelligent ram (iram): chips that remember and compute,” in IEEE International Solid-State Circuits Conference, 1997, pp. 224–225.
  • [31] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 27–39, Jun. 2016.
  • [32] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 105–117, Jun. 2015.
  • [33] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “DESTINY: A Tool for Modeling Emerging 3D NVM and eDRAM Caches,” in IEEE/ACM Design, Automation & Test in Europe Conference & Exhibition (DATE), San Jose, CA, USA, 2015, pp. 1543–1546.
  • [34] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Buddy-ram: Improving the performance and efficiency of bulk bitwise operations using DRAM,” CoRR, vol. abs/1611.09988, 2016. [Online]. Available: http://arxiv.org/abs/1611.09988
  • [35] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2015, pp. 336–348.
  • [36] B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3d-stacked DRAM,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2015, pp. 131–143.
  • [37] J. Liu, H. Zhao, M. Ogleari, D. Li, and J. Zhao, “Processing-in-memory for energy-efficient neural network training : A heterogeneous approach,” in IEEE/ACM International Symposium on Microarchitecture, 2018, pp. 185–197.
  • [38] N. L. Binkert, B. M. Beckmann, G. Black, S. K. Reinhardt, A. G. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. S. B. Altaf, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
  • [39] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009, pp. 469–480.
  • [40] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout, “An evaluation of high-level mechanistic core models,” ACM Transactions on Architecture and Code Optimization (TACO), pp. 28:1–28:25, 2014.
  • [41] D. Sanchez and C. Kozyrakis, “Zsim: Fast and accurate microarchitectural simulation of thousand-core systems,” in IEEE/ACM International Symposium on Computer Architecture (ISCA).   New York, NY, USA: ACM, 2013, pp. 475–486.
  • [42] Synopsys Inc., “HSPICE Version J-2014.09-SP2,” 2014.
  • [43] G. Singh, L. Chelini, S. Corda, A. J. Awan, S. Stuijk, R. Jordans, H. Corporaal, and A.-J. Boonstra, “A review of near-memory computing architectures: Opportunities and challenges,” in Euromicro Conference on Digital System Design, 08 2018, pp. 1–10.
  • [44] C. Zhuo, K. Unda, Y. Shi, and W.-K. Shih, “From layout to system: Early stage power delivery and architecture co-exploration,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
  • [45] Z. Liu, S. Luo, X. Xu, Y. Shi, and C. Zhuo, “A multi-level-optimization framework for fpga-based cellular neural network implementation,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 14, no. 4, p. 47, 2018.
  • [46] C. Zhuo, S. Luo, H. Gan, J. Hu, and Z. Shi, “Noise-aware dvfs for efficient transitions on battery-powered iot devices,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
  • [47] S. Luo, C. Zhuo, and H. Gan, “Noise-aware dvfs transition sequence optimization for battery-powered iot devices,” in Proceedings of the 55th Annual Design Automation Conference.   ACM, 2018, p. 27.