Eva-CiM: A System-Level Energy Evaluation Framework for Computing-in-Memory Architectures

01/27/2019, by Di Gao et al.

Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory, thereby relieving the pressure caused by the memory wall. However, determining whether a given workload can really benefit from CiM, and which memory hierarchy level and device technology a CiM architecture should adopt, requires in-depth study that is not only time consuming but also demands significant expertise in architectures and compilers. This paper presents an energy evaluation framework, Eva-CiM, for systems based on CiM architectures. Eva-CiM encompasses a comprehensive multi-level (from device to architecture) tool chain that leverages existing modeling and simulation tools such as GEM5 [1], McPAT [2] and DESTINY [3]. To support high-confidence prediction, rapid design space exploration and ease of use, Eva-CiM introduces several novel modeling/analysis approaches, including models for capturing memory accesses and dependency-aware ISA traces, and for quantifying interactions between the host CPU and CiM modules. Eva-CiM can readily produce energy estimates of the entire system for a given program, a processor architecture, and the CiM array and technology specifications. Eva-CiM is validated against DESTINY [3] and [4], and enables findings on the practical contribution of CiM-supported accesses, CiM-sensitive benchmarking, and the pros and cons of increased memory size for CiM. Eva-CiM also enables exploration over different configurations and device technologies, showing 1.3-6.0X energy improvement for SRAM based CiM and 2.0-7.9X for FeFET-RAM based CiM.


1 Introduction

With the rapid growth of the Internet of Things (IoT), the era of "Big Data" is upon us and features massive data transfers between processor and memory [5, 6]. The efficiency of the conventional Von Neumann architecture is severely restricted by its limited bandwidth and increasingly complex interconnects, which result in significant energy and latency overhead for data movement. For instance, the energy spent on transferring 256 bits from main memory to the processor is estimated to be 200× higher than the energy for one floating point operation [7].

Researchers have long been aware of the inefficiency of conventional architectures for data movement [8, 9], and have spent significant effort on the integration of logic and memory [10, 11, 12, 13]. Such integration, often referred to as computing in memory (CiM) or processing in memory (PiM), is expected to leverage the computing resources inside memory to overcome the "memory wall" caused by the limited processor-memory bandwidth [14].

Recent works (e.g., [4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]) in both CMOS Static Random Access Memory (SRAM) and emerging non-volatile memories (NVMs) have demonstrated various CiM designs at different levels of the memory hierarchy. These designs allow computation to occur exactly where data resides, thereby reducing the energy and performance overheads associated with data movement. For example, the cache-based CiM in [22] achieves 2.4-9× energy savings on text processing scenarios. Meanwhile, NVM-based designs such as [4, 21, 24] can improve energy savings by up to two orders of magnitude when functioning as co-processors on neural network benchmarks.

While CiM is found to be a powerful and promising alternative with various design options, such variety also complicates the design process. Designers are confronted with several important questions when designing CiM:

  • How much can an application program benefit from a CiM based system?

  • At which level of the memory hierarchy should one place the CiM?

  • Which technology should be used for CiM?

There are some prior efforts attempting to address the above questions; however, they suffer from limitations in several aspects.

Overall system evaluation: Most CiM works (e.g., [24, 26, 27]) focus on the CiM module without considering the host CPU, or use an emulation platform consisting of a simple host CPU [4]. Interactions between the host, the CiM and the complete memory system can be rather complex, and their impact on the energy/performance of the overall system can be quite significant.

Offloading candidate identification: An offloading candidate here refers to a code snippet, a function or an instruction that is offloaded from the host CPU to the CiM module for execution. Most prior solutions provide neither instruction set architecture (ISA) nor compiler support to automatically determine offloading candidates. Designers have to either manually identify the code snippets for CiM from the entire benchmark (e.g., [19, 21, 22]), or select specific instruction groups for offloading to the CiM unit for execution (e.g., [4, 24]). In the latter two works, memory accesses triggered by a CiM module with custom instructions are identified at compile time: two operands fetched from the same level of memory are replaced by one CiM instruction. The method cannot be generalized to systems with multi-level caches as it assumes ideal locality and dependence.

Multi-level modeling: CiMs based on different devices, circuits and micro-architectures have been proposed. However, there is no uniform framework to compare design options at different levels. Though some existing work such as [24] has compared CiMs implemented with different technologies, the comparisons are hand-crafted and cannot be easily adapted to different memory hierarchies.

In this paper, we present an architectural evaluation framework, Eva-CiM, that overcomes the above limitations and is able to reliably predict the energy consumption of systems containing a CiM module. The major contributions of this work are summarized below.

  • We propose a novel trace-driven analysis method to extract data dependencies and locate offloading candidates. The method is built on an instruction dependency graph model augmented with memory access information. The analyzer is integrated into GEM5 and hence can readily work with different architectures, compilers and development options.

  • We leverage a comprehensive tool chain from device to architecture to build a multi-level CiM model. We employ GEM5 as the backbone of the framework to fully capture the effects of both the host CPU and the complete memory hierarchy. We further design and embed probe-based simulation inside GEM5 to collect the information necessary for offloading candidate selection at the application layer. We extend McPAT [2] with a CiM module characterized via SPICE [28] and DESTINY [3] to provide architecture-level energy profiling capability.

  • We employ Eva-CiM to investigate the three questions raised earlier. Unlike prior works, which typically assume ideal data locality and regular memory accesses, Eva-CiM finds operations that are offloadable to CiM under realistic architecture and compiler settings and thus avoids being overly optimistic. Furthermore, we use Eva-CiM to quantitatively evaluate the energy savings of CiM, due not only to the reduced memory accesses but also to the lower computational load on the host. Last but not least, we conduct design space explorations over different technologies, memory hierarchies and host architectures to illustrate the design options that maximize the CiM benefits for a set of benchmarks.

Eva-CiM enables us to make the following findings that either differ from or have not appeared in the conclusions of prior works: (i) In a general-purpose CiM based system with a complete memory hierarchy, the number of possible CiM-supported accesses is similar to that of regular accesses; (ii) Data-sensitive benchmarks are not necessarily CiM-sensitive, and the "friendliness" depends on both benchmark characteristics and the CiM system architecture; (iii) Energy-wise, a larger memory is not necessarily helpful for CiM due to the increased energy per CiM operation.

2 Background and Related Works

Below, we review the background on CiM and existing work on CiM modeling and performance/energy estimation.

2.1 Computing in Memory

To address the performance gap between processing and memory access, there have been significant efforts aiming at bringing computation closer to memory. Earlier works [10, 11, 12] focused on devising architectures that place processing cores alongside dynamic random-access memory (DRAM) modules. These architectures generally belong to the category of near-memory computing (NMC) [29]. However, practical concerns regarding the successful integration of DRAM and processing units into the same chip hindered the advancement of such NMC systems for many years. Recently, the advent of 3D-stacked memories that employ massive through-silicon vias (TSVs) provides larger bandwidth while allowing the integration of logic and memory into a stacked chip ([13, 15, 16, 20, 30]).

To bring processing and memory even closer, the concept of computing-in-memory (CiM), where processing is done in the memory array, has recently been gaining a lot of attention in both academia and industry. This growing interest is largely attributable to the needs of data-intensive IoT applications and to advances in circuit and device technologies. Many design alternatives exist for CiM, which vary in circuit style, supported operations, device technology, location in the memory hierarchy, application targets, etc. The most extreme design of CiM is to embed logic operations within each memory cell [31, 32, 33], which we refer to as fine-grained CiM. Another CiM design style is modifying the peripheral circuitry of the memory array (either SRAM or DRAM) to realize logic and arithmetic operations, which we refer to as coarse-grained CiM. For example, some works have proposed to modify the peripheral circuitry, e.g., the sense amplifiers (SAs), of caches to enable CiM [22, 23], while others accomplish CiM through supporting bulk bit-wise operations using the features of DRAM [34].

Progress in emerging non-volatile device technologies is further fueling the development of CiM. Specifically, non-volatile resistive RAMs (ReRAMs), phase-change memory (PCM), spin-transfer-torque magnetic RAMs (STT-MRAMs), and ferroelectric field effect transistor-based RAMs (FeFET-RAMs) offer high density, good scalability, and low power, making them natural candidates for realizing caches and CiM memory architectures. For instance, there have been a number of recent efforts investigating CiM-capable NVRAMs, employed as either cache or main memory, for various applications. References [4, 24, 35] study the use of NVM with a re-designed SA to perform a subset of logic and arithmetic operations. In [36, 37, 38], NVMs are used in content addressable memories (CAMs) to support parallel search while reducing data transmission in data-intensive IoT applications. Many recent works also employ NVM-based circuits for neural network acceleration by directly executing matrix-vector multiplication within the memory array, thereby saving the cost of data movements [21, 39, 40, 41].

Figure 1: Overview of the Eva-CiM framework: data flow, tool chain and architecture.

This paper focuses on systems that contain a host CPU and a CiM module. The CiM module can be either fine- or coarse-grained, and placed in any level of cache or scratch pad memory (SPM). Furthermore, the CiM module can be implemented in different technologies and circuit styles and support different instruction sets. Given an application, estimating the energy benefit of a CiM based system is an important task and the problem that we aim to address.

2.2 Related Works

Many system-level simulators, such as GEM5 [1], zsim [42] and Sniper [43], only cover architectural details for general-purpose processor simulation. On the other hand, some existing CiM efforts have attempted to compare different CiM design options by evaluating the energy/performance of the CiM modules or accelerators alone [4, 16, 24, 26, 27]. In all cases, the focus is on estimating energy savings due to (i) a lower number of memory accesses in CiM-enabled systems, and (ii) the inherently high internal bandwidth of the memory architecture. Though such comparisons are important for understanding the pros and cons of different CiM designs, they cannot predict the overall benefit offered by CiM based systems, since they consider neither how many instructions can actually be offloaded to the CiM module nor the effect such offloading has on the host CPU.

Recent works [4, 19] have tried to estimate the benefits of CiM to data-intensive applications by using custom CiM instructions in a CiM based system. The work in [4] extends the original ISA of the Intel Nios II processor [44] with custom CiM instructions. The memory module is assumed to be a small (1MB) SPM. At compile time, the memory accesses of given application benchmarks are categorized into (i) writes, (ii) non-convertible reads, and (iii) CiM-convertible reads, i.e., reads triggered by CiM instructions. From this memory access breakdown, the system-level evaluation assumes that every two convertible reads can be replaced by one CiM instruction. Although the approach provides good insight into the benefits of CiM for systems with a single-level non-cacheable memory, issues like memory hierarchy and data locality are not taken into consideration. Furthermore, the impact of CiM instructions on the host processor is not studied. Therefore, the method may not generalize to estimating the overall system-level CiM benefit for most real-world systems, which rely on multi-level caches.

As an alternative, the work in [19] implements a set of custom instructions in an x86-64 architecture. Different from [4], multi-level caches are considered in the evaluation. Furthermore, the work proposes taking data locality into consideration in order to determine whether it is worthwhile to offload potential CiM instructions to the memory unit. Note that the in-memory operations are invoked by atomic instructions specific to the HMC model integrated into the system. However, instead of examining the memory access breakdown to define the instructions offloaded to the CiM module, the method assumes that the system designer has enough knowledge about the application and can manually insert CiM-enabled macros into the appropriate code snippets. An obvious limitation of the approach is that it does not offer a systematic way to locate all the possible places where CiM-enabled macros could be inserted, which inevitably underestimates the benefits of CiM. Different from most current works, [22] explores CiM across three levels of an SRAM cache hierarchy and completes the control flow inside the cache in the absence of data locality. However, its limitation is the same as that of [19]: it requires customized benchmarks for data locality.

Our work here aims to address the need for a framework that helps designers predict how the choice of CiM design options affects the overall system (including both the host and the CiM module) energy for a given application. We leverage several existing memory and micro-architecture energy modeling tools. (Note that these modeling tools focus on either one particular layer of memory or on general microprocessors, and thus cannot be easily extended for system-level evaluation of CiM based systems.) Specifically, we use the DESTINY simulator [3] to estimate energy at the array level for the L1/L2 levels of cache, modifying it to support the particularities of specific CiM designs, e.g., customized sense amplifiers [22, 24] and memory cells [24]. DESTINY is an open-source, system-level tool devised for the simulation of 2D and 3D caches, as well as SPMs. The tool utilizes the 2D circuit-level modeling framework of NVSim [45] for SRAM and NVMs, and the 3D framework of CACTI-3DD [46]. In addition, McPAT [2], an integrated power, area, and timing modeling tool, is modified and used to evaluate different components (e.g., CiM, core, caches, NoC) at the architecture level.

3 Overview of Eva-CiM

Our proposed Eva-CiM framework adopts a combined simulation and analysis approach to accomplish energy estimation and design space exploration for an entire CiM based system. Besides leveraging several existing architecture and circuit simulators, Eva-CiM builds its own models at different design levels. Figure 1 depicts the overall flow, structure and tool chain of Eva-CiM, which consists of three stages: modeling, analysis and profiling. Eva-CiM takes as input (represented by the orange boxes) the binary of a given application program as well as the device and CiM array parameters for the CiM module, and outputs the overall system energy consumed by the program execution. The simulation and analysis tools used in Eva-CiM are shown as grey boxes. The current version of Eva-CiM supports two technologies, SRAM and FeFET-RAM, and more technologies can be readily added later.

Figure 2: GEM5 simulation and specialized probes to extract information.

The modeling stage in Eva-CiM aims to construct the models used in the analysis stage. Specifically, Eva-CiM uses two models: the application model and the device/CiM array model. The application model captures when and where instructions are executed and memory accesses occur. This information is used in the analysis stage to determine offloading candidates. At the application level, Eva-CiM can take any binary compiled by GEM5-compatible general-purpose or customized compilers [1]. The benchmark binaries are fed to a modified GEM5 and go through fetching, decoding and commit with specialized probes that extract the pipeline and memory access information, as illustrated in Figure 2 (details in Section 5.1).

The device/CiM array model describes the energy consumed by individual CiM operations, such as CiM-OR and CiM-ADD (details in Section 5.2). This can be achieved by SPICE simulation [28] if the netlist is available; alternatively, users can first break down an atomic CiM operation into its micro-operations (cell access, amplification, etc.), and then use DESTINY [3] with pre-calibrated energy data to compute the energy of the atomic operation. Note that, unlike the application-level simulations, the device/CiM array level simulation is conducted once per technology to extract the models.
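To make the composition concrete, the sketch below (Python, with purely illustrative micro-op names and energy numbers, not Eva-CiM's calibrated data) shows how the energy of an atomic CiM operation can be assembled from pre-calibrated micro-operation energies:

# Hypothetical pre-calibrated micro-op energies (pJ) for one technology.
MICRO_OP_ENERGY_PJ = {
    "cell_access":   1.8,   # read the operand rows from the array
    "sense_amplify": 0.9,   # customized SA evaluation
    "logic_stage":   0.4,   # extra in-SA logic (e.g., carry chain)
    "writeback":     2.1,   # optional result write to the array
}

# Hypothetical decomposition of atomic CiM ops into micro-ops.
CIM_OP_DECOMPOSITION = {
    "CiM-OR":  ["cell_access", "sense_amplify"],
    "CiM-ADD": ["cell_access", "sense_amplify", "logic_stage", "writeback"],
}

def cim_op_energy_pj(op: str) -> float:
    """Energy of one atomic CiM operation as the sum of its micro-ops."""
    return sum(MICRO_OP_ENERGY_PJ[u] for u in CIM_OP_DECOMPOSITION[op])

if __name__ == "__main__":
    for op in CIM_OP_DECOMPOSITION:
        print(f"{op}: {cim_op_energy_pj(op):.1f} pJ")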

The analysis stage investigates data dependence and locality and decides the offloading candidates (details in Section 4); it is the cornerstone of Eva-CiM. Several new ideas are introduced here, e.g., the instruction dependency graph used to organize and interpret the instruction execution and memory access information obtained from the modeling stage. The key analyses conducted by Eva-CiM include: (i) Committed instruction queue and dependency analysis to automatically detect the underlying offloading patterns; (ii) Memory access and request packet content analysis to determine the particular cache level for CiM; (iii) Instruction trace reshaping to enable system-level profiling.

The profiling stage estimates energy consumption based on the results of the analysis stage. For profiling, Eva-CiM employs a modified McPAT [2] and uses the reshaped instruction queue statistics to analytically compute the energy overheads of the different components in the system (details in Section 5.3).

The tool chain of Eva-CiM shown in Figure 1 leverages four existing tools. HSPICE [28] and DESTINY [3] are used for memory cell and array modeling. GEM5 [1] is modified to include specialized probes to model applications, while McPAT [2] is enhanced to provide CiM system profiling.

Eva-CiM is not limited to a particular technology, architecture, development environment or compiler. In other words, it provides a unified and easy-to-use interface for designers to evaluate the pros and cons of CiM based systems. Hence, Eva-CiM can support design space exploration (details in Section 6) including (but not limited to):

  • Study of the ratio of CiM-supported instructions over regular memory accesses to decide if an application is CiM-friendly or not;

  • Comparison of various device technologies;

  • Determination of the best memory hierarchy level for the CiM module given the applications of concern.

In the following sections, we detail the design of Eva-CiM, first discussing the analysis stage as it is the core of Eva-CiM, and then the modeling and profiling stages. We finally present interesting findings obtained by applying Eva-CiM, showing both the capability of Eva-CiM and the necessity of such a framework for CiM based systems.

4 Analysis

A key task in evaluating the benefit of including a CiM module in the overall system is determining what can be offloaded to the CiM module. As discussed in Section 2, most prior works make offloading decisions either by manual analysis or by limiting the system to a specialized simple architecture. Clearly this requires non-trivial effort for design space exploration, especially when considering complex programming and architecture development environments. To address this challenge, as discussed in Section 3, Eva-CiM presents a unified interface and development environment that hides the aforementioned complexities inside the framework, thereby allowing convenient and efficient design exploration. Specifically, Eva-CiM embeds a trace-driven analyzer in GEM5 [1] that analyzes the committed instructions and then identifies the proper candidates as well as data locality for CiM. In other words, programmers can rely on the framework without manually identifying the critical functions in the code for CiM or dealing with complex development environments. In order to enable the proposed trace-driven analyzer, we need to answer the following key questions:

  • What instruction patterns can be offloaded to memory to maximize the benefit of CiM?

  • How to analyze the program dependencies to identify and select the proper patterns?

  • How to reshape the instruction queue after offloading to facilitate system level profiling?

The subsections below answer these three questions and present the overall approach of the proposed analyzer.

4.1 Offloading Candidate Selection

We first examine what instruction patterns are suitable for CiM and can be offloaded to the CiM module. In general, an instruction that is suitable for CiM has source operands fetched from memory and a destination operand stored back to memory. One common pattern that prior works [4, 24] rely on is a sequence of Load-Load-OP-Store instructions, as shown on the left of Figure 3, in which two load operations obtain the source operands, one "OP" instruction ("add" in the figure) conducts a particular operation, and one store operation saves the result. This sequence can then be replaced by a CiM instruction, e.g., an in-cache operation as on the right of Figure 3 [4].

Figure 3: An example of Load-Load-OP-Store pattern.

However, due to compiler optimizations and the use of intermediate resources (e.g., integer and floating-point registers), such an exact Load-Load-OP-Store pattern rarely occurs during instruction execution. Instead, the Load-Load-OP-Store pattern appears as multiple variants, as shown in Figure 4(b),(c), which are different but all suitable for CiM. Unlike the regular pattern in Figure 4(a), Figure 4(b) replaces one source operand with an immediate value, while Figure 4(c) continues using the output before it is stored back to memory. Moreover, it is not uncommon for two or more such patterns to combine into a larger CiM-suitable pattern.

In order to capture the complex dependencies among instructions and help identify CiM-suitable patterns, we resort to a graph model, called the Instruction Dependency Graph (IDG). In an IDG, a "node" is an instruction and a directed "edge" indicates the execution order of two instructions with a data dependency. Figure 4 shows three example IDGs. If one straightforwardly built an IDG for all the instructions being fetched, the IDG would be overwhelmingly complicated and contain a lot of redundant information. In Section 4.2, we present an approach to construct a more manageable IDG for a given program.

Besides the instruction execution patterns captured in an IDG, memory access information is also crucial for offloading candidate selection. For example, the operands of a candidate CiM operation should come from the same memory bank. Thus, for a leaf-node instruction, we need to check whether its request address falls within the access address range of the memory objects at a given level and then obtain the corresponding Miss-Status Handling Register (MSHR) state [1]. We repeat this procedure until we find the memory hierarchy level that stores the data. Depending on the operations that the CiM unit supports, one or multiple sub-trees can be identified as offloading candidates from one IDG tree. Figure 5 presents a simple example of the procedure for selecting offloading candidates, where the IDG tree contains three sub-trees that are identified as proper offloading patterns for CiM.
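The following Python sketch illustrates this level lookup with hypothetical data structures (the level names, address ranges and hit records are placeholders; Eva-CiM obtains the equivalent information from its GEM5 probes):

# Per-level records as an access probe might report them: address range
# of the accessed memory object plus hit/miss status per request address.
HIERARCHY = ["L1D", "L2", "DRAM"]
access_log = {
    "L1D":  {"range": (0x1000, 0x2000), "hits": {0x1040: False}},
    "L2":   {"range": (0x0000, 0x8000), "hits": {0x1040: True}},
    "DRAM": {"range": (0x0000, 0x2000_0000), "hits": {}},
}

def data_level(addr: int) -> str:
    """Return the first level whose object covers addr and records a hit."""
    for level in HIERARCHY:
        lo, hi = access_log[level]["range"]
        if lo <= addr < hi and access_log[level]["hits"].get(addr, False):
            return level
    return "DRAM"   # main memory always backs the data

print(data_level(0x1040))  # -> "L2": both operands must reside here for CiM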

Figure 4: IDGs for the original Load-Load-OP-Store pattern and its variants.
Figure 5: An example of extracting offloading candidates from the committed instruction queue: (a) instruction snippet, (b) the corresponding IDG and its partition, (c) the resulting CiM offloading candidates where each triangle represents one instruction and the number inside the triangle represents its sequence index in the queue.

Procedure: Offloading candidate selection
Input: I-state for all instructions
Output: CiM operations
1. Build the register usage table (RUT) and the index hash table (IHT);
2. Build IDG trees for the committed instruction queue (CIQ);
3. Partition the IDG trees in terms of CiM-supported instructions, and extract the groups that conform to the offloading patterns;

Algorithm 1 Algorithm for offloading pattern selection.

To support IDG construction and data locality identification, we collect the set of data given in Table 1 for all the instructions in the committed instruction queue (CIQ), as only committed instructions matter for program execution. We refer to these data as the instruction state (I-state), which can be collected from both the CPU and memory as shown in Figure 2 (more details in Section 5). The first three terms in the I-state describe when and where an instruction is committed and executed, while the last three terms detail the memory level as well as the execution status of a memory access/request instruction. Algorithm 1 summarizes the high-level process for selecting offloading candidates once the I-state information is ready. Details about the construction of the various tables and the IDG are given in the next subsection.

  • Sequence index: location of the instruction in the committed instruction queue (CIQ).
  • Mnemonic code: assembly code for each instruction.
  • Execution logic: the functional unit triggered to execute the instruction.
  • Request from master: request address range of a load instruction and its issue time.
  • Memory access: address range of the accessed memory objects (caches and main memory).
  • Response from slave: hit/miss status of each memory access.
Table 1: I-state specification.
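For concreteness, the I-state of Table 1 can be represented per committed instruction roughly as follows (a Python sketch; the field types are our assumptions, not Eva-CiM's internal format):

from dataclasses import dataclass, field
from typing import Optional, Tuple, List

@dataclass
class IState:
    seq_index: int                 # position in the committed queue (CIQ)
    mnemonic: str                  # assembly code, e.g. "add r3, r1, r2"
    exec_logic: str                # functional unit that executed it
    request: Optional[Tuple[int, int, int]] = None
    #                              # (addr_lo, addr_hi, issue_tick) of a load
    mem_objects: List[Tuple[str, int, int]] = field(default_factory=list)
    #                              # accessed objects: (name, addr_lo, addr_hi)
    hit_miss: Optional[bool] = None  # response from slave: hit (True) / miss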
Figure 6: Procedure for IDG tree construction: (a) instruction queue; (b) RUT and IHT; (c) IDG tree.

Procedure: IDG_tree_construction
Input: committed instruction queue CIQ, CiM-supported instruction set S, RUT, IHT
Output: IDG tree set Trees
1. for each instruction inst in CIQ do
2.   if the operation type of inst is in S then
3.     initialize a tree T with inst as the root node;
4.     AddChildren(inst);
5.     append T to Trees;
6.   end if
7. end for
8. return Trees
9. SubProcedure: AddChildren(node)
10. for each source register reg of node do
11.   look up (reg, loc) in IHT by the sequence index of node;
12.   look up RUT with [reg, loc] to obtain child, the last instruction that wrote reg;
13.   add child to the tree as a child node of node;
14.   if the operation type of child is Load then
15.     mark child as a leaf node;
16.   else
17.     AddChildren(child);
18.   end if
19. end for

Algorithm 2 Algorithm for IDG tree construction.

4.2 IDG Construction

Here we present a method to reduce the effort and complexity of constructing the IDG for a given program. Note that, with the "store" nodes in Figure 4 removed, an IDG simply consists of many flipped trees. Thus, we introduce a compact tree structure with the following restrictions to reduce the redundancy in the IDG:

  • With the "store" node removed, the "OP" instruction is the root of the tree and must be an operation that CiM supports.

  • The left and right children of a node in the tree represent the instructions that feed source data to the node.

  • The leaf node needs to be either a load instruction or an immediate value.

  • An offloading candidate can include one or more connected nodes in the same tree.

  • The data of an offloading candidate need to be in the same memory bank.

Figure 6 demonstrates the procedure for tree construction. The instruction queue on the left of Figure 6 lists the instructions as well as their indices in the CIQ. (Note that these are elements of the I-state as defined in Table 1.) In order to avoid the complexity of a recursive search during IDG tree construction, we introduce the concept of a Register Usage Table (RUT), as shown in the middle of Figure 6. The RUT keeps track of the commit time (i.e., the sequence index defined in Table 1) at which a register is used as the destination operand. This exploits the fact that two connected nodes in an IDG tree must share at least one register. Each row in the RUT corresponds to one register and maintains a list of sequence indices of the instructions that use the register. An auxiliary index hash table (IHT) is also used to keep track of the source operand information for an instruction, with each entry corresponding to an instruction in the CIQ. The IHT records the registers used as source operands by an instruction and the corresponding locations of those registers in the RUT at the time the instruction's information is added to the RUT.

When a CiM-supported instruction is added as a node to an IDG tree, we can use its sequence index and the IHT to find its source registers. Then, with the RUT, we can locate the instructions that commit the last use of those registers as destinations, which are exactly the child nodes to be added to the tree. Algorithm 2 summarizes the complete algorithm for tree construction. By repeating this procedure, we can build the trees of the IDG with O(N) complexity, where N is the number of nodes in the trees. As shown on the right of Figure 6, each node in the tree contains the information of the operator, the operands, and its sequence index.

For the example in Figure 6, when the instruction indexed at 3268 is added to the tree, we first find its source registers through the IHT, which also tells us the locations at which those registers appear in the RUT when the instruction is committed. Then, in the RUT, the recorded entry in each register's list is exactly the last instruction that used that register as the destination. In other words, the instruction indexed at 3266 is the left child node to be added to the tree. The same procedure is repeated for the right child. Since the two nodes happen to be "LOAD" operations, the tree terminates at those two leaf nodes, as shown on the right of Figure 6.
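The following self-contained Python sketch reproduces this construction on a toy instruction stream (the register names, sequence indices and CiM-supported op set are illustrative assumptions):

from collections import defaultdict

CIM_OPS = {"add", "and", "orr", "eor"}       # ops the CiM unit supports

# Committed-instruction queue: (seq_index, op, dst, [src operands]).
CIQ = [
    (3265, "ldr", "r1", ["mem"]),
    (3266, "ldr", "r2", ["mem"]),
    (3267, "ldr", "r3", ["mem"]),
    (3268, "add", "r4", ["r2", "r3"]),
    (3269, "and", "r5", ["r4", "r1"]),
]

# RUT: per register, the ordered list of seq indices that wrote it.
# IHT: per instruction, (src register, RUT position of its last writer).
rut = defaultdict(list)
iht = {}
for seq, op, dst, srcs in CIQ:
    iht[seq] = [(r, len(rut[r]) - 1) for r in srcs if r.startswith("r")]
    rut[dst].append(seq)

by_seq = {inst[0]: inst for inst in CIQ}

def build_tree(seq):
    """Expand one node; children are the last writers of its sources."""
    _, op, _, _ = by_seq[seq]
    node = {"seq": seq, "op": op, "children": []}
    for reg, loc in iht[seq]:
        child_seq = rut[reg][loc]            # last write before this commit
        if by_seq[child_seq][1] == "ldr":    # loads terminate as leaf nodes
            node["children"].append({"seq": child_seq, "op": "ldr",
                                     "children": []})
        else:
            node["children"].append(build_tree(child_seq))
    return node

trees = [build_tree(seq) for seq, op, _, _ in CIQ if op in CIM_OPS]
print(trees[0])   # the IDG tree rooted at the "add" indexed 3268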

4.3 Trace Reshaping for System Profiling

After the offloading candidates are determined, the last task of the analysis stage is to reshape the instruction trace to meet the demands of the profiling stage (discussed in detail in Section 5.3). The instruction trace reflects the actual execution flow of a program. First, we need to reallocate the execution of the selected instructions to the level of memory where the source data reside. Second, we need to remove the selected offloading instructions from the pipeline, re-organize data locality in the memory, and replace them with the corresponding CiM instructions. The reshaped trace then contains both regular and CiM-supported operations. Because the functional units on the CPU execute fewer instructions than in a non-CiM design, reshaping the instruction trace yields a more accurate estimate of the overall system energy.

The remaining issue for reshaping is managing data locality and dependencies. Note that only when all the operands are available in the same cache level can we issue the operation to the cache sub-array. Otherwise, we need to write the operand at the higher-level cache back to the lower-level cache and forward the other operand to the same level [22]. Figure 5(c) shows an example of a data dependency in which the output of one tree is the input to another. Such dependencies could be readily handled with support from a CiM-centric compiler. In Eva-CiM, which uses a regular compiler, we introduce a post-processing step to approximately mimic the CiM behavior. Eva-CiM first traverses all the trees in post order to ensure the right execution sequence. Then, if two sub-trees are extracted from the same IDG tree, Eva-CiM combines them into one in-cache operation to move data and manages data locality within the bank.
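A minimal sketch of this post-processing step, reusing the tree layout of the build_tree() sketch above (all inputs are toy data), is:

def post_order(node):
    """Yield nodes in post order so producers precede their consumers."""
    for c in node["children"]:
        yield from post_order(c)
    yield node

def reshape(candidates):
    """candidates: (idg_tree_id, subtree) pairs in program order."""
    trace, last_tree = [], None
    for tree_id, sub in candidates:
        ops = [n["seq"] for n in post_order(sub) if n["op"] != "ldr"]
        if tree_id == last_tree and trace:
            trace[-1]["ops"] += ops        # same IDG tree: fuse in cache
        else:
            trace.append({"tree": tree_id, "ops": ops})
        last_tree = tree_id
    return trace    # CiM instructions replacing the offloaded host ops

demo = [(0, {"seq": 7, "op": "add", "children": []}),
        (0, {"seq": 9, "op": "and", "children": []})]
print(reshape(demo))   # -> [{'tree': 0, 'ops': [7, 9]}] (fused in cache)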

5 Modeling and Profiling

In this section, we present the details of the modeling and profiling stages. The modeling stage provides the instruction execution and memory access information for a given program. It also provides the CiM model data (referred to as CiM module modeling) to the profiling stage. The profiling stage then uses the output of the analysis stage together with the CiM module data to obtain the overall system energy consumption.

5.1 Application Modeling

As stated earlier, application modeling aims to extract information about when and where instructions are executed and memory accesses occur. More precisely, application modeling produces the I-state information (see Table 1) needed by the analysis stage. We propose to leverage GEM5 [1] augmented with carefully placed probes to obtain the I-state information. Specifically, we introduce four probes in GEM5. Table 2 summarizes these probes as well as the objects they monitor. InstProbe and PipeProbe monitor the execution status and triggered functions in the CPU (i.e., the first three elements of the I-state), while RequestProbe and AccessProbe monitor memory behaviors (i.e., the last three elements of the I-state). Below we discuss these two sets of probes in more detail.

  • InstProbe: time and execution, in terms of pipeline status, for each instruction.
  • PipeProbe: statistics of the functional units triggered to complete one instruction in the CPU.
  • RequestProbe: track of each request packet transmitted from the LSQ, including its issue time and address.
  • AccessProbe: record of each memory access, including time, accessed object, and hit/miss status.
Table 2: Probes attached to the CPU and memory.

InstProbe collects the time and execution of each instruction in terms of pipeline status, and PipeProbe collects which functional units are triggered by each instruction and when. There are two complications when collecting these data. First, when execution resources are available, multiple instructions are issued from the Issue Queue (IQ) to several functional units. Second, because of branch mis-prediction, only committed instructions are included in the CIQ used for our offloading candidate analysis. Thus, these probes must be carefully placed to ensure the correct information is collected.

To illustrate how these probes can be placed, we use the example of the ARM ISA on a physical-register-file architecture with an out-of-order pipeline. Seven pipeline stages are executed in this architecture, as shown in Figure 7. For each committed instruction, the InstProbe records the tick numbers of the different pipeline stages according to the Program Counter (PC) value. Meanwhile, the PipeProbe keeps track of the instruction index in the CIQ as well as the statistics of the triggered functional units (e.g., IQ reads/writes, ROB reads/writes). The information collected by the two probes is processed to extract the sequence index, assembly code and execution logic included in the I-state. We then utilize the I-state to obtain the lifetime of an instruction and evaluate the system overhead when an instruction is moved from the CPU to a CiM module.

Figure 7: InstProbe and PipeProbe attached to an Out-of-Order CPU model.

For RequestProbe and AccessProbe, Figure 8 describes where they are inserted as well as the information they collect. Note that the range of accessible addresses varies with the memory hierarchy level. Thus, a RequestProbe probes not only the tick of instruction execution and its master port, but also the address range of the "Load" instruction. Similarly, an AccessProbe collects tick information, the master port, the hit/miss statistics for an address range, and the Miss-Status Handling Register (MSHR) status.

The two probes can effectively capture the packets between the LSQ units and the memory objects, so we can accurately obtain the access instruction and its request address. Once a packet is transported to the memory, we can track the packets among the different levels of the memory hierarchy using the response statistics and the cache protocol. Note that the probed information depends on the application and the architecture, but is independent of the memory technology.
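Conceptually, each probe behaves like an observer that the simulator notifies at commit/access events; the sketch below illustrates the idea with an AccessProbe-like recorder (this is an illustration, not GEM5's actual probe API):

class AccessProbe:
    """Records one entry per memory access, mirroring Table 2."""
    def __init__(self):
        self.records = []

    def notify(self, tick, master, addr_range, hit, mshr_state):
        # called by the (hypothetical) simulator hook on every access
        self.records.append({"tick": tick, "master": master,
                             "range": addr_range, "hit": hit,
                             "mshr": mshr_state})

probe = AccessProbe()
probe.notify(tick=120433, master="cpu0.dcache_port",
             addr_range=(0x1040, 0x1048), hit=True, mshr_state="ready")
print(probe.records[-1])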

Figure 8: RequestProbe and AccessProbe for request packets monitoring.
Technology | Level (config)   | Non-CiM read | CiM read | Comp-OR | Comp-AND | Comp-XOR | Comp-ADDW32
SRAM       | L1 (4-way/64kB)  | 61           | 68       | 71      | 72       | 79       | 79
SRAM       | L2 (8-way/256kB) | 314          | 333      | 341     | 344      | 365      | 365
FeFET      | L1 (4-way/64kB)  | 34           | 34       | 35      | 88       | 105      | 105
FeFET      | L2 (8-way/256kB) | 70           | 70       | 72      | 146      | 205      | 205
Table 3: Cache energy (pJ) per operation in different configurations for SRAM- and FeFET-based CiM architectures.

5.2 CiM Module Modeling

Besides the aforementioned application-related behavior, the system-level benefits offered by CiM depend on the CiM construction and technology. A CiM module typically consists of a memory array and additional circuitry, often at the sense amplifier (SA) level, responsible for generating outputs that correspond to selected logical/arithmetic operations. SRAM-based caches that can perform bitwise AND, NOR, and XOR operations, among other computations, are proposed in [22]. Alternatively, emerging NVMs have attractive features such as high density, low leakage power, low dynamic energy and fast access times, making them good candidates for the design of CiM main memories or caches. As pointed out in Section 2, STT-RAM, ReRAM, and FeFET-RAM are among the alternatives studied for the design of CiM architectures. Several CiM architectures proposed for NVMs also make use of a customized SA, in a similar way to the SRAM-based CiM approach [4, 24, 35]. Among the CiM architectures devised for NVMs, the FeFET-based one is probably the most suitable for cache implementations due to its low write energy and latency, as reported in [24]. For this reason, we pick SRAM-based and FeFET-based CiMs as case studies for the Eva-CiM framework, presented in Section 6.

As part of the Eva-CiM simulation flow, we rely on the CMOS and FeFET SPICE models from [47, 48] to evaluate the delay and energy of individual 6T-SRAM and 2T+1FeFET memory cells, as well as of the customized SAs proposed in [22, 24]. To ensure a fair comparison between the two designs, we (i) adopt the same 45nm technology node in both designs, and (ii) port the full-adder part of the SA described in [24] to the SRAM-based CiM [22]. Thus, the SRAM-based and FeFET-based CiMs can perform similar operations. We then feed the SPICE-level results into a version of DESTINY [3] that has been modified to support the evaluation of FeFET-based memories [24]. DESTINY is a memory simulator that can provide cache energy per cache block. Figure 9 illustrates this CiM module evaluation flow. Table 3 reports the energy per operation (non-CiM read, CiM read, AND, ADD, etc.) for different cache configurations, obtained with the proposed models for both SRAM and FeFET-RAM.

Figure 9: Flow for CiM module modeling.
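To illustrate how the profiling stage might consume Table 3, the sketch below encodes the table (values in pJ) and applies a deliberately simplified accounting, two non-CiM operand reads versus one in-array operation; Eva-CiM's actual profiler also folds in the host-side costs discussed in Section 5.3:

ENERGY_PJ = {                     # (tech, level) -> per-op energies (Table 3)
    ("SRAM",  "L1"): {"read": 61,  "cim_read": 68,  "or": 71,  "and": 72,
                      "xor": 79,  "add32": 79},
    ("SRAM",  "L2"): {"read": 314, "cim_read": 333, "or": 341, "and": 344,
                      "xor": 365, "add32": 365},
    ("FeFET", "L1"): {"read": 34,  "cim_read": 34,  "or": 35,  "and": 88,
                      "xor": 105, "add32": 105},
    ("FeFET", "L2"): {"read": 70,  "cim_read": 70,  "or": 72,  "and": 146,
                      "xor": 205, "add32": 205},
}

def cache_saving(tech, level, op="add32"):
    """Cache-side saving: two operand reads replaced by one CiM op."""
    e = ENERGY_PJ[(tech, level)]
    return 2 * e["read"] - e[op]

for key in ENERGY_PJ:
    print(key, f"{cache_saving(*key):+d} pJ per offloaded ADD")

Note that under this cache-only accounting the saving can even be negative (e.g., for FeFET ADDs), which is consistent with the observation in Section 6.3 that much of the total saving comes from the host side rather than the CiM module itself.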

5.3 Profiling

Given the models and the analyzer from the previous sections, we still need a system-wide profiler to combine the models at the different design levels and report the overall system energy profile. Instead of building an energy model from scratch, we modify McPAT [2] to evaluate the energy and area of both the CiM module and the other functional units in a processor. Figure 10 shows the structure of our system-level profiler, which relies on the application model, the CiM model, the architecture parameters, and a modified McPAT. The original McPAT [2] only computes energy and area for regular functional units using performance counter information (a set of statistics) extracted from an architectural simulator, GEM5 in our work. To support the new CiM instructions, we cannot directly use the regular cache access energy model in McPAT; instead, we employ the CiM model for CiM operations, as discussed in the previous subsection. Moreover, since some instructions are moved to the CiM module, we also need to re-evaluate the energy of the host CPU, which now executes fewer instructions.

We therefore modify and embed the following performance counters and models in McPAT:

  • Instruction types in the pipeline and their counts;

  • Access times of the functional units in the pipeline;

  • Counts of cache/DRAM reads/writes and hits/misses;

  • CiM operation types and their counts.

Additional performance counters are added for the CiM operations to ensure a unified energy model in the profiler. We can then safely invoke McPAT with the modified performance counters and memory array parameters to estimate the energy consumption of the entire system.
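In essence, the modified accounting reduces to a dot product of the (modified) performance counters with per-event energies, as in the sketch below (counter names and values are placeholders, not measured data):

counters = {"int_alu_ops": 1.2e6, "l1_reads": 4.0e5, "l1_writes": 1.5e5,
            "cim_add32_l1": 6.0e4}          # reshaped-trace statistics
energy_pj = {"int_alu_ops": 1.1, "l1_reads": 61.0, "l1_writes": 65.0,
             "cim_add32_l1": 79.0}          # per-event energies (pJ)

total_pj = sum(counters[k] * energy_pj[k] for k in counters)
print(f"system energy ~= {total_pj / 1e6:.2f} uJ")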

Figure 10: Architecture for CiM-enabled system profiler.
  • Machine learning: naive Bayes (NB), decision tree (DT), support vector machine (SVM), linear regression (LiR), K-means (KM)
  • String processing: longest common subsequence (LCS)
  • Multimedia: MPEG-2 decode (M2D)
  • Graph processing: breadth-first search (BFS), depth-first search (DFS), betweenness centrality (BC), single-source shortest path (SSSP), connected component (CCOMP), PageRank (PRANK)
  • SPEC 2006: Astar, H264ref, Hmmer, Mcf
Table 4: Benchmark applications.

6 Design Exploration

This section describes experiments for not only validating Eva-CiM, but also exploring CiM based designs with different technologies across various benchmarks. Note that our goal is not to uncover the benefits of CiM, which have already been shown in prior works. Instead, we aim to investigate the pros and cons from a system perspective regarding energy consumption, and to obtain insights on design tradeoffs for CiM based systems.

All experiments are based on an ARM Cortex A9 out-of-order core with a 2.0GHz clock and a 512MB main memory with a 120-cycle access latency. Its L1/L2 caches are configured with 5/11-cycle access latencies and directory/MESI based coherence, but with different sizes across experiments. We employ 17 benchmarks drawn from a wide range of applications based on prior works [4, 17, 21, 22, 25, 35], as summarized in Table 4.
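The sketch below approximates this platform as a gem5 classic SE-mode configuration script (learning-gem5 style; class and port names drift across gem5 versions, and the MSHR counts and benchmark path are our assumptions):

import m5
from m5.objects import *

class L1Cache(Cache):
    size, assoc = "64kB", 4                             # 4-way/64kB L1
    tag_latency = data_latency = response_latency = 5   # ~5-cycle L1
    mshrs, tgts_per_mshr = 4, 20                        # assumed values

class L2Cache(Cache):
    size, assoc = "256kB", 8                            # 8-way/256kB L2
    tag_latency = data_latency = response_latency = 11  # ~11-cycle L2
    mshrs, tgts_per_mshr = 20, 12                       # assumed values

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz",
                                   voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

system.cpu = DerivO3CPU()                   # out-of-order core (ARM build)
system.cpu.icache, system.cpu.dcache = L1Cache(), L1Cache()
system.cpu.icache.cpu_side = system.cpu.icache_port
system.cpu.dcache.cpu_side = system.cpu.dcache_port

system.l2bus = L2XBar()
system.cpu.icache.mem_side = system.l2bus.cpu_side_ports
system.cpu.dcache.mem_side = system.l2bus.cpu_side_ports
system.l2cache = L2Cache()
system.l2cache.cpu_side = system.l2bus.mem_side_ports

system.membus = SystemXBar()
system.l2cache.mem_side = system.membus.cpu_side_ports
system.cpu.createInterruptController()
system.system_port = system.membus.cpu_side_ports

system.mem_ctrl = SimpleMemory(range=system.mem_ranges[0],
                               latency="60ns")   # ~120 cycles at 2GHz
system.mem_ctrl.port = system.membus.mem_side_ports

binary = "benchmarks/lcs"                   # hypothetical benchmark path
system.workload = SEWorkload.init_compatible(binary)
system.cpu.workload = Process(cmd=[binary])
system.cpu.createThreads()

root = Root(full_system=False, system=system)
m5.instantiate()
print("exited:", m5.simulate().getCause())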

Model        | CiM energy (nJ) | Non-CiM energy (nJ)
DESTINY [3]  | 455.49          | 124.43
Eva-CiM      | 565.18          | 154.40
Deviation    | 24.0%           | 24.0%
Table 5: Energy model validation.
Figure 11: Comparisons on the CiM-supported memory accesses between Eva-CiM and [4].
Figure 12: Memory access breakdown: CiM-supported accesses vs. regular memory accesses.
Figure 13: Energy improvement comparison for CiM-based vs. non-CiM systems: total energy improvement (top); improvement breakdown (bottom).
Figure 14: Energy improvements for CiM with different cache configurations.
Figure 15: Energy improvement comparison among CiM supported by L1 only, by L2 only, and by both L1 and L2.
Figure 16: Energy improvements for different device technologies: CMOS SRAM vs. FeFET-RAM.

6.1 Model Validation

We validate Eva-CiM by comparing its results with those of representative prior works. Note that the results of CiM depend not only on the benchmarks, but also on the compiler and architecture, and even on the inputs to the benchmarks, as should realistically be the case. However, since most existing work does not consider the overall system energy, it is actually not easy to find a fair reference for validation.

Instead, we compare the two major parts of Eva-CiM: energy estimation against DESTINY [3] and CiM operation count against [4], using one application program, LCS. For energy estimation, as shown in Table 5, for a trace with around 3000 instructions there is an approximately 24% energy difference between Eva-CiM and DESTINY. This is partly due to the additional overhead of instruction queue reshaping, which is not accounted for in DESTINY. For the CiM instruction count comparison, since [4] uses an emulation platform with a simplified in-order processor and a 1MB SPM, we modify the evaluation architecture accordingly, with a cache size of 1MB, to mimic the behavior of [4]. We break down memory accesses using a similar approach to [4]. Because memory accesses also vary with the application inputs, we execute the LCS code 20 times with randomly generated inputs. The results are shown in the histogram on the right of Figure 11. Eva-CiM selects around 65% of memory accesses for offloading to CiM while [4] reports 58%, a 12% relative deviation. This discrepancy is mainly due to the differences between the two underlying ISAs and the higher complexity of a cache compared to an SPM. The relative closeness between Eva-CiM's results and those obtained by other published methods gives us confidence in the effectiveness of Eva-CiM.

6.2 Memory Access Breakdown

In many prior works with non-cacheable memory, a significant portion of memory accesses are considered to have good data locality and can be converted to CiM operations. However, other than [4, 24], which use a simplified in-order core, very few works provide a detailed breakdown of memory accesses, especially when the CiM module functions as a general-purpose computing block. Due to the system complexity, the complete memory hierarchy, and the lack of CiM-centric compiler support, data locality may be less ideal than what has been observed in prior work. We have conducted experiments on the given application programs to investigate the percentage of instructions that have data locality for the given system and development environment.

Figure 12 presents the breakdown of cache accesses as the ratio of CiM-supported accesses (i.e., those with good locality that can be replaced by CiM operations) over regular accesses. The results from Eva-CiM show that some ratios are smaller than one even for benchmarks that are considered data-sensitive, e.g., M2D, on the given evaluation architecture. Thus, for those benchmarks, Eva-CiM inevitably reports relatively lower energy savings on this architecture. In other words, we may need to rely on CiM-sensitive instead of data-sensitive applications when designing a CiM based system.

6.3 System Level Energy Benefits

In this subsection, we evaluate the total energy, including both the host CPU and the cache, for the aforementioned application benchmarks, and report the energy improvements of a CiM based system vs. a non-CiM system. Here we use conventional SRAM as in [22] for the CiM implementation, in which all levels of the cache hierarchy are capable of conducting CiM operations. The top sub-figure of Figure 13 shows the total energy improvements, ranging from 1.3-6.0× across applications, which are contributed by both the cache and the host CPU. The bottom sub-figure of Figure 13 further breaks down the contributions of the two parts. The normalized ratio is computed as the energy improvement contributed by the host CPU or the CiM module over the total improvement. It is interesting to note that the energy saving is mainly contributed by the host side, which is expected due to the reduced number of memory accesses. For the CiM module, however, we have mixed results: some benchmarks show positive savings while others show negative savings (i.e., the CiM module itself does not help the total energy saving). Thus, the benchmarks with positive CiM contributions can be considered more CiM-sensitive than those with negative contributions.

Combining the findings of this subsection and the previous one, we can see that, for a particular architecture, a data-sensitive benchmark is not necessarily CiM-sensitive. We therefore propose a three-entry metric, {CiM-supported access ratio, total energy improvement, cache contribution percentage}, to assess the CiM-sensitivity of a benchmark. With this metric, we can further divide the benchmarks into CiM-sensitive and CiM-insensitive groups with a clustering algorithm. For example, among the benchmarks above, "DT", "LCS", "M2D", "PRANK" and "astar" are classified as CiM-sensitive and hence are more suitable for CiM based system evaluations.
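The sketch below illustrates such a classification, with k-means (via scikit-learn) standing in for "a clustering algorithm"; the metric values are illustrative placeholders, not measured data:

from sklearn.cluster import KMeans
import numpy as np

bench = ["NB", "DT", "LCS", "M2D", "PRANK", "astar", "BFS", "Mcf"]
metric = np.array([          # ratio, total improvement (x), cache share
    [0.6, 1.8, -0.05], [1.4, 4.9, 0.22], [1.6, 5.5, 0.30],
    [0.9, 4.1, 0.18],  [1.3, 4.6, 0.20], [1.2, 5.0, 0.25],
    [0.7, 1.5, -0.10], [0.8, 2.0, 0.02],
])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(metric)
for b, l in zip(bench, labels):
    print(f"{b:6s} -> cluster {l}")   # one cluster is the CiM-sensitive set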

6.4 Impact of System Configuration and Architecture

In this subsection, we vary the cache size and associativity to explore the impact of system configuration and architecture on a CiM system. Figure 14 illustrates the results for three cache configurations: (i) 32KB/4-way L1 and 256KB/8-way L2, (ii) 64KB/4-way L1 and 256KB/8-way L2, and (iii) 64KB/4-way L1 and 2MB/8-way L2. It is clear that most applications (e.g., NB, LCS, SSSP) obtain higher benefits with larger cache sizes. However, while a larger cache helps CiM, the energy per operation also increases (as shown in Table 3), which partially offsets the benefit from CiM.

In addition, we investigate the impact of which cache levels support CiM. Figure 15 depicts the energy improvements when CiM instructions are supported by L1 only, by L2 only, and by both. In general, applications exhibit lower energy improvements when CiM is only supported by L2, due to the more frequent L1 accesses in a system with a complete memory hierarchy as well as the smaller energy overhead of CiM operations in L1. These experiments demonstrate that Eva-CiM, with its CiM modeling, gives researchers the capability to investigate the best configuration for the system.

6.5 Impact of Technology

Finally, we use Eva-CiM to explore the energy benefits of different device technologies for CiM. Figure 16 compares CMOS SRAM and FeFET-RAM, with energy improvements normalized to a non-CiM baseline using CMOS SRAM. We observe that the energy improvements for FeFET based CiM are about 50-70% higher, consistently across all the benchmarks.

7 Conclusion

This paper presents a system-level energy evaluation framework, Eva-CiM, to predict the energy consumption of CiM based systems for different architectures/configurations, technologies and benchmarks. Unlike prior work, Eva-CiM relies on a novel IDG based analyzer to automatically detect offloading candidates for CiM and uses multi-level modeling to provide comprehensive evaluations of a CiM based system. Eva-CiM is capable of conducting quantitative investigations and rapid design exploration, thereby further establishing the feasibility of CiM for wide adoption in the near future.

We validate Eva-CiM against two existing works, [4] and [3], with respect to both access count and energy consumption. We then investigate various data-sensitive benchmarks to explore the number of CiM-supported memory accesses and the energy improvement over a non-CiM design. We find that, for a system with a multi-level memory hierarchy, a data-sensitive benchmark is not necessarily CiM-sensitive. Moreover, larger memory sizes are not necessarily beneficial to CiM, due to the increased energy per CiM operation in the CiM module itself. Finally, Eva-CiM evaluates the impact of different architecture configurations and technologies, and the results show that CiM can provide 1.3-6.0× energy improvement for SRAM and 2.0-7.9× for FeFET-RAM.

References

  • [1] N. L. Binkert, B. M. Beckmann, G. Black, S. K. Reinhardt, A. G. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. S. B. Altaf, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
  • [2] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009, pp. 469–480.
  • [3] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “DESTINY: A Tool for Modeling Emerging 3D NVM and eDRAM Caches,” in IEEE/ACM Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, pp. 1543–1546.
  • [4] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in Memory With Spin-Transfer Torque Magnetic RAM,” IEEE Trans. Very Large Scale Integr. Syst., vol. 26, no. 3, pp. 470–483, Mar. 2018.
  • [5] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
  • [6] S. Paul and S. Bhunia, Computing with Memory for Energy-Efficient Robust Systems, 1st ed.   Springer New York, 2014.
  • [7] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, “Gpus and the future of parallel computing,” IEEE Micro, vol. 31, no. 5, pp. 7–17, 2011.
  • [8] M. Gokhale, B. Holmes, and K. Iobst, “Processing in Memory: The Terasys Massively Parallel PIM Array,” IEEE Computer, vol. 28, no. 4, pp. 23–31, April 1995.
  • [9] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, “Intelligent ram (iram): chips that remember and compute,” in IEEE International Solid-State Circuits Conference, 1997, pp. 224 – 225.
  • [10] M. Oskin, F. T. Chong, and T. Sherwood, “Active pages: a computation model for intelligent memory,” in IEEE/ACM Symposium on Computer Architecture (ISCA), July 1998, pp. 192–203.
  • [11] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, “Smart memories: a modular reconfigurable architecture,” in IEEE/ACM Symposium on Computer Architecture (ISCA), June 2000, pp. 161–171.
  • [12] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca, “The Architecture of the DIVA Processing-in-memory Chip,” in ACM International Conference on Supercomputing (ICS), 2002, pp. 14–25.
  • [13] Hybrid Memory Cube Consortium, “Hybrid Memory Cube Specification Rev. 2.0 (2013).”
  • [14] A. Sebastian, T. Tuma, N. Papandreou, M. Le Gallo, L. Kull, T. Parnell, and E. Eleftheriou, “Temporal correlation detection using computational phase-change memory,” Nature Communications, vol. 8, no. 1, p. 1115, 2017.
  • [15] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2015, pp. 283–295.
  • [16] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,” in ACM International Symposium on High-performance Parallel and Distributed Computing (HPDC), 2014, pp. 85–98.
  • [17] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 105–117, Jun. 2015.
  • [18] B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3d-stacked DRAM,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2015, pp. 131–143.
  • [19] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2015, pp. 336–348.
  • [20] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation,” in IEEE International Conference on Computer Design (ICCD), Oct 2016, pp. 25–32.
  • [21] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 27–39, Jun. 2016.
  • [22] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 481–492.
  • [23] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 12, pp. 4219–4232, Dec 2018.
  • [24] D. Reis, M. Niemier, and X. S. Hu, “Computing in memory with fefets,” in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2018, pp. 24:1–24:6.
  • [25] J. Liu, H. Zhao, M. Ogleari, D. Li, and J. Zhao, “Processing-in-memory for energy-efficient neural network training : A heterogeneous approach,” in IEEE/ACM International Symposium on Microarchitecture, 2018, pp. 185–197.
  • [26] Z. Chowdhury, J. D. Harms, S. K. Khatamifard, M. Zabihi, Y. Lv, A. P. Lyle, S. S. Sapatnekar, U. R. Karpuzcu, and J. Wang, “Efficient In-Memory Processing Using Spintronics,” IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 42–46, Jan 2018.
  • [27] M. Xie, S. Li, A. O. Glova, J. Hu, Y. Wang, and Y. Xie, “AIM: Fast and energy-efficient AES in-memory implementation for emerging non-volatile main memory,” in IEEE/ACM Design, Automation Test in Europe Conference Exhibition (DATE), March 2018, pp. 625–628.
  • [28] Synopsys Inc., “HSPICE Version J-2014.09-SP2,” 2014.
  • [29] G. Singh, L. Chelini, S. Corda, A. J. Awan, S. Stuijk, R. Jordans, H. Corporaal, and A.-J. Boonstra, “A review of near-memory computing architectures: Opportunities and challenges,” in Euromicro Conference on Digital System Design, 08 2018, pp. 1–10.
  • [30] JEDEC Solid State Technology Association, “High Bandwidth Memory (HBM) DRAM,” Standard JESD235, 2013.
  • [31] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, “MAGIC-Memristor-Aided Logic,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895–899, Nov 2014.
  • [32] A. Jaiswal, A. Agrawal, and K. Roy, “In-situ, in-memory stateful vector logic operations based on voltage controlled magnetic anisotropy,” Scientific Reports, vol. 8, no. 1, p. 5738, 2018.
  • [33] M. Zabihi, Z. Chowdhury, Z. Zhao, U. R. Karpuzcu, J. Wang, and S. Sapatnekar, “In-memory processing on the spintronic CRAM: From hardware design to application mapping,” IEEE Transactions on Computers, pp. 1–1, 2018.
  • [34] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Buddy-ram: Improving the performance and efficiency of bulk bitwise operations using DRAM,” CoRR, vol. abs/1611.09988, 2016.
  • [35] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in IEEE/ACM Design Automation Conference, 2016, pp. 173:1–173:6.
  • [36] L.-Y. Huang, M.-F. Chang, C.-H. Chuang, C.-C. Kuo, C.-F. Chen, G.-H. Yang, H.-J. Tsai, T.-F. Chen, S.-S. Sheu, K.-L. Su, F. T. Chen, T.-K. Ku, M.-J. Tsai, and M.-J. Kao, “ReRAM-based 4T2R nonvolatile TCAM with 7x NVM-stress reduction, and 4x improvement in speed-wordlength-capacity for normally-off instant-on filter-based search engines used in big-data processing,” in IEEE Symposium on VLSI Circuits, June 2014, pp. 1–2.
  • [37] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 445–456.
  • [38] X. Yin, M. Niemier, and X. S. Hu, “Design and benchmarking of ferroelectric fet based tcam,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2017, March 2017, pp. 1444–1449.
  • [39] M. Prezioso, I. Kataeva, F. Merrikh-Bayat, B. Hoskins, G. Adam, T. Sota, K. Likharev, and D. Strukov, “Modeling and implementation of firing-rate neuromorphic-network classifiers with bilayer Pt/Al2O3/TiO2-x/Pt Memristors,” in IEEE International Electron Devices Meeting (IEDM), Dec 2015, pp. 17.4.1–17.4.4.
  • [40] D. Lee, J. Park, K. Moon, J. Jang, S. Park, M. Chu, J. Kim, J. Noh, M. Jeon, B. H. Lee, B. Lee, and H. Hwang, “Oxide based nanoscale analog synapse device for neural signal recognition system,” in IEEE International Electron Devices Meeting (IEDM), Dec 2015, pp. 4.7.1–4.7.4.
  • [41] M. Jerry, P. Chen, J. Zhang, P. Sharma, K. Ni, S. Yu, and S. Datta, “Ferroelectric fet analog synapse for acceleration of deep neural network training,” in IEEE International Electron Devices Meeting (IEDM), Dec 2017, pp. 6.2.1–6.2.4.
  • [42] D. Sanchez and C. Kozyrakis, “Zsim: Fast and accurate microarchitectural simulation of thousand-core systems,” in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2013, pp. 475–486.
  • [43] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout, “An evaluation of high-level mechanistic core models,” ACM Transactions on Architecture and Code Optimization (TACO), pp. 28:1–28:25, 2014.
  • [44] Intel Corp., “Nios II Processor,” Mountain View, CA, USA, 2017.
  • [45] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994–1007, July 2012.
  • [46] K. Chen, S. Li, N. Muralimanohar, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory,” in IEEE/ACM Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 33–38.
  • [47] R. Vattikonda, W. Wang, and Y. Cao, “Modeling and minimization of pmos nbti effect for robust nanometer design,” in IEEE/ACM Design Automation Conference (DAC), 2006, pp. 1047–1052.
  • [48] A. Aziz, S. Ghosh, S. Datta, and S. K. Gupta, “Physics-based circuit-compatible spice model for ferroelectric transistors,” IEEE Electron Device Letters, vol. 37, no. 6, pp. 805–808, 2016.