I Introduction
While deep learning-powered Internet of Things (IoT) devices promise to revolutionize the way we live and work by enhancing our ability to recognize, analyze, and classify the world around us, this revolution has yet to be unleashed. IoT devices – such as smartphones, smart sensors, and drones – have limited energy and computation resources since they are battery-powered and have a small form factor. On the other hand, high-performance Convolutional/Deep Neural Networks (CNNs/DNNs) come at a cost of prohibitive energy consumption [68] and can have hundreds of layers [67] and tens of millions of parameters [72, 50]. Therefore, CNN/DNN-based applications can drain the battery of an IoT device very quickly if executed frequently [76], and require an increase in form factor for storing and executing CNNs/DNNs [11, 58]. The situation continues to worsen as CNNs/DNNs become increasingly complex in order to solve more diverse and bigger tasks [32].
To close the gap between the constrained resources of IoT devices and the growing complexity of CNNs/DNNs, many energy-efficient accelerators have been proposed [44, 45, 10, 13, 46, 73, 7, 1]. As the energy cost of CNN/DNN accelerators is dominated by memory accesses of inputs, weights, and partial sums (Psums) (see Fig. 1 (a)) (e.g., up to 95% in DianNao [13]), processing-in-memory (PIM) accelerators have emerged as a promising solution, in which the computation is moved into the memory arrays and weight movements are eliminated (see Fig. 1 (b)). Among PIM accelerators built on various memory technologies [66, 43, 42, 58, 14, 62, 56, 78], resistive-random-access-memory (ReRAM)-based PIM (RPIM) accelerators have gained extensive research interest due to ReRAM's high density (e.g., 25–50× higher than SRAM [79, 71]). However, the energy efficiency of RPIM accelerators (such as PRIME [14], ISAAC [58], and PipeLayer [62]) is still limited due to two bottlenecks (see Fig. 1 (b)): (1) although the weights are kept stationary in memory, the energy cost of data movements due to inputs and Psums is still large (as high as 83% in PRIME [14]); (2) the energy of the interfacing circuits (such as analog-to-digital converters (ADCs) and digital-to-analog converters (DACs)) is another limiting factor (as high as 61% in ISAAC [58]).
To address the aforementioned energy bottlenecks, we analyze and identify opportunities for greatly enhancing the energy efficiency of RPIM accelerators (see Section III-A), and develop three novel techniques that strive to make data movements and interfaces in PIM accelerators local and in the time domain (see Section III-B). While these three techniques are in general effective for enhancing the energy efficiency of PIM accelerators, we evaluate them in an RPIM accelerator, and demonstrate an improvement in energy efficiency of one order of magnitude over state-of-the-art RPIM accelerators. The contributions of this paper are as follows:

We propose three new ideas for aggressively improving the energy efficiency of RPIM accelerators: (1) adopting analog local buffers (ALBs) within memory crossbars for enhancing (analog) data locality, (2) time-domain interfaces (TDIs) to reduce the energy cost of a single digital-to-analog (D/A) (and analog-to-digital (A/D)) conversion, and (3) a new mapping method called only-once input read (OIR) to further reduce the number of input/Psum accesses and D/A conversions.

We develop an innovative RPIM architecture (see Section IV), TIMELY (Time-domain, In-Memory Execution, LocalitY), that integrates the three aforementioned ideas to (1) maximize (analog) data locality via ALBs and OIR and (2) minimize the D/A (and A/D) interfaces' energy cost by making use of the more energy-efficient TDIs, the ALBs, and the OIR method. TIMELY outperforms the most competitive RPIM accelerators in both energy efficiency (over PRIME) and computational density (over PipeLayer) (see Fig. 1 (c)).

We perform a thorough evaluation of TIMELY against 4 state-of-the-art RPIM accelerators on 10 CNN and DNN models under various chip configurations, and show that TIMELY achieves up to an 18.2× improvement (over ISAAC) in energy efficiency, a 31.2× improvement (over PRIME) in computational density, and a 736.6× improvement (over PRIME) in throughput, demonstrating a promising architecture for accelerating CNNs and DNNs. Furthermore, we perform ablation studies to evaluate the effectiveness of each of TIMELY's features (i.e., ALB, TDI, and OIR) in reducing energy and area costs, and demonstrate that TIMELY's innovative ideas can be generalized to other RPIM accelerators.
II Background
This section provides the background of RPIM CNN/DNN accelerators. First, we introduce CNNs and the input reuse opportunities in CNNs' convolutional (CONV) operations in Section II-A, and ReRAM basics in Section II-B. Second, we compare digital-to-time converters (DTCs)/time-to-digital converters (TDCs) and DACs/ADCs, which are two types of digital-to-analog (D/A) and analog-to-digital (A/D) converters, in terms of energy cost and accuracy in Section II-C.
II-A CNN and Input Reuse
CNNs are composed of multiple CONV layers. Given the CNN parameters in Table I, the computation in a CONV layer can be described as:
O[z][u][x][y] = B[u] + \sum_{k=0}^{C-1} \sum_{i=0}^{R-1} \sum_{j=0}^{S-1} I[z][k][Ux+i][Uy+j] \times W[u][k][i][j],
with 0 ≤ z < N, 0 ≤ u < M, 0 ≤ x < E, and 0 ≤ y < F,   (1)
where O, I, W, and B denote matrices of the output feature maps, input feature maps, filters, and biases, respectively. Fully-connected (FC) layers typically follow the CONV layers. Different from CONV layers, the filters of FC layers are of the same size as the input feature maps [10]. Equation (1) can describe FC layers with additional constraints, i.e., R = H, S = W, E = 1, and F = 1.
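For concreteness, the following sketch spells out the CONV loop nest of Equation (1) and the per-pixel input-reuse count discussed in this subsection. The symbols (M output channels, C input channels, R×S filters, stride U) and all function names are our own notation, not code from the paper.

```python
# Naive CONV loop nest (Eq. (1)) for a single image (batch size N = 1),
# plus the per-pixel input-reuse count M*R*S/U^2 used in Section II-A.

def conv2d(I, W, B, U=1):
    """I[C][H][Wd]: inputs, W[M][C][R][S]: filters, B[M]: biases."""
    C, H, Wd = len(I), len(I[0]), len(I[0][0])
    M, R, S = len(W), len(W[0][0]), len(W[0][0][0])
    E, F = (H - R) // U + 1, (Wd - S) // U + 1  # output height / width
    O = [[[B[u] for _ in range(F)] for _ in range(E)] for u in range(M)]
    for u in range(M):                 # output channels
        for x in range(E):             # output rows
            for y in range(F):         # output columns
                for k in range(C):     # input channels
                    for i in range(R):
                        for j in range(S):
                            O[u][x][y] += I[k][U*x + i][U*y + j] * W[u][k][i][j]
    return O

def input_reuse(M, R, S, U):
    # Each interior input pixel is touched by M filters and R*S/U^2 slide positions.
    return M * R * S // (U * U)
```

With M = 2, R = S = 2, and U = 1 this gives the 8-fold reuse of the Fig. 2 example.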
Three types of input reuse exist in CNN CONV operations yielding 3D Psums. Consider the example in Fig. 2, where N and C are set to 1 for simplicity because input reuses are independent of them. First (see Fig. 2 (a)), one input feature map is shared by the filters of multiple (e.g., two in Fig. 2) output channels. Second (see Fig. 2 (b)), as filters slide horizontally, input pixels are reused to generate outputs in the same row – e.g., an input pixel is used twice to generate two horizontally adjacent outputs. Third (see Fig. 2 (c)), as filters slide vertically, input pixels are reused to generate outputs in the same column. Given a layer with M output channels, a filter size of R × S, and a stride of U, each input pixel is reused M·R·S/U² times [74]. For example, one input pixel is reused 8 times in Fig. 2, where M = 2, R = S = 2, and U = 1. Note that weights are private in DNN CONV operations, which is the main difference between CNNs and DNNs [82].
II-B ReRAM Basics
ReRAM is a type of non-volatile memory that stores data through resistance modulation [71, 69, 65, 39, 30]. An ReRAM cell with a metal-insulator-metal (MIM) structure consists of top/bottom electrodes and a metal-oxide layer [71]. Analog multiplication can be performed in ReRAM cells (see Fig. 3 (a)), with the biased voltages serving as inputs, the ReRAM cells' conductances as weights, and the resulting currents as outputs. Addition is realized through current summing among ReRAM cells in the same column [28, 77] (see Fig. 3 (a)). At the circuit level, digital inputs are read from an input memory, converted to analog voltages by DACs, and then applied to the ReRAM cells. The resulting analog Psums are converted to digital values by ADCs, and then stored back into an output memory.
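A minimal sketch of the analog compute model just described: voltages are the inputs, cell conductances (1/resistance) are the weights, and each column's output current is the sum of per-cell currents (Ohm's law plus Kirchhoff's current law). All values and names are illustrative.

```python
# Model of an ReRAM crossbar column: I_j = sum_i V_i * G_ij, i.e., each
# column computes a dot product between the row voltages and its conductances.

def crossbar_column_currents(voltages, conductances):
    """voltages: one per row; conductances[i][j]: cell at row i, column j."""
    rows, cols = len(conductances), len(conductances[0])
    return [sum(voltages[i] * conductances[i][j] for i in range(rows))
            for j in range(cols)]
```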
CNN Params | Description
N | batch size of 3D feature maps
C / M | number of input / output channels
H / W | input feature map height / width
R / S | filter height / width
U | stride
E / F | output feature map height / width
Arch. Params | Description
# of ReRAM bit cells in one crossbar array
# of ReRAM crossbar arrays in one subChip
the resistance of the ReRAM bit cell at a given row and column of a crossbar array
the time input for a row of an ReRAM crossbar array
the time Psum for 8-bit inputs and 4-bit weights
the logic-high voltage of the time-domain signals
the threshold voltage of a comparator
the charging capacitance
the unit delay of a DTC/TDC
# of rows/columns sharing one DTC/TDC in one ReRAM crossbar array
the reset phase of a subChip (reset: = 1)
the number of subChips in one TIMELY chip
the potential error of one XsubBuf
Energy Params  Description 
the energy of one conversion in DTC  
the energy of one conversion in TDC  
the energy of one conversion in DAC  
the energy of one conversion in ADC  
the unit energy of accessing PsubBuf  
the unit energy of accessing XsubBuf  
the unit energy of accessing ReRAM input/output buffers 
II-C DTCs/TDCs vs. DACs/ADCs
As shown in Fig. 3 (b), DTCs/TDCs can perform the conversion between an analog time signal and the corresponding digital signal; DACs/ADCs do so between an analog voltage signal and the digital signal. One digital signal can be represented as a time delay with a fixed high/low voltage (corresponding to 1/0) in the time domain (see Fig. 3 (b)) [12, 8, 3, 5], or as a voltage in the voltage domain. Compared with a DTC/TDC, which can be implemented using digital circuits [16, 4, 52, 80, 51, 40], a DAC/ADC typically relies on analog circuits that (1) are more power consuming, (2) are vulnerable to noise and process, voltage, and temperature (PVT) variations, and (3) benefit much less from process scaling in energy efficiency [49].
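The time-domain representation can be sketched in a few lines: an n-bit code maps to a delay of code × t_unit, and a TDC inverts the mapping by counting unit delays. The 50 ps unit delay matches the value TIMELY adopts later (Section IV-C); the function names are ours.

```python
# Sketch of time-domain D/A and A/D conversion: a digital code becomes a
# delay (DTC), and a delay is quantized back to a code (TDC).

T_UNIT = 50e-12  # 50 ps unit delay (value used by TIMELY's DTC/TDC)

def dtc(code, bits=8):
    assert 0 <= code < 2**bits
    return code * T_UNIT          # delay of the signal edge, in seconds

def tdc(delay, bits=8):
    code = round(delay / T_UNIT)  # count how many unit delays fit
    return max(0, min(code, 2**bits - 1))
```

An 8-bit code of 255 thus spans a delay of 255 × 50 ps = 12.75 ns, comfortably within the 25 ns conversion time quoted in Section IV-C.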
III Opportunities and Innovations
This section aims to answer the question: "how can TIMELY outperform state-of-the-art RPIM accelerators?" Note that all parameters used in this section are summarized in Table I.
III-A Opportunities
We first identify three opportunities for greatly reducing the energy costs of RPIM accelerators by analyzing the performance limitations of state-of-the-art designs. Specifically, Opportunity #1 is motivated by the energy bottlenecks of (1) input and Psum movements (i.e., Bottleneck ❶ in Fig. 1 (b)) and (2) interfacing circuits (i.e., Bottleneck ❷ in Fig. 1 (b)); Opportunity #2 is inspired by the bottleneck of interfacing circuits; and Opportunity #3 is motivated by both types of bottlenecks.
Opportunity #1. Enhancing (analog) data locality can greatly reduce the energy/time costs of both data movements and D/A and A/D interfaces. We identify this opportunity based on the following considerations. Since in-ReRAM processing computes in the analog domain, the operands, including inputs, weights, and Psums, are all analog. If we can mostly access analog operands locally, we can expect large energy savings associated with input and Psum movements and can largely remove the need to activate D/A and A/D interfaces. In prior works, the input/Psum movements and interfaces dominate the energy cost of RPIM accelerators. First, input and Psum accesses involve energy-hungry data movements. While weights stay stationary in RPIM accelerators, input and Psum accesses are still needed. Although one input/Psum access can be shared by ReRAM cells in the same row/column, a large number of input and Psum accesses are still required for dot-product operations in an ReRAM crossbar array. For example, more than 55 million inputs and 15 million Psums need to be accessed during VGG-D [61] and ResNet-50 [26] inferences, respectively (see Fig. 4 (a)). While inputs require only memory reads, Psums involve both memory writes and reads, resulting in a large energy cost. As an example, 36% and 47% of the total energy in PRIME [14] are spent on input and Psum accesses, respectively (see Fig. 4 (b)). Second, voltage-domain D/A and A/D conversions involve a large energy cost. For example, in PRIME, aside from the data movement energy, most of the remaining energy cost is consumed by D/A and A/D conversions (see Fig. 4 (b)).
Opportunity #2. Time-domain interfacing can reduce the energy cost of a single D/A (and A/D) conversion. Since time-domain D/A and A/D conversion is more energy efficient than voltage-domain conversion (see Section II-C), we have an opportunity to use DTCs and TDCs for interfacing between the digital signals stored in memory and the analog signals computed in ReRAM crossbar arrays. In prior works, DACs and ADCs limit the energy efficiency of RPIM accelerators. Although ISAAC optimizes the energy cost of its DAC/ADC interface, the interface energy is still as large as 61% of the total in ISAAC (see Fig. 4 (c)). Specifically, ISAAC [58] decreases the number of ADCs by sharing one ADC among 128 ReRAM bitlines, and thus the ADC sampling rate increases by 128×, increasing the energy cost of each A/D conversion.
Opportunity #3. Reducing the number of input accesses can save the energy cost of both input accesses and D/A conversions. We find that the input reuse of CNNs can still be improved over prior works, reducing the energy overhead of input accesses and the corresponding interfaces. Though each input connected to one row of an ReRAM array is naturally shared by the ReRAM cells along that row, each input on average still has to be accessed multiple times. Taking ISAAC [58] as an example, each access of one 16-bit input involves unit eDRAM read energy (4416× the energy of a 16-bit ReRAM MAC), input register file read energy (264.5× the energy of a 16-bit ReRAM MAC), and D/A conversion energy (109.7× the energy of a 16-bit ReRAM MAC). For MSRA-3 [31], adopted by ISAAC, each input of the CONV layers is read, and the interfaces activated, 47 times on average.
III-B TIMELY Innovations
The three aforementioned opportunities inspire us to develop three innovations in TIMELY for greatly improving energy efficiency. Fig. 5 (a) and (b) show a conceptual view of the difference between existing RPIM accelerators and TIMELY. Specifically, TIMELY mostly moves data in the analog domain, as compared to the fully digital data movements in existing designs, and adopts DTCs and TDCs instead of DACs and ADCs for interfacing.
Innovation #1. TIMELY adopts ALBs to aggressively enhance (analog) data locality, leading to a large reduction in data movement energy costs per input and per Psum compared with existing designs. Multiple subChips compose one chip. One key difference between TIMELY and existing RPIM accelerators resides in their subChip design (see Fig. 5 (a) vs. (b)). Specifically, each crossbar (i.e., CB in Fig. 5) in existing designs fetches inputs from a high-cost memory (e.g., the input buffers in Fig. 5 (a)); therefore, each subChip pays a high-cost memory access for every crossbar's inputs. In TIMELY (see Fig. 5 (b)), an input fetched from the high-cost memory is shared by one row of the subChip thanks to the adopted local ALB buffers (e.g., the XsubBufs in Fig. 5 (b)) that are sandwiched between the crossbar arrays, greatly lowering the energy cost of handling the same number of inputs (see Fig. 5 (c)). Similarly, each crossbar in existing RPIM accelerators directly writes and reads Psums to and from the high-cost output buffers, whereas in TIMELY the Psums in each column of the subChip are accumulated before being written back to the output buffers, reducing the energy cost per Psum (see Fig. 5 (c)). Furthermore, accessing the high-cost memory requires about one order of magnitude more energy than accessing a local buffer: the average energy of one high-cost memory access in PRIME is about an order of magnitude higher than that of the PsubBufs and XsubBufs [79] in TIMELY (a ratio of roughly 10× in PRIME). Therefore, about an order of magnitude energy reduction for handling input/Psum accesses can be achieved in TIMELY. Additionally, the much reduced input/output buffer size requirements in TIMELY make it possible to eliminate inter-subChip memory (see Fig. 6 (a) and Fig. 9 (c)), leading to additional energy savings.
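As a rough illustration of Innovation #1's arithmetic, the sketch below models per-input access energy with and without ALBs. The crossbar count k = 8 and the 10× high-cost/local energy ratio are illustrative assumptions, not TIMELY's reported values.

```python
# Back-of-envelope model: in existing designs every crossbar in a subChip row
# fetches each input from the high-cost buffer; with ALBs only the first
# crossbar does, and the rest read the input from a cheap local buffer.

def input_access_reduction(k, e_high, e_local):
    """k: crossbars sharing one input along a subChip row."""
    existing = k * e_high                 # every crossbar hits the high-cost buffer
    timely = e_high + (k - 1) * e_local   # one high-cost read + local forwards
    return existing / timely

# Illustrative numbers (assumptions): 8 crossbars per row, 10x energy ratio.
ratio = input_access_reduction(k=8, e_high=10.0, e_local=1.0)
```

Under these assumed numbers the per-input access energy drops by roughly 4.7×; the savings grow with the energy ratio and the number of crossbars per row.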
Innovation #2. TIMELY adopts TDIs and ALBs to minimize the energy cost of a single conversion and the total number of conversions, respectively. As a result, TIMELY greatly reduces the interfacing energy cost per input and per Psum compared with current practices. It is well recognized that the energy cost of ADC/DAC interfaces is another bottleneck in existing RPIM accelerators, in addition to that of data movements. For example, the energy cost of the ADCs and DACs in ISAAC accounts for 61% of its total energy cost. In contrast, TIMELY adopts (1) TDCs/DTCs instead of ADCs/DACs to implement the interfacing circuits of the crossbars and (2) only one TDC/DTC conversion for each row/column of one subChip, whereas each row/column of each crossbar needs one ADC/DAC conversion in existing designs. The resulting reduction factors per input and per Psum are about 50× and 20× [52, 41, 63, 38, 58], respectively.
Innovation #3. TIMELY employs OIR to further reduce the number, and thus the energy cost, of input accesses and D/A conversions. As accessing the input and output buffers in subChips costs about one order of magnitude more energy than accessing the local buffers between the crossbar arrays (see the left part of Fig. 5 (d)), we propose an OIR strategy that increases the input reuse opportunities to minimize the cost of input accesses and the associated D/A conversions.
IV TIMELY Architecture
In this section, we first give an architecture overview (see Section IV-A), and then describe how the TIMELY architecture integrates the three innovations for aggressively improving energy efficiency in Sections IV-B, IV-C, and IV-D, respectively. In addition, we introduce our pipeline design for enhancing throughput in Section IV-E and the software-hardware interface design for offering programmability in Section IV-F. Parameters are summarized in Table I.
IV-A Overview
Fig. 6 (a) shows the TIMELY architecture, which consists of a number of subChips connected via a bus [14, 58]. Specifically, each subChip includes DTCs/TDCs (on the left/at the bottom), ReRAM input/output buffers (on the left/at the bottom), ReRAM crossbars (see ❹ in Fig. 6 (a)), a mesh grid of local ALB buffers – i.e., XsubBufs (see ❶ in Fig. 6 (a)) and PsubBufs (see ❷ in Fig. 6 (a)) – between the ReRAM crossbar arrays, current adders (i.e., Iadders, ❸ in Fig. 6 (a)), and a block of shift-and-add, ReLU, and max-pooling units.
The TIMELY architecture processes CNN/DNN inference as follows. The pretrained weights are preloaded into TIMELY's ReRAM arrays. Inputs of the CNN/DNN layers are fetched into the input buffers of the subChip or subChips that handle the corresponding layers, starting from the first CNN/DNN layer. Within each subChip, the inputs are applied to the DTCs for converting the digital inputs into analog time signals, which are then shared by the ReRAM bit cells in the same row of all crossbar arrays along the horizontal direction to perform dot products with the corresponding resistive weights. The calculated Psums at the same column of all crossbars in the vertical direction are aggregated in the Iadders, and converted into a voltage signal and then an analog time signal by a charging unit and comparator block (see ❺ in Fig. 6 (a)) before being converted into a digital signal via a TDC. Note that the output of each PsubBuf is connected to the Iadder separately. Finally, the resulting digital signals are applied to the block of shift-and-add, ReLU, and max-pooling units, and then written to the output buffers.
IV-B Enhancing (Analog) Data Locality
Within each subChip of TIMELY, the converted inputs and calculated Psums are moved in the analog domain with the aid of the adopted ALBs (see Fig. 6 (a)) – after the digital inputs are converted into time signals by the DTCs and before the Psums are converted into digital signals by the TDCs. In this subsection, we first introduce the data movement mechanism and then present the operation of the local analog buffers.
Data Movement Mechanism. In the TIMELY architecture, time inputs from the DTCs move horizontally across the ReRAM crossbar arrays in the same row via XsubBufs (see ❶ in Fig. 6 (a)) for maximizing input reuse and minimizing high-cost memory accesses. Meanwhile, the resulting current Psums move vertically via PsubBufs (see ❷ in Fig. 6 (a)). Note that only the crossbars in the leftmost column fetch inputs from the DTCs, while those in all the remaining columns fetch inputs from their analog local time buffers (i.e., the XsubBufs to their left). Similarly, only the outputs of the Iadders are converted into digital signals via TDCs before being stored back into the output buffers, while the current outputs of the crossbars are passed into the Iadders via analog current buffers (i.e., the PsubBufs right below them). In this way, TIMELY processes most data movements in the analog domain within each subChip, greatly enhancing data locality for improving energy efficiency and throughput.
Local Analog Buffers. The local analog buffers make it possible to handle most (analog) data movements locally in TIMELY. Specifically, an XsubBuf buffers the time signals (i.e., the outputs of the DTCs) by latching them, i.e., copying the input delay time to the latch outputs (see Fig. 6 (b)); a PsubBuf buffers the current signal output from the ReRAM crossbar array, i.e., copying the input current to its output (see Fig. 6 (c)). The key is that XsubBuf and PsubBuf are more energy and area efficient than the input/output buffers (see Fig. 5 (d)). Specifically, an XsubBuf consists of two cross-coupled inverters that form a positive feedback to speed up the response at its output and thus reduce the delay between its inputs and outputs [70]. Since cross-coupled inverters invert the input, a third inverter is used to invert the signal back. XsubBufs are reset in each pipeline cycle by setting the reset signal high (see Fig. 6 (b)). The PsubBuf is implemented using an NMOS-pair current mirror (see Fig. 6 (c)) [37].
IV-C Time-Domain Dot Products and DTC/TDC Interfacing
TIMELY performs dot products with time-domain inputs from the DTCs and converts the time-domain dot-product results into digital signals via TDCs. In this subsection, we first present the dot-product operations in TIMELY and then introduce their associated DTCs/TDCs.
Dot Products. First, let us consider the Psums in one ReRAM crossbar array. Take the first column of the ReRAM crossbar array in Fig. 6 (e) as an example. The time-domain inputs T_i are applied to their corresponding ReRAM bit cells, whose resistance values R_i correspond to the weights, to generate a Psum current based on Kirchhoff's law. Then, let us focus on the Psums in one subChip. The Psum currents at the same column of all crossbars in the vertical direction are aggregated in the Iadder [2] (see ❸ in Fig. 6 (a)), and then converted into a voltage by charging a capacitor C_s (see Fig. 6 (e)). Fig. 6 (g) shows the input/output characteristic of the dot product. We adopt a 2-phase charging scheme [5]. In phase I, the charging time of each row is its time input T_i and the charging current is V/R_i, which corresponds to the weight. In phase II, the charging time is the remainder of the phase and the charging current is a constant, equal to V/R_min, where R_min is the minimum mapped resistance of one layer. The charging in phase II ensures that the voltage on C_s rises above V_TH, and the time output is defined by the moment the comparator flips. V_TH is the threshold voltage of the comparator, which is equal to V·T_ph/(R_min·C_s), where V is the logic-high voltage of the time signal and T_ph is the time period of one phase. Based on charge conservation, we can derive the output (see Fig. 6 (e)), where 8b/4b represents 8-bit inputs and 4-bit weights, to be:
T_out^{8b/4b} = R_min · \sum_i (T_i / R_i).   (2)
To realize dot products with 8-bit weights and inputs, we employ a subranging design [47, 84, 22] in which each 8-bit weight is mapped into two adjacent bit-cell columns holding the top 4 most-significant-bit (MSB) weights and the remaining 4 least-significant-bit (LSB) weights, respectively. Separate charging capacitors are associated with the MSB-weight column and the LSB-weight column. The time outputs of the MSB-weight column and the LSB-weight column are added to get the dot-product result for 8-bit weights.
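The subranging arithmetic can be illustrated digitally: split each 8-bit weight into 4-bit halves, accumulate a Psum per column, then shift-and-add the two columns' results. Function names are ours; in hardware the combination is applied to the two columns' time outputs.

```python
# Digital illustration of subranging: an 8-bit weight w splits into a 4-bit
# MSB column and a 4-bit LSB column, and the two column Psums are recombined
# as (msb_psum << 4) + lsb_psum.

def split_weight(w):
    assert 0 <= w < 256
    return w >> 4, w & 0xF            # (msb, lsb), each 4 bits

def combine(msb_psum, lsb_psum):
    return (msb_psum << 4) + lsb_psum  # weight the MSB column by 2**4
```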
DTCs/TDCs. We adopt 8-bit DTCs/TDCs for TIMELY based on the measurement-validated designs in [52, 41]. The input/output characteristic of an 8-bit DTC is shown in Fig. 6 (f), where the digital codes "11111111" and "00000000" correspond to the time-domain analog signals with the maximum and minimum delays, respectively, and the dynamic range of the time-domain analog signals is 255 unit delays. Meanwhile, a TDC's input/output characteristic can also be viewed in Fig. 6 (f) by swapping the two axes. In TIMELY, the unit delay is designed to be 50 ps, leading to a conversion time of 25 ns (including a design margin) for the 8-bit DTC/TDC. In addition, to trade off energy efficiency and computational density, one DTC/TDC is shared by multiple (>1) ReRAM crossbar rows/columns.
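To sanity-check the two-phase charging scheme, the sketch below simulates phase-I charging with per-row currents V/R_i for times T_i, followed by phase-II charging with the constant current V/R_min until a comparator threshold is crossed. We assume a threshold of V_TH = V·T_ph/(R_min·C), under which the time output reduces to R_min·Σ T_i/R_i; all component values are illustrative, not TIMELY's.

```python
# Numerical check of the two-phase charging dot product. Phase I deposits
# charge q1 = sum(V*T_i/R_i); phase II adds current V/R_min until the
# capacitor voltage reaches v_th, and the output time is measured from the
# comparator flip. Values below are illustrative assumptions.

V, C, T_PH, R_MIN = 1.2, 1e-12, 25e-9, 1e5  # volts, farads, seconds, ohms

def two_phase_t_out(T, R):
    v_th = V * T_PH / (R_MIN * C)                  # assumed comparator threshold
    q1 = sum(V * t / r for t, r in zip(T, R))      # phase-I charge
    t2 = (C * v_th - q1) / (V / R_MIN)             # phase-II time to threshold
    return T_PH - t2                               # time-domain output

def closed_form(T, R):
    # Expected result: a time-encoded dot product of inputs T and weights R_MIN/R.
    return R_MIN * sum(t / r for t, r in zip(T, R))
```

With T = [5 ns, 10 ns] and R = [200 kΩ, 400 kΩ], both paths give a 5 ns output, confirming the scheme computes a time-encoded dot product.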
IV-D TIMELY's Only-Once Input Read Mapping Method
OIR follows three principles: (1) to reuse the inputs across different filters, we map these filters in parallel within the crossbar arrays (see Fig. 7 (a)); (2) to reuse the inputs when sliding the filter vertically within an input feature map, we duplicate the filters with a shifted offset determined by the filter height and the stride (see Fig. 7 (b)); and (3) to reuse the inputs when sliding the filter horizontally within an input feature map, we transfer inputs to the adjacent XsubBufs with a fixed step (see Fig. 7 (c)). Single-direction input transfer between adjacent XsubBufs can be implemented by introducing only one switch and one control signal per XsubBuf.
IV-E Pipeline Design
To enhance throughput, we adopt pipelining between and within subChips, i.e., inter-subChip and intra-subChip pipelines. Different subChips work in a pipelined fashion. Note that a layer-by-layer weight mapping strategy is adopted in TIMELY, where one CNN/DNN layer is mapped into one subChip if the ReRAM crossbars' capacity is larger than the required size; otherwise, a layer is mapped into multiple subChips. Within one subChip, the following operations – reading inputs from the input buffers, digital-to-time conversion (DTCs), analog-domain computation (including dot-product and charging-and-comparison operations), time-to-digital conversion (TDCs), and writing back to the output buffers – are pipelined. The pipeline-cycle time is determined by the slowest stage. Let us take the operations within one subChip as an example to illustrate the pipeline in TIMELY. Assuming the first data item is read from an input buffer at the first cycle, it takes three cycles to complete the digital-to-time conversion, analog-domain computation, and time-to-digital conversion, and the item is written back to an output buffer at the fifth cycle. Meanwhile, at the fifth cycle, the fifth, fourth, third, and second data items are read, converted by a DTC, computed in the analog domain, and converted by a TDC, respectively.
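The five-stage schedule described above can be sketched as follows (stage names are ours):

```python
# Intra-subChip pipeline model: read -> DTC -> analog compute -> TDC -> write.
# One item enters per pipeline cycle, so item d occupies stage (cycle - d).

STAGES = ["read", "dtc", "compute", "tdc", "write"]

def stage_of(data_idx, cycle):
    """Stage occupied by data item `data_idx` (1-based) at `cycle` (1-based)."""
    s = cycle - data_idx              # 0 = read, ..., 4 = write-back
    return STAGES[s] if 0 <= s < len(STAGES) else None
```

At cycle 5, item 1 is written back while items 2 through 5 occupy the TDC, compute, DTC, and read stages, matching the walk-through above.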
IV-F Software-Hardware Interface
A software-hardware interface is adopted to allow developers to configure TIMELY for different CNNs/DNNs, enabling programmability. Similar to the interface in PRIME, three stages are involved from software programming to hardware execution. First, the CNN/DNN model is loaded into an NN parser [83] that automatically extracts the model parameters. Second, with the extracted parameters, a compiler optimizes the mapping strategy to increase the utilization of the ReRAM crossbar arrays and then generates execution commands (including commands for weight mapping and input data path configuration). Third, the controller (see Fig. 6 (a)) loads the commands from the compiler to (1) write the pretrained weights to the mapped addresses, and (2) configure the peripheral circuits for setting up the input paths of computation.
V Discussion
Although local buffers have been adopted in digital accelerators [10, 82], using local buffers in RPIMs is challenging because: (1) an improper design can largely compromise RPIMs' high computational density, and (2) it may cause more frequent, large-overhead A/D and D/A conversions. To the best of our knowledge, TIMELY is the first to implement and maximize analog data locality via ALBs, which have at least one order of magnitude lower access energy cost than the two-level memories in PRIME [14]/ISAAC [58]/PipeLayer [62]. Additionally, TIMELY maximizes data locality without degrading RPIMs' computational density. Although a recent RPIM accelerator, CASCADE [15], has adopted analog buffers, it only uses an analog ReRAM buffer to reduce the number of A/D conversions, thereby minimizing the computational energy. TIMELY uses ALBs to minimize both the computational energy and the data movement energy. Taking PRIME as an example, the computational energy accounts for only 17% of the chip energy. To minimize the computational energy, TIMELY not only reduces the number of A/D conversions via ALBs, but also decreases the energy of each A/D conversion via TDCs.
Analog computation and local buffers are efficient, but they potentially introduce accuracy loss into TIMELY. The accuracy loss is mainly attributed to the non-ideal characteristics of analog circuits. To address this challenge, TIMELY not only leverages the algorithmic resilience of CNNs/DNNs to counter hardware vulnerability [9, 48, 81], but also minimizes the potential errors introduced by hardware, thereby achieving a good trade-off between energy efficiency and accuracy. First, we choose time and current signals to minimize potential errors. Compared with analog voltage signals, analog current signals and digitally implemented time signals can tolerate larger errors caused by their loads, and analog time signals are less sensitive to noise and PVT variations [49]. Second, the adopted ALBs help improve the accuracy of the time inputs and Psums by increasing the driving ability of the loads. However, the larger the number of ALBs, the smaller the number of ReRAM crossbar arrays in a subChip, compromising the computational density. Based on system-level evaluations, we adopt one XsubBuf between each pair of neighboring ReRAM crossbar arrays and one PsubBuf between each ReRAM crossbar array and its Iadder to achieve a good trade-off between accuracy loss and computational density reduction. Third, we limit the number of cascaded XsubBufs in the horizontal direction to reduce the accumulated errors (including noise) of the time-domain inputs, so that they can be tolerated by the given design margin. We assign a design margin (more than 40 ps) for the unit delay (50 ps) of the DTC conversion. We do not cascade PsubBufs, to avoid introducing errors into the Psums.
TIMELY adopts pipeline designs to address the speed limit of time signal operations and thus improve throughput. Adjusting the number of ReRAM rows/columns shared by one DTC/TDC allows for the tradeoff between the throughput and computational density of TIMELY. TIMELY compensates for the increased area due to the special shifted weight duplication of OIR (see Fig. 7 (b) and (c)) by saving peripheral circuits’ area. Besides, TIMELY also replicates weights to improve computation parallelism and thus throughput, similar to prior designs [14, 58, 62].
VI Evaluation
In this section, we first introduce the experimental setup, and then compare TIMELY with state-of-the-art designs in terms of energy efficiency, computational density, and throughput. After that, we demonstrate the effectiveness of TIMELY's key features – ALB, TDI, and OIR – and show that these features are generalizable. Finally, we discuss area scaling.
VI-A Experimental Setup
TIMELY Configuration. For a fair comparison with PRIME/ISAAC, we adopt PRIME/ISAAC's parameters, including the ReRAM and ReLU parameters from PRIME [14], and the max-pool units (scaled up to 65 nm) and HyperTransport links from ISAAC [58] (see Table II). For TIMELY's specific components, we use silicon-verified results [52, 41] for the DTCs and TDCs, and adopt Cadence-simulated results for the XsubBuf, PsubBuf, Iadder, charging circuit, and comparator based on [70, 37, 35] – including their drivers and loads during simulation. The supporting digital units (shifter and adder) consume negligibly small amounts of area and energy. All the design parameters of the peripheral circuits are based on a commercial 65 nm CMOS process. The power supply is 1.2 V, and the clock rate is 40 MHz. The reset phase in Fig. 6 is 25 ns. The pipeline-cycle time is determined by the latency of the DTCs/TDCs (with the sharing factor set to 8), which is larger than that of the other pipelined operations. The latencies of reading the corresponding inputs, analog-domain computations, and writing outputs back to the output buffers are 16 ns [24], 150 ns [24], and 160 ns [24], respectively. In addition, the Iadders and their inputs do not contribute to the total area because we place the Iadders and the interconnection between each PsubBuf and Iadder under the charging capacitors and ReRAM crossbars, leveraging different IC layers. We adopt 106 subChips in the experiments for a fair comparison with the baselines (e.g., TIMELY vs. ISAAC: 91 mm² vs. 88 mm²).
Component  Params  Spec  Energy/compo.  Area/compo.
TIMELY subChip
DTC  resolution  8 bits  37.5  240
  number  1632
ReRAM crossbar  size  256×256  1792  100
  number  1612
  bits/cell  4
Charging+comparator  number  12256  41.7  40
TDC  resolution  8 bits  145  310
  number  1232
XsubBuf  number  1216256  0.62  5
PsubBuf  number  1512256  2.3  5
Iadder  number  12256  36.8  40
ReLU  number  2  205  300
MaxPool  number  1  330  240
Input buffer  size/number  2KB/1  12736  50
Output buffer  size/number  2KB/1  31039  50
Total (per subChip)  0.86 mm²
TIMELY chip (40 MHz)
subChip  number  106  0.86 mm²
Total (per chip)  91 mm²
Inter chips
Hyper link  links/freq  1/1.6 GHz  1620  5.7 mm²
  link bw  6.4 GB/s
TIMELY can be scaled at a granularity of 0.86 mm² by adjusting the number of subChips based on the application.
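As a sanity check on the timing parameters above, the pipeline-cycle time of TIMELY's intra-subChip pipeline is simply the latency of its slowest stage. The sketch below uses the stage latencies quoted in the setup; the group latency of the 8 shared DTCs/TDCs is a hypothetical stand-in value chosen only for illustration, since it is not listed explicitly.

```python
# Hedged sketch: pipeline-cycle time = latency of the slowest pipelined stage.
# Stage latencies (ns) are from the setup text; "dtc_tdc_group" is a
# hypothetical value for the 8 DTCs/TDCs sharing one pipeline stage.
STAGE_LATENCY_NS = {
    "read_inputs": 16,       # read corresponding inputs [24]
    "analog_compute": 150,   # analog-domain computations [24]
    "write_outputs": 160,    # write outputs back to output buffers [24]
    "dtc_tdc_group": 200,    # hypothetical latency of the 8 shared DTCs/TDCs
}

def pipeline_cycle_ns(stages):
    """Pipeline-cycle time is set by the slowest pipelined stage."""
    return max(stages.values())
```

Under this assumption the DTC/TDC stage dominates, which matches the text's statement that the DTC/TDC latency determines the pipeline-cycle time.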
Methodology. We first compare TIMELY with four state-of-the-art RPIM accelerators (PRIME [14], ISAAC [58], PipeLayer [62], and AtomLayer [56]) in terms of peak energy efficiency and computational density. For this set of experiments, the performance data of the baselines are the ones reported in their corresponding papers. Second, for the evaluation on various benchmarks, we consider only PRIME [14] and ISAAC [58] because (1) there is a lack of design details for obtaining results for PipeLayer [62] and AtomLayer [56], and (2) more importantly, such a comparison is sufficient given that PRIME [14] is the most competitive baseline in terms of energy efficiency (see Fig. 1 (c)). For this set of evaluations, we build an in-house simulator to evaluate the energy and throughput of PRIME, ISAAC, and TIMELY. Before using our simulator, we validate it against PRIME's simulator [14] and ISAAC's analytical calculations [58]: we set up our simulator to mimic PRIME and ISAAC and compare its results with their original ones. The resulting errors of the energy and throughput evaluations are 8% and zero, respectively, which are acceptable given TIMELY's one-order-of-magnitude improvement in energy efficiency (see Section VI-B). Due to the lack of ISAAC's mapping information, we validate the energy error only against PRIME's simulator, adopting PRIME's component parameters and weight-mapping strategy [14] in our simulator. Since PRIME does not support an inter-layer pipeline, we validate the throughput error only against ISAAC's analytical calculations, using ISAAC's component parameters and balanced inter-layer pipeline [58] in our simulator. The inter-layer pipeline corresponds to TIMELY's inter-subChip pipeline.
Benchmarks. We evaluate TIMELY using a total of 15 benchmarks. Table III shows these benchmarks and the reasons for adopting them.
Benchmarks  Why consider these CNN/DNN models
VGGD, CNN1, MLPL  For a fair comparison with PRIME (i.e., benchmarks in [14])
VGG1/2/3/4, MSRA1/2/3  For a fair comparison with ISAAC (i.e., benchmarks in [58])
ResNet18/50/101/152, SqueezeNet  To show TIMELY's performance on diverse and more recent CNNs
Datasets: ImageNet ILSVRC [17]; MNIST [18]
VI-B Evaluation Results
We evaluate TIMELY’s peak energy efficiency and computational density against those reported in [14], [58], [62], and [56]. Next, we perform an evaluation of TIMELY’s energy efficiency and throughput on various CNN and DNN models.
Design  Energy efficiency  TIMELY's improvement  Computational density  TIMELY's improvement
PRIME [14]  2.10  10.0×  1.23  31.2×
ISAAC [58]  0.38  18.2×  0.48  20.0×
PipeLayer [62]  0.14  49.3×  1.49  6.4×
AtomLayer [56]  0.68  10.1×  0.48  20.0×
TIMELY (8-bit)  21.00  n/a  38.33  n/a
TIMELY (16-bit)  6.90  n/a  9.58  n/a
Improvements over PRIME use the 8-bit TIMELY design (one operation: an 8-bit MAC); those over ISAAC, PipeLayer, and AtomLayer use the 16-bit design (one operation: a 16-bit MAC).
Overall Peak Performance. Compared with representative RPIM accelerators (see Table IV), TIMELY improves energy efficiency by over 10× (over PRIME [14]) and computational density by over 6.4× (over PipeLayer [62]). In particular, TIMELY improves energy efficiency by 10× to 49.3× and computational density by 6.4× to 31.2×. These large improvements result from TIMELY's innovative features of ALB, TDI, OIR, and intra-subChip pipelines, which aggressively reduce the energy cost of the dominant data movements and increase the number of operations within the same time and area. In Table IV, we ensure that TIMELY's precision is the same as that of each baseline for a fair comparison. Specifically, we consider an 8-bit TIMELY design when comparing with PRIME and a 16-bit TIMELY design when comparing to ISAAC, PipeLayer, and AtomLayer.
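The improvement ranges quoted above follow directly from the peak numbers reported in Table IV; a minimal recomputation (matching the 8-bit TIMELY design against PRIME and the 16-bit design against the other baselines):

```python
# Peak numbers from Table IV (units as reported there).
# 8-bit TIMELY is compared against PRIME; 16-bit TIMELY against the rest.
timely_8b = {"eff": 21.00, "density": 38.33}
timely_16b = {"eff": 6.90, "density": 9.58}
baselines = {
    "PRIME":     {"eff": 2.10, "density": 1.23, "timely": timely_8b},
    "ISAAC":     {"eff": 0.38, "density": 0.48, "timely": timely_16b},
    "PipeLayer": {"eff": 0.14, "density": 1.49, "timely": timely_16b},
    "AtomLayer": {"eff": 0.68, "density": 0.48, "timely": timely_16b},
}

def improvement(metric):
    """TIMELY's improvement ratio over each baseline for one metric."""
    return {name: round(b["timely"][metric] / b[metric], 1)
            for name, b in baselines.items()}

eff = improvement("eff")       # energy-efficiency improvements
den = improvement("density")   # computational-density improvements
```

The extremes reproduce the ranges in the text: 10× (PRIME) to 49.3× (PipeLayer) in energy efficiency, and 6.4× (PipeLayer) to 31.2× (PRIME) in computational density.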
Energy Efficiency on Various CNN and DNN Models. We evaluate TIMELY on various models (1 MLP and 13 CNNs) to validate that its superior performance generalizes to different computational and data-movement patterns. Fig. 8 (a) shows the normalized energy efficiency of TIMELY over PRIME and ISAAC. TIMELY outperforms both PRIME and ISAAC on all CNN and DNN models: it is on average 10× and 14.8× more energy efficient than PRIME and ISAAC, respectively (see the geometric mean in the rightmost part of Fig. 8 (a)). This set of experimental results demonstrates that TIMELY's superior energy efficiency is independent of the specific CNN or DNN, i.e., of the computational and data-movement patterns. In addition, as shown in Fig. 8 (a), TIMELY's energy-efficiency improvement shrinks on small or compact CNNs, such as CNN1 [14] and SqueezeNet [29], because their energy costs of data movements are relatively small: these models can be mapped into one ReRAM bank of PRIME or one ReRAM tile of ISAAC, and thus do not require high-cost memory accesses, limiting the energy savings achievable by TIMELY.
Throughput on Various CNNs. Fig. 8 (b) shows TIMELY's normalized throughput over PRIME and ISAAC on various CNNs (a total of 8 CNNs) under three chip configurations (16, 32, and 64 chips). As the throughput is a function of the weight-duplication ratio, we only consider CNNs for which PRIME or ISAAC provides the corresponding weight-duplication ratios. Compared to PRIME, TIMELY enhances the throughput by 736.6× for the 16-chip, 32-chip, and 64-chip configurations on VGGD. TIMELY's throughput advantage results from its intra-subChip pipeline, which minimizes the latency between two pipelined outputs. In addition, PRIME must serve in both the memory mode and the computation mode (i.e., accelerating CNNs), which limits the number of crossbars available for CNN computations on a chip (and thus its throughput) to over 20× fewer than TIMELY's (i.e., 1024 vs. 20352 crossbars; see the right corner of Fig. 8 (b)). Compared to ISAAC on 7 CNNs, TIMELY on average enhances the throughput by 2.1×, 2.4×, and 2.7× for the 16-chip, 32-chip, and 64-chip configurations, respectively. In Fig. 8 (b), we consider only the 64-chip (or the 32-chip and 64-chip) configurations for large CNNs, such as MSRA1/2/3, to ensure that all the models can be mapped onto one TIMELY or ISAAC accelerator.
TIMELY's enhanced throughput results from the fact that ISAAC adopts serial operations and requires 22 pipeline-cycles (each being 100 ns) to finish one 16-bit MAC operation, whereas TIMELY employs intra-subChip pipelines and needs only two pipeline-cycles (each being 200 ns).
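The per-MAC pipeline latency implied by these cycle counts can be checked directly. Note that the resulting 5.5× ratio is a latency bound, not the realized throughput gain: the measured speedups over ISAAC (2.1×–2.7× above) also depend on weight duplication and mapping.

```python
# Per-16-bit-MAC latency through the pipeline, from the cycle counts above.
isaac_ns = 22 * 100    # ISAAC: 22 pipeline-cycles of 100 ns each
timely_ns = 2 * 200    # TIMELY: 2 pipeline-cycles of 200 ns each

ratio = isaac_ns / timely_ns  # per-MAC latency ratio, ISAAC vs. TIMELY
```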
Accuracy. We observe a 0.1% inference accuracy loss under various CNN and DNN models in system-level simulations that include circuit-level errors extracted from Cadence simulation. The simulation methodology is adapted from prior work [33, 34]. Specifically, we first obtain the noise and PVT variations (by Monte-Carlo simulations in Cadence) of the XsubBuf, PsubBuf, Iadder, DTC, and TDC. The errors follow a Gaussian distribution. We then add equivalent noise during training and use the trained weights for inference. Note that prior work has shown that adding Gaussian noise during training incurs negligible accuracy loss [54, 53, 57]. To achieve the 0.1% accuracy loss, we set the number of cascaded XsubBufs to 12. The accumulated error of the cascaded XsubBufs grows with the potential error of one XsubBuf [20]; it is less than 202 ps, which can be tolerated by the design margin of 402 ps and thus does not cause a loss of inference accuracy.
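The noise-injection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual training setup: the layer sizes and the noise scale sigma are placeholders, not the values extracted from the Cadence simulations.

```python
import numpy as np

def noisy_forward(w, x, sigma, rng):
    """Matrix-vector product with a Gaussian weight perturbation, mimicking
    circuit-level errors (XsubBuf/PsubBuf/Iadder/DTC/TDC) during training.
    sigma is an illustrative noise scale, not a measured value."""
    w_noisy = w + rng.normal(0.0, sigma, size=w.shape)
    return w_noisy @ x

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64))   # illustrative layer weights
x = rng.normal(size=64)         # illustrative input vector

# During training, each forward pass sees a fresh noise sample, so the
# learned weights become robust to the hardware's Gaussian error model;
# inference then uses the trained weights on the (noisy) hardware.
y = noisy_forward(w, x, sigma=0.01, rng=rng)
```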
VI-C Effectiveness of TIMELY's Innovations
We first validate the effectiveness of TIMELY's innovations on energy saving and area reduction, and then demonstrate that TIMELY's innovative principles can be generalized to state-of-the-art RPIM accelerators to further improve their energy efficiency.
Effectiveness of TIMELY's Innovations on Energy Savings. Here we present an energy breakdown analysis to demonstrate how TIMELY reduces the energy consumption on VGGD as compared with PRIME, which is the most competitive RPIM accelerator in terms of energy efficiency. In Fig. 8 (a), we can see that TIMELY improves the energy efficiency by 15.6× as compared to PRIME.
Overview. We first show the breakdown of the energy savings achieved by TIMELY's different features. ALB and OIR contribute up to 99% of the energy savings, and TDI accounts for the remaining 1% (see Fig. 9 (a)).
Effectiveness of TIMELY’s ALB and OIR. We compare TIMELY’s energy breakdown with regard to both memory types and data types with those of PRIME in Fig. 9 (c) and (d), respectively. In Fig. 9 (c), TIMELY’s ALB and OIR together reduce the energy consumption of memory accesses by 93% when compared with PRIME. Specifically, the ALB and OIR features enable TIMELY to fully exploit local buffers within its subChips for minimizing accesses to the L1 memory and removing the need to access an L2 memory.
In Fig. 9 (d), TIMELY reduces the energy consumption associated with the data movements of Psums, inputs, and outputs by 99.9%, 95.8%, and 87.1%, respectively. The contributing factors are summarized in Fig. 9 (e). Specifically, (1) TIMELY handles most of the Psums locally via the PsubBufs within the subChips, aggressively reducing the energy cost of Psum movements; (2) TIMELY's OIR feature ensures all input data are fetched only once from the L1 memory, while its ALB feature (i.e., the XsubBufs here) allows the fetched inputs to be stored in and transferred between the crossbars via the XsubBufs; and (3) thanks to the employed PsubBufs and XsubBufs, TIMELY removes the need for an L2 memory, which has higher read/write energy than an L1 memory, reducing the energy cost of writing outputs back to memory (the L1 memory in TIMELY vs. the L2 memory in PRIME). Furthermore, as another view of the effectiveness of TIMELY's OIR feature, we summarize both PRIME's and TIMELY's total numbers of input accesses to the L1 memory in Table V (taking the first six CONV layers as examples). TIMELY requires about 88.9% fewer L1 memory accesses.
CONV1  CONV2  CONV3  CONV4  CONV5  CONV6  
PRIME [14]  1.35 M  28.90 M  7.23 M  14.45 M  3.61 M  7.23 M 
TIMELY  0.15 M  3.21 M  0.80 M  1.61 M  0.40 M  0.80 M 
Save by  88.9%  88.9%  88.9%  88.9%  88.9%  88.9% 
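The 88.9% figure in Table V is simply the relative reduction in per-layer input accesses; recomputing it from the table's counts (in millions) confirms that every layer saves the same fraction:

```python
# Input accesses to the L1 memory (in millions), from Table V.
prime  = {"CONV1": 1.35, "CONV2": 28.90, "CONV3": 7.23,
          "CONV4": 14.45, "CONV5": 3.61, "CONV6": 7.23}
timely = {"CONV1": 0.15, "CONV2": 3.21, "CONV3": 0.80,
          "CONV4": 1.61, "CONV5": 0.40, "CONV6": 0.80}

# Percentage of accesses saved per layer, rounded to one decimal place.
saving = {layer: round(100 * (1 - timely[layer] / prime[layer]), 1)
          for layer in prime}
```

The uniform ~88.9% saving across layers reflects OIR fetching each input only once from the L1 memory.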
Effectiveness of TIMELY's TDI. Although the DTCs and TDCs contribute only 1% of TIMELY's energy savings over PRIME, the total energy of the DTCs and TDCs in TIMELY is 99.6% less than that of the ADCs and DACs in PRIME (see Fig. 9 (b)). This is because (1) the unit energy of one DTC/TDC is about 30%/23% of that of a DAC/ADC; (2) the increased analog data locality due to the ALBs largely reduces the need to activate DTCs and TDCs; and (3) TIMELY's OIR feature aggressively reduces the required DTC conversions thanks to its much-reduced input accesses to the L1 memory (see Table V).
Effectiveness of TIMELY's Innovations on Area Reduction. We analyze an area breakdown to show the effectiveness of TIMELY's innovations in saving the area of peripheral circuits, which helps improve the computational density (see Table IV). In Fig. 10 (a), the percentage of the ReRAM array area in TIMELY (i.e., 2.2%) is 5.5× higher than that in ISAAC (i.e., 0.4%) [58]. The percentage of the ReRAM array area in PRIME is small enough to be ignored [14]. TIMELY's higher percentage of ReRAM crossbar array area results from the area-efficient circuit implementations of its ALB, TDI, and OIR. Specifically, as shown in Fig. 10 (b), the XsubBufs and PsubBufs occupy 55.2% of TIMELY's chip area; the DTCs and TDCs occupy 28%; and the area of the CMOS logic introduced by OIR is negligible.
Generalization of TIMELY's Innovations. TIMELY's innovative features are generalizable and can be applied to state-of-the-art RPIM accelerators to boost their energy efficiency. To demonstrate this, we apply ALB and OIR to PRIME based on the following considerations: the ALB feature, together with OIR, contributes the dominant energy savings (see Fig. 9 (a)), and, from the perspective of data accesses and interfaces, PRIME uses the same architecture shown in Fig. 5 (a) as ISAAC [58]/PipeLayer [62]. For the evaluation, we modify the PRIME architecture as shown in Fig. 11 (a): we add XsubBufs and PsubBufs between the 128 ReRAM crossbar arrays in the FF subarray of each bank, and modify the weight mapping and input-access dataflow based on OIR, while employing PRIME's original designs outside the FF subarray. Thus, ALB and OIR only affect the intra-bank energy. In this experiment, we adopt the same component parameters as those used in PRIME's original design. Fig. 11 (b) shows that applying the ALB and OIR principles to the FF subarrays in PRIME reduces the intra-bank data-movement energy by 68%.
VI-D Discussion
Area scaling of TIMELY (by adjusting the number of subChips shown in Table II) does not affect throughput and only slightly affects energy. This is because the throughput is determined only by the intra-subChip pipeline (see Section IV-E), and adjusting the number of subChips in one chip only changes the inter-chip energy (i.e., the energy of the L3 memory in Fig. 9 (c)), which accounts for a negligible part of the total energy.
VII Related Work
Non-PIM CNN/DNN Accelerators. Although memory is only used for data storage in non-PIM accelerators, computing units are being pushed closer to compact memories to reduce energy and area. For accelerators with off-chip DRAM, DRAM accesses consume two orders of magnitude more energy than on-chip memory accesses (e.g., 130× higher than a 32 KB cache at 45 nm [27]). As a result, DRAM consumes more than 95% of the total energy in DianNao [13, 19]. To break through this off-chip bottleneck, on-chip SRAM is widely used as the mainstream on-chip memory solution [25]. However, SRAM's low density has been limiting its on-die capacity even with technology scaling. For example, EIE adopts 10 MB of SRAM that takes 93.2% of the total area [25]. To address the area issue, on-chip eDRAM is used in RANA [64] and DaDianNao [11], as eDRAM can save about 74% area while providing 32 KB capacity in 65 nm [64, 55]. However, the refresh energy in eDRAM can be dominant (e.g., about 10× as high as the data-access energy [64]). In terms of FPGA-based designs, the performance is also limited by memory accesses [21, 55, 23, 59], with limited flexibility in choosing memory technologies. Different from these non-PIM accelerators, TIMELY improves energy efficiency by computing in memory and enhances computational density through adopting high-density ReRAM.
PIM CNN/DNN Accelerators. While PIM accelerators integrate computing units in memory to save the energy of accessing weights, the achievable energy efficiency and computational density remain limited. The limited energy efficiency stems from the energy cost of input and Psum movements and the overhead of interfacing circuits: PRIME [14] spends 83% of its total energy on accessing inputs and Psums, and ISAAC [58] spends 61% of its total energy operating DACs/ADCs. The limited computational density is related to memory technologies. Processing in SRAM, for example, faces this limitation: not only does one SRAM bitcell typically store only a 1-bit weight [6, 23, 36, 75, 60], but SRAM's bitcell structure – e.g., the 6T [23, 36]/8T [75, 60]/10T [6] structures – also limits density. The proposed TIMELY adopts high-density ReRAM, and addresses the two key energy challenges with techniques including ALBs, TDIs, and OIR. TIMELY achieves up to an 18.2× improvement (over ISAAC) in energy efficiency, a 31.2× improvement (over PRIME) in computational density, and a 736.6× improvement (over PRIME) in throughput. Similar to the effect of the ALBs used in TIMELY, a recent RPIM accelerator [15] also increases the amount of data in the analog domain for energy optimization. However, it only optimizes the computation energy (including the energy of interfacing circuits).
VIII Conclusions
In this paper, we analyze existing designs of RPIM accelerators and identify three opportunities to greatly enhance their energy efficiency: analog data locality, time-domain interfacing, and input access reduction. These three opportunities inspire the three key features of TIMELY: (1) ALBs, (2) interfacing with TDCs/DTCs, and (3) an OIR mapping method. TIMELY outperforms the state-of-the-art in both energy efficiency and computational density while maintaining better throughput.
Acknowledgment
The authors would like to thank Dr. Rajeev Balasubramonian and Dr. Anirban Nag for their discussions on ISAAC [58].
References
 [1] “NVIDIA Deep Learning Accelerator (NVDLA).” [Online]. Available: http://nvdla.org/primer.html
 [2] A. Kaur and R. Pandey, "Current mode computational circuits for analog signal processing," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 3, no. 4, 2014.

 [3] A. Amravati, S. B. Nasir, S. Thangadurai, I. Yoon, and A. Raychowdhury, "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.
 [4] A. Amravati, S. B. Nasir, S. Thangadurai, I. Yoon, and A. Raychowdhury, "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

 [5] M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov, "Energy-efficient time-domain vector-by-matrix multiplier for neurocomputing and beyond," IEEE Transactions on Circuits and Systems II: Express Briefs, 2019.
 [6] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.
 [7] I. Bratt, "Arm's First-Generation Machine Learning Processor," Hot Chips, 2018. [Online]. Available: https://www.hotchips.org/hc30/2conf/2.07_ARM_ML_Processor_HC30_ARM_2018_08_17.pdf
 [8] N. Cao, M. Chang, and A. Raychowdhury, “14.1 a 65nm 1.1to9.1tops/w hybriddigitalmixedsignal computing platform for accelerating modelbased and modelfree swarm robotics,” in 2019 IEEE International Solid State Circuits Conference  (ISSCC), 2019.
 [9] L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, “Acceleratorfriendly neuralnetwork training: Learning variations and defects in rram crossbar,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2017, March 2017.

 [10] Y. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
 [11] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
 [12] Z. Chen and J. Gu, “19.7 a scalable pipelined timedomain dtw engine for timeseries classification using multibit time flipflops with 140gigacellupdates/s throughput,” in 2019 IEEE International SolidState Circuits Conference  (ISSCC), 2019.
 [13] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
 [14] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processinginmemory architecture for neural network computation in rerambased main memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
 [15] T. Chou, W. Tang, J. Botimer, and Z. Zhang, “Cascade: Connecting rrams to extend analog dataflow in an endtoend inmemory processing paradigm,” in Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: ACM, 2019, pp. 114–125. [Online]. Available: http://doi.acm.org/10.1145/3352460.3358328
 [16] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai, Y. Oowaki, and Y. Unekawa, "A 10.4 pJ/b (32, 8) LDPC decoder with time-domain analog and digital mixed-signal processing," in 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers, 2013, pp. 420–421.
 [17] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, “Imagenet: A largescale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009.

 [18] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, 2012.
 [19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.
 [20] P. Dudek, S. Szczepanski, and J. V. Hatfield, “A highresolution cmos timetodigital converter utilizing a vernier delay line,” IEEE Journal of SolidState Circuits, vol. 35, no. 2, pp. 240–247, Feb 2000.
 [21] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth fpga accelerator for sparse matrixvector multiplication,” in 2014 IEEE 22nd Annual International Symposium on FieldProgrammable Custom Computing Machines, 2014.
 [22] B. P. Ginsburg and A. P. Chandrakasan, “500ms/s 5bit adc in 65nm cmos with split capacitor array dac,” IEEE Journal of SolidState Circuits, vol. 42, no. 4, April 2007.
 [23] S. K. Gonugondla, M. Kang, and N. Shanbhag, “A 42pj/decision 3.12tops/w robust inmemory machine learning classifier with onchip training,” in 2018 IEEE International Solid  State Circuits Conference  (ISSCC), Feb 2018.
 [24] A. Grossi, E. Vianello, M. M. Sabry, M. Barlas, L. Grenouillet, J. Coignus, E. Beigne, T. Wu, B. Q. Le, M. K. Wootters, C. Zambelli, E. Nowak, and S. Mitra, “Resistive ram endurance: Arraylevel characterization and correction techniques targeting deep learning applications,” IEEE Transactions on Electron Devices, vol. PP, 2019.
 [25] S. Han, X. Liu, H. Mao et al., "EIE: Efficient inference engine on compressed deep neural network," in ACM International Symposium on Computer Architecture (ISCA), 2016.
 [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [27] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE International SolidState Circuits Conference Digest of Technical Papers (ISSCC), 2014.
 [28] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams, “Dotproduct engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrixvector multiplication,” in Proceedings of the 53rd annual design automation conference (DAC), 2016.
 [29] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and 0.5 mb model size,” arXiv preprint arXiv:1602.07360, 2016.
 [30] P. Jain, U. Arslan, M. Sekhar, B. C. Lin, L. Wei, T. Sahu, J. Alzatevinasco, A. Vangapaty, M. Meterelliyoz, N. Strutt, A. B. Chen, P. Hentges, P. A. Quintero, C. Connor, O. Golonzka, K. Fischer, and F. Hamzaoglu, “13.2 a 3.6mb 10.1mb/mm2 embedded nonvolatile reram macro in 22nm finfet technology with adaptive forming/set/reset schemes yielding down to 0.5v with sensing time of 5ns at 0.7v,” in 2019 IEEE International Solid State Circuits Conference  (ISSCC), Feb 2019.
 [31] H. Kaiming, Z. Xiangyu, R. Shaoqing, and S. Jian, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ser. ICCV ’15, 2015.
 [32] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
 [33] M. Kang, S. K. Gonugondla, M.S. Keel, and N. R. Shanbhag, “An energyefficient memorybased highthroughput vlsi architecture for convolutional networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1037–1041.
 [34] M. Kang, M.S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energyefficient vlsi architecture for pattern recognition via deep embedding of computation in sram,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 8326–8330.
 [35] A. Kaur and R. Pandey, “Current mode computational circuits for analog signal processing,” International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2014.
 [36] W. Khwa, J. Chen, J. Li, X. Si, E. Yang, X. Sun, R. Liu, P. Chen, Q. Li, S. Yu, and M. Chang, “A 65nm 4kb algorithmdependent computinginmemory sram unitmacro with 2.3ns and 55.8tops/w fully parallel productsum operation for binary dnn edge processors,” in 2018 IEEE International Solid  State Circuits Conference  (ISSCC), Feb 2018.
 [37] O. Krestinskaya, I. Fedorova, and A. P. James, “Memristor load current mirror circuit,” in 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2015.
 [38] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, “A 3.1 mw 8b 1.2 gs/s singlechannel asynchronous sar adc with alternate comparators for enhanced speed in 32 nm digital soi cmos,” IEEE Journal of SolidState Circuits, 2013.
 [39] C. Lee, H. Lin, C. Lien, Y. Chih, and J. Chang, “A 1.4mb 40nm embedded reram macro with 0.07um2 bit cell, 2.7ma/100mhz lowpower read and hybrid write verify for high endurance application,” in 2017 IEEE Asian SolidState Circuits Conference (ASSCC), Nov 2017.
 [40] S. Levantino, G. Marzin, and C. Samori, “An adaptive predistortion technique to mitigate the dtc nonlinearity in digital plls,” IEEE Journal of SolidState Circuits, 2014.
 [41] M. Li, C. Yang, and Y. Ueng, “A 5.28gb/s ldpc decoder with timedomain signal processing for ieee 802.15.3c applications,” IEEE Journal of SolidState Circuits, 2017.
 [42] S. Li, A. O. Glova, X. Hu, P. Gu, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Scope: A stochastic computing engine for drambased insitu accelerator,” in Proceedings of the 51th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO51 ’18, 2018.
 [43] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A drambased reconfigurable insitu accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO50 ’17, 2017.
 [44] Z. Li, Y. Chen, L. Gong, L. Liu, D. Sylvester, D. Blaauw, and H. Kim, “An 879GOPS 243mw 80fps VGA fully visual cnnslam processor for widerange autonomous exploration,” in 2019 IEEE International Solid State Circuits Conference  (ISSCC), 2019, pp. 134–136.
 [45] Z. Li, J. Wang, D. Sylvester, D. Blaauw, and H. S. Kim, “A 1920 1080 25Frames/s 2.4TOPS/W lowpower 6D vision processor for unified optical flow and stereo depth with semiglobal matching,” IEEE Journal of SolidState Circuits, vol. 54, no. 4, pp. 1048–1058, 2019.
 [46] Y. Lin, S. Zhang, and N. Shanbhag, “VariationTolerant Architectures for Convolutional Neural Networks in the Near Threshold Voltage Regime,” in 2016 IEEE International Workshop on Signal Processing Systems (SiPS), Oct 2016, pp. 17–22.
 [47] C. Liu, S. Chang, G. Huang, and Y. Lin, “A 10bit 50ms/s sar adc with a monotonic capacitor switching procedure,” IEEE Journal of SolidState Circuits, vol. 45, no. 4, 2010.
 [48] C. Liu, M. Hu, J. P. Strachan, and H. Li, “Rescuing memristorbased neuromorphic design with high defects,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), June 2017.
 [49] M. Liu, L. R. Everson, and C. H. Kim, “A scalable timebased integrateandfire neuromorphic core with braininspired leak and local lateral inhibition capabilities,” in 2017 IEEE Custom Integrated Circuits Conference (CICC), 2017.
 [50] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, and J. Du, “Ondemand deep model compression for mobile devices: A usagedriven model selection framework,” in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2018, pp. 389–400.

 [51] A. Mantyniemi, T. Rahkonen, and J. Kostamovaara, "A CMOS time-to-digital converter (TDC) based on a cyclic time domain successive approximation interpolation method," IEEE Journal of Solid-State Circuits, 2009.
 [52] D. Miyashita, R. Yamaki, K. Hashiyoshi, H. Kobayashi, S. Kousai, Y. Oowaki, and Y. Unekawa, "An LDPC decoder with time-domain analog and digital mixed-signal processing," IEEE Journal of Solid-State Circuits, 2014.
 [53] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018.
 [54] A. Polino, R. Pascanu, and D. Alistarh, “Model compression via distillation and quantization,” arXiv preprint arXiv:1802.05668, 2018.
 [55] M. Poremba, S. Mittal, D. Li, J. S. Vetter, and Y. Xie, “Destiny: A tool for modeling emerging 3d nvm and edram caches,” in 2015 Design, Automation Test in Europe Conference Exhibition (DATE), 2015.
 [56] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, “Atomlayer: A universal rerambased cnn accelerator with atomic layer computation,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018.

 [57] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How does batch normalization help optimization?" in Advances in Neural Information Processing Systems, 2018, pp. 2483–2493.
 [58] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
 [59] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From highlevel deep neural models to fpgas,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
 [60] X. Si, J. Chen, Y. Tu, W. Huang, J. Wang, Y. Chiu, W. Wei, S. Wu, X. Sun, R. Liu, S. Yu, R. Liu, C. Hsieh, K. Tang, Q. Li, and M. Chang, “24.5 a twin8t sram computationinmemory macro for multiplebit cnnbased machine learning,” in 2019 IEEE International Solid State Circuits Conference  (ISSCC).
 [61] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” Proc. ICLR, 2015.
 [62] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined rerambased accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017.
 [63] W. Tseng, J. Wu, and Y. Chu, “A cmos 8bit 1.6gs/s dac with digital random returntozero,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 1, Jan 2011.
 [64] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, “Rana: Towards efficient neural acceleration with refreshoptimized embedded dram,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.
 [65] I. Valov, R. Waser, J. R. Jameson, and M. N. Kozicki, “Electrochemical metallization memories – fundamentals, applications, prospects,” Nanotechnology, vol. 22, 2011.
 [66] H. Wang, Y. Zhao, C. Li, Y. Wang, and Y. Lin, “A new MRAM-based process-in-memory accelerator for efficient neural network training with floating point precision,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), May 2020, pp. 1–4.
 [67] Y. Wang, Z. Jiang, X. Chen, P. Xu, Y. Zhao, Y. Lin, and Z. Wang, “E2-Train: Training state-of-the-art CNNs with over 80% energy savings,” in Advances in Neural Information Processing Systems, 2019, pp. 5139–5151.
 [68] Y. Wang, J. Shen, T.-K. Hu, P. Xu, T. Nguyen, R. Baraniuk, Z. Wang, and Y. Lin, “Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference,” IEEE Journal of Selected Topics in Signal Processing, 2019.
 [69] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-based resistive switching memories – nanoionic mechanisms, prospects, and challenges,” Advanced Materials, vol. 21, 2009.
 [70] L. Weitao, L. Fule, and W. Zhihua, “High-resolution and high-speed integrated CMOS AD converters for low-power applications,” Springer, 2018.
 [71] H.-S. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and M. Tsai, “Metal-oxide RRAM,” Proceedings of the IEEE, vol. 100, no. 6, 2012.
 [72] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, and Y. Lin, “Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions,” arXiv preprint arXiv:1806.09228, 2018.
 [73] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, “AutoDNNchip: An automated DNN chip predictor and builder for both FPGAs and ASICs,” The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb 2020. [Online]. Available: http://dx.doi.org/10.1145/3373087.3375306
 [74] H. Yang, Y. Zhu, and J. Liu, “End-to-end learning of energy-constrained deep neural networks,” arXiv e-prints, 2018.
 [75] J. Yang, Y. Kong, Z. Wang, Y. Liu, B. Wang, S. Yin, and L. Shi, “24.4 Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.
 [76] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” arXiv preprint, 2017.
 [77] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.-S. P. Wong, and H. Qian, “Face classification using electronic synapses,” Nature Communications, 2017.
 [78] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, “Fully hardware-implemented memristor convolutional neural network,” Nature, vol. 577, no. 7792, 2020.
 [79] S. Yu, “Neuro-inspired computing with emerging nonvolatile memory,” Proceedings of the IEEE, 2018.
 [80] C. Zhang, J. Gu, L. Gao, T. Ouyang, and B. Wang, “Time-domain computing circuits for addition and multiplication computation,” in 2017 International Conference on Electron Devices and Solid-State Circuits (EDSSC), 2017.
 [81] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6T SRAM array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
 [82] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
 [83] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs,” in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 56.
 [84] Y. Zhu, C. Chan, U. Chio, S. Sin, S. U, R. P. Martins, and F. Maloberti, “A 10-bit 100 MS/s reference-free SAR ADC in 90 nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 45, no. 6, 2010.