I. Introduction
Accelerators are in vogue today, primarily because it is evident that annual performance improvements can be sustained via specialization. Many emerging applications also demand high-throughput, low-energy hardware, such as the machine learning tasks that are becoming commonplace in enterprise servers, self-driving cars, and mobile devices. The last two years have seen a flurry of activity in designing machine learning accelerators [5, 7, 9, 19, 8, 30, 27, 24, 16]. Similar to our work, most of these recent works have focused on inference in artificial neural networks, and specifically deep convolutional networks, that achieve state-of-the-art accuracies on challenging image classification workloads. While most of these recent accelerators have used digital architectures [5, 7], a few have leveraged analog acceleration on memristor crossbars [26, 8, 4]. Such accelerators take advantage of in-situ computation to dramatically reduce data movement costs. Each crossbar is assigned to execute part of the neural network computation and is programmed with the corresponding weight values. Input neuron values are fed to the crossbar, and by leveraging Kirchhoff's Law, the crossbar outputs the corresponding dot product. The neuron output undergoes analog-to-digital conversion (ADC) before being sent to the next layer. Multiple small-scale prototypes of this approach have also been demonstrated [1, 21].
The design constraints for digital accelerators are very different from those of their analog counterparts. High communication overhead and the memory bottleneck are still first-order design constraints in digital designs, whereas the computation overhead arising from analog-to-digital and digital-to-analog conversions, and balancing the extent of digital computation in an analog architecture, are more critical in analog accelerators. In this work, we show that computation is a critical problem in analog designs and leverage numeric algorithms to reduce conversion overheads. Once we improve the efficiency of computation, the next major overhead comes from communication and storage. Towards this end, we discuss mapping techniques and buffer management strategies to further improve analog accelerator efficiency.
With these innovations in place, our new design, Newton, moves the analog architecture closer to the bare minimum energy required to process one neuron. We define an ideal neuron as one that keeps the weight in place adjacent to a digital ALU, retrieves the input from an adjacent single-row eDRAM unit, and after performing one digital operation, writes the result to another adjacent single-row eDRAM unit. This energy is lower than that for a similarly ideal analog neuron because of the ADC cost. This ideal neuron operation consumes 0.33 pJ. An average DaDianNao operation consumes 3.5 pJ because it pays a high price in data movement for inputs and weights. ISAAC [26] is a state-of-the-art analog design that achieves an order of magnitude better performance than digital accelerators such as DaDianNao. An average ISAAC operation consumes 1.8 pJ because it pays a moderate price in data movement for inputs (weights are in-situ) and a high price for ADC. An average Eyeriss [6] operation consumes 1.67 pJ because of an improved dataflow that maximizes reuse. The innovations in Newton push the analog architecture closer to the ideal neuron by consuming 0.85 pJ per operation. Relative to ISAAC, Newton achieves a 77% decrease in power, a 51% decrease in energy, and an increase in throughput/area.
II. Background
II-A. Workloads
We consider different CNNs presented in the ILSVRC image classification challenge for the ImageNet [25] dataset. The suite of benchmarks considered in this paper is representative of the various dataflows in such image classification networks. For example, Alexnet is the simplest of CNNs with a reasonable accuracy, where a few convolutional layers at the start extract features from the image, followed by fully-connected layers that classify the image. The other networks were designed with a similar structure but made deeper and wider with more parameters. For example, MSRA PReLU-net [13] has 14 more layers than Alexnet [17] and has 330 million parameters, which is 5.5× higher than Alexnet. On the other hand, residual nets have forward connections with hops, i.e., the output of a layer is passed on to not only the next layer but also subsequent layers. Even though the number of parameters in Resnets [12] is much lower, these networks are much deeper and have a different dataflow, which changes the buffering requirements in accelerator pipelines.

II-B. The Landscape of CNN Accelerators
Digital Accelerators. The DianNao [5] and DaDianNao [7] accelerators were among the first to target deep convolutional networks at scale. DianNao designs the digital circuits for a basic NFU (Neural Functional Unit). DaDianNao is a tiled architecture where each tile has an NFU and eDRAM banks that feed synaptic weights to that NFU. DaDianNao uses many tiles on many chips to parallelize the processing of a single network layer. Once that layer is processed, all the tiles move on to processing the next layer in parallel. Recent papers, e.g., Cnvlutin [2], have modified DaDianNao so the NFU does not waste time and energy processing zero-valued inputs. EIE [27] and Minerva [24] address sparsity in the weights. Eyeriss [6] and ShiDianNao [9] improve the NFU dataflow to maximize operand reuse. A number of other digital designs [16, 20, 10] have also emerged in the past year.
Analog Accelerators. Two CNN accelerators introduced in the past year, ISAAC [26] and PRIME [8], have leveraged memristor crossbars to perform dot-product operations in the analog domain. We will focus on ISAAC here because it outperforms PRIME in terms of throughput, accuracy, and ability to handle signed values. ISAAC is also able to achieve nearly 8× and 5× higher throughput than the digital accelerators DaDianNao and Cnvlutin, respectively.
II-C. ISAAC
Pipeline of Memristive Crossbars. In ISAAC, memristive crossbar arrays are used to perform analog dot-product operations. Neuron inputs are provided as voltages to wordlines; neuron weights are represented by pre-programmed cell conductances; neuron outputs are represented by the currents in each bitline. The neuron outputs are processed by an ADC and shift-and-add circuits. They are then sent as inputs to the next layer of neurons. As shown in Figure 1, ISAAC is a tiled architecture; one or more tiles are dedicated to process one layer of the neural network. To perform inference for one input image, neuron outputs are propagated from tile to tile until all network layers have been processed.
Tiles, IMAs, Crossbars. An ISAAC chip consists of many tiles connected in a mesh topology (Figure 1). Each tile includes an eDRAM buffer that supplies inputs to In-situ Multiply Accumulate (IMA) units. The IMA units consist of memristor crossbars that perform the dot-product computation, ADCs, and shift-and-add circuits that accumulate the digitized results. With a design space exploration, the tile is provisioned with an optimal number of IMAs, crossbars, ADCs, etc. Within a crossbar, a 16-bit weight is stored 2 bits per cell, across 8 columns. A 16-bit input is supplied as voltages over 16 cycles, 1 bit per cycle, using a trivial DAC array. The partial outputs are shifted and added across the 8 columns and across the 16 cycles to yield the output of each MAC operation. Thus, there are two levels of pipelining in ISAAC: (i) the intra-tile pipeline, where inputs are read from eDRAM, processed by crossbars in 16 cycles, and aggregated, and (ii) the inter-tile pipeline, where neuron outputs are transferred from one layer to the next. The intra-tile pipeline has a cycle time of 100 ns, matching the latency for a crossbar read. Inputs are sent to a crossbar in an IMA using an input H-tree network. The input H-tree has sufficient bandwidth to keep all crossbars active without bubbles. Each crossbar has a dedicated ADC operating at 1.28 GSps, shared across its 128 bitlines, to convert the analog output to digital in 100 ns. An H-tree network is then used to collect digitized outputs from the crossbars.
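The bit-serial scheme above (16-bit weights sliced 2 bits per cell across 8 crossbars, inputs streamed 1 bit per cycle over 16 cycles) can be checked with a minimal functional model. This is our own sketch, assuming unsigned 16-bit operands; the function name is ours, not from the paper:

```python
import numpy as np

def crossbar_dot(inputs, weights, cell_bits=2, n_xbars=8, in_bits=16):
    """Bit-serial crossbar dot product: slice weights across crossbars,
    stream inputs one bit per cycle, and shift-and-add partial sums."""
    acc = 0
    for cycle in range(in_bits):                         # 1-bit DAC per cycle
        in_bit = (inputs >> cycle) & 1
        for xb in range(n_xbars):                        # one 2-bit slice per crossbar
            cell = (weights >> (cell_bits * xb)) & ((1 << cell_bits) - 1)
            col_sum = int(np.dot(in_bit, cell))          # analog column sum -> ADC
            acc += col_sum << (cell_bits * xb + cycle)   # shift-and-add recombination
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 1 << 16, size=128)
w = rng.integers(0, 1 << 16, size=128)
assert crossbar_dot(x, w) == int(np.dot(x, w))
```

Summing the shifted column sums over all 8 slices and 16 cycles reconstructs the exact 16-bit × 16-bit dot product, which is what the shift-and-add circuits in the IMA do in hardware.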
Crossbar Challenges. As with any new technology, a memristor crossbar has unique challenges, mainly in two respects. First, mapping a matrix onto a memristor crossbar array requires programming (or writing) cells with the highest precision possible. Second, real circuits deviate from ideal operation due to parasitics such as wire resistance, device variation, and write/read noise. All of these factors can cause the actual output to deviate from its ideal value. Recent work [14] has captured many of these details to show the viability of prototypes [1]. The Appendix summarizes some of these details.
III. Proposal
The design constraints for digital accelerators are very different from those of their analog counterparts. In any digital design, the overhead of communication, arising from the need to fetch both input features and weights from memory, is the major limiting factor. As a result, most optimizations focus on improving memory bandwidth (e.g., HBM or GDDR), memory utilization (compression, zero-value elimination, etc.), and scheduling (e.g., batching) to improve communication efficiency and performance. Techniques that primarily target improving digital computation should carefully consider their impact on additional on-chip storage and communication overheads, which can negatively affect overall efficiency.
In analog designs, because computation is performed in-situ, only one of the operands needs to be transferred, which reduces the communication overhead by at least 2×. Furthermore, the transferred value (the input vector in the form of crossbar row voltages) is streamed across the entire crossbar (the matrix values), guaranteeing high reuse and locality. The compute density of analog in-situ units is also better than that of digital accelerators. Because analog crossbars store the neural network weights, they act as on-chip storage even when they are not performing computation. Digital compute units, in contrast, need high utilization to maximize performance; otherwise, their area is better spent on more on-chip storage. Both these factors give analog accelerators more flexibility to explore computational optimizations at the expense of either more communication or more crossbar storage.
In a digital design, the datapath size and its overhead are predetermined. A 16-bit datapath operated with 12-bit values will achieve only a marginal reduction in overhead, as pipeline buffers and wire repeaters switch every cycle. However, because analog computation is performed at the bit level (1- or 2-bit computations in each bitline), reducing the operand size, say, from 16 bits to 12 bits will correspondingly reduce ADC and DAC usage, leading to better efficiency. Note that even though an analog architecture consists of both digital and analog computations, the analog overhead dominates, accounting for 61% of the total power [26].
We first take a closer look at a simple dot product performed using crossbars. Consider a 1×128 vector being multiplied with a 128×128 matrix (all values are 16 bits). Figure 2 shows the energy breakdown of the vector-matrix multiplication pipeline compared against digital designs for various architectures. To model the analog overhead, we consider 2-bit cells, 1-bit DACs, and 16-bit values interleaved across eight crossbars. In a single iteration, a crossbar column performs a dot product involving 128 rows, 1-bit inputs, and 2-bit cells; it therefore produces a 9-bit result requiring a 9-bit ADC (prior work, ISAAC, has shown that simple data encoding schemes can reduce the ADC resolution by 1 bit [26]).
We must shift and add the results of eight such columns, yielding a 23-bit result. These results must also be shifted and added across 16 iterations, finally yielding a 39-bit output. A scaling factor is then applied to convert the 39-bit result to a 16-bit output. As the figure shows, communication and memory accesses are the major limiting factors for digital architectures, whereas for analog architectures, the computation overhead, primarily arising from the ADC, dominates.
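The bit-width growth above can be verified with a short calculation, assuming the 128-row crossbar, 1-bit DAC, and 2-bit cell configuration from the example:

```python
import math

# One column: 128 rows x (1-bit input) x (2-bit cell), so the maximum analog
# column sum is 128 * 1 * 3 = 384, which needs a 9-bit ADC to resolve.
col_max = 128 * 1 * 3
adc_bits = math.ceil(math.log2(col_max + 1))
assert adc_bits == 9

# Shift-and-add across 8 columns of 2-bit weight slices: the most significant
# slice is shifted left by 2*7 = 14 bits, giving a 23-bit partial result.
partial_bits = adc_bits + 2 * 7
assert partial_bits == 23

# Shift-and-add across 16 input bit-iterations grows the accumulator to the
# 39-bit output that is finally scaled back to 16 bits.
out_bits = partial_bits + 16
assert out_bits == 39
```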
Based on these observations, we present optimizations that exploit the high compute density and flexible datapath enabled by analog computation to improve computational efficiency. These optimizations are applicable to any accelerator that uses analog in-situ crossbars, as the techniques primarily target the high ADC overhead. Once we improve the efficiency of the computation, the next major overhead comes from the communication of values. As communication (on-chip and off-chip) and storage overheads (SRAM or eDRAM buffers) depend on the overall accelerator architecture, we choose the ISAAC architecture as the baseline when discussing our optimizations.
III-A. Reducing Computational Overhead
III-A1. Karatsuba's Divide-and-Conquer Multiplication Technique
With the ADC being the major contributor to total power, we discuss a divide-and-conquer strategy at the bit level that reduces pressure on ADC usage and hence ADC power. The classic multiplication approach for two n-bit numbers has a complexity of O(n^2): each bit of one number is multiplied with the n bits of the other number, and the partial results are shifted and added to get the final 2n-bit result.
Karatsuba's divide-and-conquer algorithm reduces the complexity from O(n^2) to O(n^log2(3)) ≈ O(n^1.58). As shown in Figure 3, it divides each number into two halves of n/2 bits, the MSB half and the LSB half, and instead of performing four smaller n/2-bit multiplications, it calculates the result with two n/2-bit multiplications and one (n/2 + 1)-bit multiplication.
To illustrate the benefit of this technique, consider the same example discussed earlier using 128×128 crossbars, 2-bit cells, and 1-bit DACs. The full product of input and weight is performed on 8 crossbars in 16 cycles (since each weight is spread across 8 cells in 8 different crossbars and the input is spread across 16 iterations). In the example in Figure 3, the product of the input and weight MSB halves is performed on four crossbars in 8 iterations (since we are dealing with fewer bits for weights and inputs). The same is true for the product of the LSB halves. A third set of crossbars stores the weight sums (MSB half plus LSB half) and receives the precomputed input sums. This computation is spread across 5 crossbars and 9 iterations. We see that the total amount of work is reduced by 15%.
There are a few drawbacks as well. A computation now takes 17 iterations instead of 16. The net area increases because the network must send the inputs and their precomputed sums in parallel, an additional crossbar is needed, the output buffer must be larger to store subproducts, and 128 1-bit full adders are required to compute the input sums. Again, given that the ADC is the primary bottleneck, these other overheads are relatively minor.
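One level of the Karatsuba split described above can be sketched as follows. This is our own illustration of the arithmetic (variable names are ours); in Newton the three sub-multiplies map to the three groups of crossbars:

```python
def karatsuba(a, b, n=16):
    """One level of Karatsuba for two n-bit numbers: three sub-multiplies
    instead of four, recombined with shifts and adds."""
    h = n // 2
    a_h, a_l = a >> h, a & ((1 << h) - 1)      # MSB / LSB halves of the input
    b_h, b_l = b >> h, b & ((1 << h) - 1)      # MSB / LSB halves of the weight
    p_hh = a_h * b_h                           # n/2-bit multiply (crossbar group 1)
    p_ll = a_l * b_l                           # n/2-bit multiply (crossbar group 2)
    p_mid = (a_h + a_l) * (b_h + b_l)          # (n/2+1)-bit multiply (group 3)
    # a*b = p_hh*2^n + (p_mid - p_hh - p_ll)*2^(n/2) + p_ll
    return (p_hh << n) + ((p_mid - p_hh - p_ll) << h) + p_ll

assert karatsuba(40000, 123) == 40000 * 123
```

The middle term reuses the two outer products, which is why only one extra (n/2 + 1)-bit multiply is needed; the sums (a_h + a_l) correspond to the precomputed input sums sent over the network, and (b_h + b_l) to the weight sums installed in the third crossbar group.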
III-A2. Strassen's Algorithm
A divide-and-conquer approach can also be applied to matrix-matrix multiplication. By partitioning each of the matrices A and B into 4 submatrices, we can express the matrix-matrix multiplication in terms of multiplications of submatrices. A typical algorithm would require 8 submatrix multiplications, followed by an aggregation step. But as shown in Figure 4, linear algebra manipulations can perform the same computation with 7 submatrix multiplications, with appropriate pre- and post-processing. Similar to Karatsuba's algorithm, this has the advantage of reducing ADC usage and power.
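One level of this 7-multiply decomposition can be sketched with the standard Strassen formulas (our illustration; the paper's Figure 4 gives its own mapping):

```python
import numpy as np

def strassen(A, B):
    """One Strassen level on even-sized square matrices: 7 block products
    (M1..M7) plus pre/post additions, instead of 8 block products."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)   # each @ is one submatrix multiply
    M2 = (A21 + A22) @ B11           # (one crossbar-mapped block product)
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C11 = M1 + M4 - M5 + M7          # post-processing: recombine blocks
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

rng = np.random.default_rng(1)
A = rng.integers(-8, 8, (4, 4))
B = rng.integers(-8, 8, (4, 4))
assert np.array_equal(strassen(A, B), A @ B)
```

The block additions (pre- and post-processing) are cheap digital operations; the savings come from replacing one of the eight expensive block products, which in an analog design means fewer crossbar evaluations and ADC conversions.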
The above two optimizations reduce the computation energy by 20.6% while incurring a storage overhead of 4.3%. While both divide-and-conquer algorithms (Karatsuba's and Strassen's) are highly effective for a crossbar-based architecture, they have very little impact on other digital accelerators. For example, these algorithms may impact the efficiency of the NFUs in DaDianNao, but DaDianNao's area is dominated by eDRAM banks and not NFUs. In fact, Strassen's algorithm can lower DaDianNao's efficiency because buffering requirements may increase. On the other hand, analog computations are dominated by ADCs, so efficient computation noticeably impacts overall efficiency. Further, some of the pre-processing for these algorithms is performed when installing weights on the analog crossbars, but it has to be performed on-the-fly for digital accelerators.
III-A3. Adaptive ADCs
A simple dot-product operation on 16-bit values performed using crossbars typically results in an output of more than 16 bits. In the example discussed earlier, using 2-bit cells in crossbars and 1-bit DACs yielded a 39-bit output. Once the scaling factor is applied, the least significant 10 bits are dropped. The most significant 13 bits represent an overflow that cannot be captured in the 16-bit result, so they are effectively used to clamp the result to a maximum value.
What is of note here is that the output from every crossbar column in every iteration is resolved with a high-precision 9-bit ADC, but many of these bits contribute to either the 10 least significant bits or the 13 most significant bits that are eventually going to be ignored. This is an opportunity to lower the ADC precision and ignore some bits, depending on the column and the iteration being processed. Figure 5 shows the number of relevant bits emerging from every column in every iteration. Note that before dropping the highest ignored least significant bit, we use rounding modes to generate carries, similar to [11].
The ADC accounts for a significant fraction of IMA power. When the ADC is operating at a lower resolution, it has less work to do. In every 100 ns iteration, we tune the resolution of a SAR ADC to match the requirement in Figure 5. Thus, the use of adaptive ADCs helps reduce IMA power while having no impact on performance. We are also ignoring bits that do not show up in a 16bit fixedpoint result, so we are not impacting the functional behavior of the algorithm, thus having zero impact on algorithm accuracy.
A SAR ADC does a binary search over the input voltage to find the digital value, starting from the MSB. A bit is set to 1, and the resulting digital value is converted to analog and compared with the input voltage. If the input voltage is higher, the bit is kept at 1; otherwise it is reset to 0. The search then moves to the next bit, and the process repeats. If the number of bits to be sampled is reduced, the circuit can skip the later stages; the ADC simply gates off its circuits until the next sample is provided. It is important to note that because the ADC starts the binary search from the MSB, it is not possible to sample just the less significant bits of an output without knowing the MSBs. But in this case, we have a unique advantage: if any of the MSBs to be truncated is 1, the output neuron value is clamped to the highest value in the fixed-point range. Thus, in order to sample a set of LSBs, the ADC starts the binary search at the bit just above them (LSB+1). If that comparison yields true, at least one of the truncated MSBs is one; this signal is sent across the inter-crossbar network (e.g., the H-tree) and the output is clamped.
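The clamping trick can be illustrated with a behavioral model of the SAR search. This is a sketch under our own naming and a normalized-voltage assumption (v_ref = 1.0), not the circuit itself:

```python
def sar_convert(v_in, full_bits=9, keep_bits=None, v_ref=1.0):
    """Behavioral SAR conversion; optionally resolve only the low keep_bits.

    Returns None (a clamp signal) if any truncated MSB would be 1.
    """
    full_scale = 1 << full_bits
    if keep_bits is None:
        keep_bits = full_bits
    # First comparison at the bit just above the kept range (LSB+1): if the
    # input exceeds that threshold, some truncated MSB is 1 -> clamp.
    if keep_bits < full_bits and v_in >= v_ref * (1 << keep_bits) / full_scale:
        return None
    code = 0
    for bit in reversed(range(keep_bits)):        # MSB-first search on kept bits
        trial = code | (1 << bit)
        if v_in >= v_ref * trial / full_scale:    # CDAC comparison
            code = trial                          # keep the bit at 1
    return code

assert sar_convert(0.25, full_bits=9) == 128      # 0.25 of a 512-code range
assert sar_convert(0.9, full_bits=9, keep_bits=5) is None   # clamped
```

Each skipped comparison is a gated-off SAR stage, which is where the adaptive scheme saves energy: low-resolution samples run fewer comparator and CDAC cycles.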
In conventional SAR ADCs [29], a third of the power is dissipated in the capacitive DAC (CDAC), a third in digital circuits, and a third in other analog circuits. The MSB decision in general consumes more power because it involves charging up the CDAC at the end of every sampling iteration. Recent trends show CDAC power diminishing due to the use of tiny unit capacitances (about 2 fF) and innovative reference buffer designs, leading to ADCs that consume more of their power in analog and digital circuits [18, 23]. The adaptive ADC technique reduces energy consumption irrespective of the ADC design, since it eliminates both LSB and MSB tests across the 16 iterations.
III-B. Communication and Storage Optimizations
So far, we have discussed optimization techniques that are applicable to any analog crossbar architecture. To further improve analog accelerator efficiency, it is critical to also reduce communication and storage overheads. As the effectiveness of optimizing communication varies with the overall architecture, we describe our proposals in the context of the ISAAC architecture. Similar to ISAAC, we employ a tiled architecture, where every tile is composed of several IMAs along with digital computational units and eDRAM storage.
The previous subsection focused on techniques to improve an IMA; we now shift our focus to the design of a tile. We first reduce the size of the buffer in each tile that feeds all its IMAs. We then create heterogeneous tiles that suit convolutional and fullyconnected layers.
III-B1. Reducing Buffer Sizes
Because ISAAC did not place constraints on how layers are mapped to crossbars and tiles, its eDRAM buffer was sized to 64 KB to accommodate the worst-case requirements of workloads. Here, we design mapping techniques that reduce the storage requirement per tile and move it closer to the average case.
To explain the impact of mapping on buffering requirements, first consider the convolutional layer shown in Figure 6a. Once a certain number of inputs are buffered (shown in green and pink), the layer enters steady state; every new input pixel allows the convolution to advance by another step. The buffer size is a constant as the convolution advances (each new input evicts an old input that is no longer required). In every step, a subset of the input buffer is fed as input to the crossbar to produce one pixel in each of many output feature maps. If the crossbar is large, it is split across 2 tiles, as shown in Figure 6a. The split is done so that Tile 1 manages the green buffer and green inputs, and Tile 2 manages the pink buffer and pink inputs. Such a split means that inputs do not have to be replicated on both tiles, and buffering requirements are low.
Now, consider an early convolutional layer. Early convolutional layers have more work to do than later layers since they deal with larger feature maps. In ISAAC, to keep the pipeline balanced, early convolutional layers are replicated so their throughput matches that of later layers. Figure 6b replicates the crossbar; one copy is responsible for every odd pixel in the output feature maps, while the other is responsible for every even pixel. In any step, both crossbars receive very similar inputs, so the same input buffer can feed both crossbars.
If a replicated layer is large enough that it must be spread across (say) 4 tiles, we have two options, shown in Figures 6c and 6d. If the odd computation is spread across two tiles (1 and 2) and the even computation is spread across two different tiles (3 and 4), the same green inputs have to be sent to Tile 1 and Tile 3, i.e., the input buffers are replicated. Instead, as shown in Figure 6d, if we co-locate the top quadrant of the odd computation and the top quadrant of the even computation in Tile 1, the green inputs are consumed entirely within Tile 1 and do not have to be replicated. This partitioning leads to the minimum buffer requirement.
The bottom line from this mapping is that when a layer is replicated, the buffering requirements per neuron and per tile are reduced. This is because multiple neurons that receive similar inputs can reuse the contents of the input buffer. Therefore, heavily replicated (early) layers have lower buffer requirements per tile than lightly replicated (later) layers. If we mapped these layers to tiles as shown in Figure 7a, the worst-case buffering requirement goes up (64 KB for the last layer), and early layers end up under-utilizing their 64 KB buffer. To reduce the worst-case requirement and the under-utilization, we instead map layers to tiles as shown in Figure 7b. Every layer is finely partitioned and spread across 10 tiles, and every tile maps part of a layer. By spreading each layer across many tiles, every tile can enjoy the buffering efficiency of early layers. By moving every tile's buffer requirement closer to the average case (21 KB in this example), we can design a tile with a smaller eDRAM buffer (21 KB instead of 64 KB) that achieves higher overall computational efficiency. This has minimal impact on inter-tile neuron communication because adjacent layers are mapped to the same tile; hence, even though a single layer is distributed across multiple tiles, the neurons being communicated across layers typically travel short distances.
III-B2. Different Tiles for Convolutions and Classifiers
While ISAAC uses the same homogeneous tile for the entire chip, we observe that convolutional layers have very different resource demands than fully-connected classifier layers. The classifier (or FC) layer has to aggregate a set of inputs required by a set of crossbars; the crossbars then perform their computation; the inputs are discarded and a new set of inputs is aggregated. This results in the following properties for the classifier layer:

- The classifier layer has a high communication-to-compute ratio, so the router bandwidth limits how often the crossbars can be busy.
- The classifier also has the highest synaptic weight requirement because every neuron has private weights.
- The classifier has low buffering requirements: an input is seen by several neurons in parallel, and the input can be discarded right after.
We therefore design special tiles customized for classifier layers that:

- have a higher crossbar-to-ADC ratio (4:1 instead of 1:1),
- operate the ADC at a lower rate (10 Msamples/sec instead of 1.2 Gsamples/sec), and
- have a smaller eDRAM buffer (4 KB instead of 16 KB).
For small-scale workloads that fit on a single chip, we design a chip where many of the tiles are conv-tiles and some are classifier-tiles (a 1:1 ratio is a good fit for most of our workloads). For large-scale workloads that use multiple chips, each chip can be homogeneous; we use roughly an equal number of conv-chips and classifier-chips. The results consider both cases.
III-C. Putting the Pieces Together
We use ISAAC as the baseline architecture and evaluate the proposed optimizations by enhancing it. We presented a general overview of ISAAC in Section II. We make two key enhancements to ISAAC to improve both area and compute efficiency. Note that these two enhancements are specific to the ISAAC architecture; following them, we present implementation details of the numerical algorithms discussed in the previous subsections.
First, ISAAC did not place any constraints on how a neural network can be mapped to its many tiles and IMAs. As a result, its resources, notably the H-tree and buffers within an IMA, are provisioned to handle the worst case. This has a negative impact on power and area efficiency. Instead, we place constraints on how the workload is mapped to IMAs. While this inflexibility can waste a few resources, we observe that it also significantly reduces the H-tree size and hence the area per IMA. The architecture is still general-purpose, i.e., arbitrary CNNs can be mapped to it.
Second, within an IMA, we co-locate an ADC with each crossbar. The digitized outputs are then sent to the IMA's output register via an H-tree network. While ISAAC was agnostic to how a single synaptic weight was scattered across multiple bitlines, we adopt the following approach to boost efficiency. A 16-bit weight is scattered across 8 2-bit cells; each cell is placed in a different crossbar. Therefore, crossbars 0 and 8 are responsible for the least significant bits of every weight, and crossbars 7 and 15 are responsible for the most significant bits of every weight. We also embed the shift-and-add units in the H-tree. The shift-and-add unit at the leaf of the H-tree adds the digitized 9-bit dot-product results emerging from two neighboring crossbars. Because the operation is a shift-and-add, it produces an 11-bit result. The next shift-and-add unit takes two 11-bit inputs to produce a 13-bit result, and so on. We further modify the mapping by placing the constraint that an IMA cannot be shared by multiple network layers.
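The pairwise width growth up the H-tree (9 → 11 → 13 bits, and so on) can be sketched functionally. This is our own model (function name is ours), assuming crossbar i holds bit-slice i of the weight:

```python
def htree_reduce(col_results, cell_bits=2):
    """Pairwise shift-and-add of per-crossbar column results up an H-tree.

    At each level, the partner holding the more significant slice is shifted
    left before adding, so result widths grow by the slice width per level.
    """
    level = list(col_results)
    shift = cell_bits                     # leaf level: neighbor is 2 bits up
    while len(level) > 1:
        level = [lo + (hi << shift)       # one shift-and-add unit per pair
                 for lo, hi in zip(level[0::2], level[1::2])]
        shift *= 2                        # next level spans twice the bits
    return level[0]

# Slicing a 16-bit value into 8 2-bit digits and reducing recovers the value.
w = 54321
slices = [(w >> (2 * i)) & 3 for i in range(8)]
assert htree_reduce(slices) == w
```

Embedding the adders in the H-tree this way means each link only carries a result as wide as its subtree needs, rather than routing full-width partial sums from every crossbar to a central accumulator.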
To implement Karatsuba's algorithm, we modify the In-situ Multiply Accumulate (IMA) units as shown in Figure 9. The changes are localized to a single mat. Each mat now has two crossbars that share the DAC and ADC. Given the size of the ADC, the extra crossbar per mat has a minimal impact on area. The left crossbars in four of the mats now store the weight MSB halves (Figure 3); the left crossbars in the other four mats store the weight LSB halves; the right crossbars in five of the mats store the weight sums; the right crossbars in three of the mats are unused. In the first 8 iterations, the 8 ADCs are used by the left crossbars. In the next 9 iterations, 5 ADCs are used by the right crossbars. As discussed earlier, the main objective here is to lower power by reducing use of the ADC. Divide-and-conquer can be applied recursively. When applied again, the computation keeps 8 ADCs busy in the first 4 iterations, and 6 ADCs in the next 10 iterations. This is a 28% reduction in ADC use, and a 13% reduction in execution time. But we pay an area penalty because 20 crossbars are needed per IMA. Figure 8 shows the mapping of computations within an IMA to implement Strassen's algorithm. The seven submatrix products in Strassen's algorithm (Figure 4) are mapped to 7 IMAs in the tile. The 8th IMA can be allocated to another layer's computation.
With all these changes targeting high compute efficiency and low communication and storage overhead, we refer to the updated analog design as the Newton architecture.
IV. Methodology
Modeling Area and Energy
For modeling the energy and area of the eDRAM buffers and the on-chip interconnect, such as the H-tree and tile bus, we use CACTI 6.5 [22] at 32 nm. The area and energy model of a memristor crossbar is based on [14]. We adapt the area and energy of the shift-and-add circuits, max/average pooling block, and sigmoid operation from the analyses in DaDianNao [7] and ISAAC [26]. We use the same HyperTransport serial link model for off-chip interconnects as DaDianNao [7] and ISAAC [26]. The router area and energy are modeled using Orion 2.0 [15]. While our buffers can also be implemented with SRAM, we use eDRAM to make an apples-to-apples comparison with the ISAAC baseline. Newton is only used for inference, with a delay of 16.4 ms to preload the weights in a chip.
To model the ADC energy and area, we use a recent survey [23] of ADC circuits published in various circuit conferences. The Newton architecture uses the same 8-bit ADC [18] at 32 nm as ISAAC, partly because it yields the best configuration in terms of area/power and meets the sampling frequency requirement, and partly because it can be reconfigured for different resolutions, at the cost of a minimal increase in ADC area. We scale the ADC power with respect to sampling frequency according to another work by Kull et al. [18]. The SAR ADC has six components: comparators, asynchronous clock logic, sampling clock logic, data memory and state logic, reference buffer, and capacitive DAC. The ADC power at different sampling resolutions is modeled by gating off all components except the sampling clock.
We use a 1-bit DAC, as in ISAAC, because it is relatively small and has a high SNR. Since a DAC is used in every row of the crossbar, a 1-bit DAC improves area efficiency.
The key architecture parameters used in our analysis are reported in Table I.
This work considers recent workloads with state-of-the-art accuracy in image classification tasks (summarized in Table II). We create an analytic model for a Newton pipeline within an IMA and within a tile and map the suite of benchmarks, making sure that there are no structural hazards in any of these pipelines. We consider network bandwidth limitations in our simulation model to estimate throughput. Since ISAAC is a throughput architecture, we do an iso-throughput comparison of the Newton architecture with ISAAC for the different intra-IMA and intra-tile optimizations. Since the dataflow in the architecture is bounded by the router bandwidth, in each case we allocate enough resources until the network saturates to create our baseline model. For subsequent optimizations, we retain the same throughput. As in ISAAC, data transfers between tiles on-chip, and on the HT links across chips, are statically routed to make them conflict-free. Like ISAAC, the latency and throughput of Newton for the given benchmarks can be calculated analytically using a deterministic execution model. Since there aren't any runtime dependencies on the control flow or dataflow of the deep networks, analytical estimates are sufficient to capture the behavior of cycle-accurate simulations.
We create a similar model for ISAAC, taking into consideration all the parameters mentioned in that paper.
Design Points
The Newton architecture can be designed by optimizing one of the following two metrics:

CE: Computational Efficiency, the number of fixed-point operations performed per second per unit area (ops/s/mm²).

PE: Power Efficiency, the number of fixed-point operations performed per second per unit power (ops/s/W).
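These two metrics reduce to simple ratios; a minimal sketch:

```python
def computational_efficiency(ops_per_sec, area_mm2):
    """CE: fixed-point operations per second per unit area (ops/s/mm^2)."""
    return ops_per_sec / area_mm2

def power_efficiency(ops_per_sec, power_w):
    """PE: fixed-point operations per second per unit power (ops/s/W)."""
    return ops_per_sec / power_w

# Hypothetical design point, for illustration only: 100 Gops/s,
# 4 mm^2, 20 W.
ce = computational_efficiency(100.0, 4.0)   # Gops/s/mm^2
pe = power_efficiency(100.0, 20.0)          # Gops/s/W
```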
For every considered innovation, we model Newton for a variety of design points that vary crossbar size, number of crossbars per IMA, and number of IMAs per tile. In most cases, the same configuration emerged as the best. We therefore focus most of our analysis on this optimal configuration, which has 16 IMAs per tile, where each IMA uses 16 crossbars to process 128 inputs for 256 neurons. We report the area, power, and energy improvements for all the deep neural networks in our benchmark suite.
Table I: Key architecture parameters.
Component  |  Spec  |  Power  |  Area (mm²)
Router  |  32 flits, 8 ports  |  168 mW  |  0.604
ADC  |  8-bit resolution, 1.2 GS/s  |  3.1 mW  |  0.0015
HyperTransport  |  4 links @ 1.6 GHz, 6.4 GB/s link bandwidth  |  10.4 W  |  22.88
DAC array  |  128 1-bit DACs  |  0.5 mW  |  0.00002
Memristor crossbar  |  128x128  |  0.3 mW  |  0.0001
Table II: Benchmark networks. Each entry is kernel size, number of output feature maps/stride (t), where t is the number of such layers. Stride is 1 unless explicitly mentioned. Layer* denotes a convolution layer with private kernels.

input size | Alexnet [17] | VGG-A [28] | VGG-B [28] | VGG-C [28] | VGG-D [28] | MSRA-A [13] | MSRA-B [13] | MSRA-C [13] | Resnet-34 [12]
224 | 11x11,96 (4) | 3x3,64 (1) | 3x3,64 (2) | 3x3,64 (2) | 3x3,64 (2) | 7x7,96/2 (1) | 7x7,96/2 (1) | 7x7,96/2 (1) | 7x7,64/2
    | 3x3 pool/2 | 2x2 pool/2 | 3x3 pool/2
112 | 3x3,128 (1) | 3x3,128 (2) | 3x3,128 (2) | 3x3,128 (2)
    | 2x2 pool/2
56  | 3x3,256 (2) | 3x3,256 (2) | 3x3,256 (3) | 3x3,256 (4) | 3x3,256 (5) | 3x3,256 (6) | 3x3,384 (6) | 3x3,64 (6)
    | 1x1,256 (1)
    | 2x2 pool/2 | 3x3,128/2 (1)
28  | 5x5,256 (1) | 3x3,512 (2) | 3x3,512 (2) | 3x3,512 (3) | 3x3,512 (4) | 3x3,512 (5) | 3x3,512 (6) | 3x3,768 (6) | 3x3,128 (7)
    | 1x1,256 (1)
    | 3x3 pool/2 | 2x2 pool/2 | 3x3,256/2 (1)
14  | 3x3,384 (2) | 3x3,512 (2) | 3x3,512 (2) | 3x3,512 (3) | 3x3,512 (4) | 3x3,512 (5) | 3x3,512 (6) | 3x3,896 (6) | 3x3,256 (11)
    | 3x3,256 (1) | 1x1,512 (1)
    | 3x3 pool/2 | 2x2 pool/2 | spp,7,3,2,1 | 3x3,512/2 (1)
7   | FC4096 (2) | 3x3,512 (5)
    | FC1000 (1)
V Results
The Newton architecture takes the baseline analog accelerator ISAAC and incrementally applies the innovations discussed earlier. We begin with results for optimizations targeting global components such as the H-Tree, followed by tile- and IMA-level techniques. As mentioned earlier, while we build on ISAAC and use it for evaluation, the proposed crossbar enhancements are applicable to any analog architecture.
Constrained Mapping for Compact HTree
We first observe that the ISAAC IMA is designed with an over-provisioned H-Tree that can handle a worst-case mapping of the workload. We impose the constraint that an IMA can only handle a single layer, and a maximum of 128 inputs. This restricts the width of the H-Tree, promotes input sharing, and enables reduction of partial neuron values at the junctions of the H-Tree. While this helps shrink the size of an IMA, it can leave crossbars within an IMA underutilized. We consider different IMA sizes, ranging from a small IMA that supplies the same 128 inputs to 4 crossbars to produce 64 output neurons, up to much larger IMAs. Figure 10 plots the average underutilization of crossbars across the different workloads in the benchmark suite. For larger IMA sizes, the underutilization is quite significant; larger IMA sizes also result in more complex H-Trees. Therefore, a moderately sized IMA that processes 128 inputs for 256 neurons has high computational efficiency and low crossbar underutilization; for this design, the underutilization is only 9%. Figure 11 quantifies how our constrained mapping and compact H-Tree improve area, power, and energy per workload. In short, our constraints improve area efficiency by 37% and power/energy efficiency by 18%, while leaving only 9% of crossbars underutilized.
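The underutilization trend can be approximated with a simple tiling model. This is a sketch under simplifying assumptions: it ignores bit-slicing of weights across cells and other mapping details, and the one-layer-per-IMA constraint is the only rule enforced:

```python
import math

def ima_underutilization(layer_inputs, layer_outputs,
                         ima_rows=128, ima_cols=256):
    """Fraction of crossbar capacity wasted when a layer needing
    layer_inputs x layer_outputs synapses is tiled onto IMAs that
    each accept ima_rows inputs and produce ima_cols outputs,
    with one layer per IMA (simplified sketch)."""
    imas = (math.ceil(layer_inputs / ima_rows) *
            math.ceil(layer_outputs / ima_cols))
    used = layer_inputs * layer_outputs
    total = imas * ima_rows * ima_cols
    return 1.0 - used / total
```

A layer whose dimensions divide the IMA evenly wastes nothing; odd-sized layers strand the partial tiles, and the stranded fraction grows with IMA size, matching the trend in Figure 10.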
Heterogeneous ADC Sampling
The heterogeneous sampling of outputs using adaptive ADCs has a big impact on reducing the power profile of the analog accelerator. In one iteration of 100 ns, at most 4 ADCs work at the maximum resolution of 8 bits; the power supplied to the remaining ADCs can be reduced. We measure the reduction in area, power, and energy with respect to the new IMA design with the compact H-Tree. Since the ADC contributed 49% of chip power in ISAAC, eliminating ADC oversampling reduces the power requirement by 15% on average. Area efficiency improves as well, since the output H-Tree now carries 16 bits instead of unnecessarily carrying 39 bits of final output. The improvements are shown in Figure 12.
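The resolution a given conversion actually needs follows from the number of active rows and the cell/DAC bit widths; a hedged sketch of that worst-case bound (note that ISAAC-style data encoding can shave one further bit off it):

```python
import math

def adc_bits_needed(active_rows, cell_bits, dac_bits=1):
    """Resolution needed to capture a crossbar column's dot product
    exactly: the sum of active_rows products of cell_bits-bit weight
    slices and dac_bits-bit input slices. Worst-case bound; encoding
    tricks (as in ISAAC) can reduce it by one bit."""
    max_val = active_rows * (2**cell_bits - 1) * (2**dac_bits - 1)
    return math.ceil(math.log2(max_val + 1))
```

Columns that combine fewer rows, or that contribute only low-order bit slices after the shift-and-add, need fewer bits, which is what the adaptive ADC exploits.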
Karatsuba’s Algorithm
We further reduce the power profile with divide-and-conquer within an IMA. Figure 13 shows the impact of recursively applying the divide-and-conquer technique multiple times. Applying it once is nearly as good as applying it twice, and much less complex; we therefore focus on a single divide-and-conquer step. Improvements are reported in Figure 14. Energy efficiency improves by almost 25% over the previous design point, because ADCs end up being used only 75% of the time in the 1700 ns window. However, this comes at the cost of a 6.4% reduction in area efficiency, because of the need for more crossbars and an increase in H-Tree bandwidth to send the sums of inputs.
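The arithmetic behind this step is Karatsuba's trick applied to a dot product: splitting both inputs and weights into high/low halves turns four sub-dot-products into three, and the combined high+low term is why the H-Tree must also carry sums of inputs. A software sketch of the arithmetic only (the hardware mapping to crossbars differs):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def karatsuba_dot(x, w, half_bits):
    """One divide-and-conquer step: split the inputs and weights into
    high/low halves of half_bits bits each, then combine three
    sub-dot-products instead of four (Karatsuba)."""
    mask = (1 << half_bits) - 1
    x_hi, x_lo = [v >> half_bits for v in x], [v & mask for v in x]
    w_hi, w_lo = [v >> half_bits for v in w], [v & mask for v in w]
    hi = dot(x_hi, w_hi)   # high x high
    lo = dot(x_lo, w_lo)   # low x low
    # middle term uses the SUMS of the halves -- one product, not two
    mid = dot([a + b for a, b in zip(x_hi, x_lo)],
              [a + b for a, b in zip(w_hi, w_lo)]) - hi - lo
    return (hi << (2 * half_bits)) + (mid << half_bits) + lo
```

Dropping one of the four sub-products is what lets a quarter of the ADC conversions be skipped, at the cost of the extra crossbars and input sums noted above.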
eDRAM Buffer Requirements
In Figure 15, we report the buffer requirement per tile when layers are spread across many tiles, for a variety of tile/IMA configurations. Image size has a linear impact on the buffering requirement. For 256×256 images, the buffer reduction technique leads to the choice of a 16 KB buffer instead of the 64 KB used in ISAAC, a 75% reduction. Figure 16 shows a 6.5% average improvement in area efficiency from this technique.
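A back-of-the-envelope model shows why the buffer need scales linearly with image width: a sliding k×k kernel only ever needs a few rows of the input feature map resident at once. The constants below (bytes per pixel, rows retained) are illustrative assumptions, not the paper's exact accounting:

```python
def tile_buffer_bytes(image_width, channels, kernel, stride,
                      bytes_per_pixel=2):
    """Rough eDRAM buffer estimate for one conv layer: a kernel x kernel
    window with the given stride needs about (kernel - stride) previous
    rows plus the row being filled, each image_width x channels pixels.
    Linear in image width, as observed in the paper (sketch only)."""
    rows_buffered = kernel - stride + 1
    return rows_buffered * image_width * channels * bytes_per_pixel
```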
Conv-Tiles and Classifier-Tiles
Figure 17 plots the decrease in power requirement when FC tiles are operated at a fraction of the conv-tile rate. None of these configurations lower throughput, because the FC layer is not on the critical path. Since ADC power scales linearly with sampling rate, the power profile is lowest when the ADCs work slowest; this leads to 50% lower peak power on average. In Figure 18, we plot the increase in area efficiency when multiple crossbars share the same ADC in FC tiles. The underutilization of FC tiles provides room for making them storage efficient, saving 38% of chip area on average. We do not increase the sharing ratio beyond 4 because the multiplexer connecting the crossbars to the ADC becomes complex. Resnet does not gain much from the heterogeneous tiles because it needs relatively few FC tiles.
Strassen’s Algorithm
Strassen's optimization is especially useful when large matrix multiplications can be performed in the conv layers without much wastage of crossbars, providing room for the decomposition of those large matrices that is at the heart of Strassen's technique. Resnet has high wastage when using larger IMAs, and thus does not benefit at all from this technique. Overall, Strassen's algorithm increases energy efficiency by 4.5%, as seen in Figure 19.
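For reference, one level of Strassen's decomposition replaces the eight block products of a 2×2 blocked matrix multiplication with seven. A scalar-block sketch (in Newton, each of the seven products would map to a separate, smaller crossbar computation):

```python
def strassen_2x2_block(A, B):
    """One level of Strassen's algorithm on matrices partitioned into
    2x2 blocks: seven block multiplications instead of eight. Blocks
    are plain numbers here for brevity; with matrix blocks the same
    recurrences apply."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))
```

The saved eighth product is where the crossbar (and hence ADC) savings come from; the extra additions are cheap in the digital periphery.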
Putting it all together.
Figure 20 plots the incremental effect of each of our techniques on the peak computational and power efficiency of DaDianNao, ISAAC, and Newton. We do not include the heterogeneous FC tile in this plot because it is deliberately operated slowly, being non-critical; as a result, its peak throughput is lower by definition. We see that both the adaptive ADC and divide & conquer play a significant role in increasing PE. While the impact of Strassen's technique is not visible in this graph, it frees up resources (1 of every 8 IMAs) in a tile, providing room for a more compact mapping of networks and reducing ADC utilization.
Figure 21 shows the per-benchmark improvement in area efficiency and the contribution of each of our techniques. The compact H-Tree and the FC tiles are the biggest contributors. Figure 22 similarly shows a breakdown of the decrease in power envelope, and Figure 23 does the same for the improvement in energy efficiency. Multiple innovations (H-Tree, adaptive ADC, Karatsuba, FC tiles) contribute equally to the improvements. We also observed that the adaptive ADC technique's improvement is not very sensitive to the ADC design: for ADCs where the CDAC dissipates 10% and 27% of total ADC power, the corresponding improvements were 13% and 12%, respectively.
Figure 24 compares the 8-bit version of Newton with Google's TPU architecture. While Google has announced a second-generation TPU with 16-bit support, its architectural details are not yet public; hence, we limit our analysis to TPU1. We also scale the designs so that die area is the same for both architectures, i.e., an iso-area comparison. For TPU, we batch just enough images to stay within the 7 ms latency target demanded by most application developers. Since the Newton pipeline is deterministic and its crossbars are statically mapped to different layers, image latency is the same irrespective of batch size, and is comfortably below 7 ms for all the evaluated benchmarks. We also model TPU1 with GDDR5 memory to allocate sufficient bandwidth.
Figure 24 shows the throughput and energy improvement of Newton over TPU for the various benchmarks. Newton achieves an average improvement of 10.3× in throughput and 3.4× in energy over TPU. In terms of computational efficiency (CE) calculated using peak throughput, Newton is 12.3× better than TPU. However, when operating on FC layers, idle crossbars in Newton reduce this advantage to 10.3× for actual workloads.
When considering power efficiency (PE) calculated using peak throughput and power, Newton is only 1.6× better than TPU, but the benefit grows to 3.4× for real workloads. This is because of TPU's low memory bandwidth coupled with the reduced batch size for some workloads. As discussed earlier, the batch size in TPU is adjusted to meet the latency target. Since a large batch size alleviates the memory bandwidth problem, shrinking the batch to meet the latency target directly impacts power efficiency through more GDDR fetches and idle processing units.
The figure also shows that the throughput improvements for Alexnet and Resnet are not as high as for the other benchmarks: these networks are relatively small, which increases the TPU batch size and improves data locality for FC-layer weights. On the other hand, the MSRA3 benchmark has higher energy consumption than other workloads because TPU can process only one image per batch for it, dramatically increasing TPU's idle time while fetching the large number of FC-layer weights. In short, Newton's in-situ computation achieves superior energy and performance over TPU because the proposed design limits data movement while reducing analog computation overhead.
VI Conclusions
In this work, we target resource provisioning and efficiency in a crossbar-based deep network accelerator. Starting with the ISAAC architecture, we show that three approaches – heterogeneity, mapping constraints, and divide & conquer – can be applied within a tile and within an IMA. This results in smaller eDRAM buffers, a smaller H-Tree, energy-efficient ADCs with varying resolutions, energy and area efficiency in classifier layers, and fewer computations. Many of these ideas would also apply to a general accelerator for matrix-matrix multiplication, as well as to other neural networks such as RNNs and LSTMs. The Newton architecture cuts the current gap between ISAAC and an ideal neuron in half.
Appendix: Crossbar Implementations
This appendix discusses how crossbars can be designed to withstand noise effects in analog circuits.
Process Variation and Noise: Since an analog crossbar uses the actual conductance of individual cells to perform computation, it is critical to perform writes at maximum precision. We make two design choices to improve write precision. First, we equip each cell with an access transistor (1T1R cell) to precisely control the amount of write current going through it. While this increases area overhead, it eliminates sneak currents and their negative impact on write voltage variation [31]. Second, we use a closed-loop write circuit with current compliance that performs many iterations of program-and-verify operations. Prior work has shown that such an approach can provide more precise states, at the cost of increased write time, even with high process variation in cells [3].
In spite of a robust write process, a cell's resistance will still deviate from its nominal value within a tolerable range. This range will ultimately limit either the number of levels in a cell or the number of simultaneously active rows in a crossbar. For example, if a cell write can achieve a resistance within ±Δr of the target (where Δr is a function of noise and parasitics), L is the number of levels in a cell, and R is the max range of resistance of a cell, then we set the number of active rows to R/(L·Δr) to ensure there are no corrupted bits at the ADC.
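The constraint that accumulated per-cell error must stay below one level spacing can be made concrete with a small helper; the values in the example are hypothetical, chosen only for illustration:

```python
def max_active_rows(r_range, levels, delta_r):
    """Max simultaneously active crossbar rows such that the
    accumulated per-cell resistance error (delta_r per cell) stays
    below one level spacing (r_range / levels):
        N * delta_r < r_range / levels  =>  N <= r_range / (levels * delta_r)
    """
    return int(r_range / (levels * delta_r))
```

For example, with a hypothetical resistance range of 64 units, 4 levels per cell, and a per-cell tolerance of 0.25 units, at most 64 rows can be active at once; doubling the levels halves the allowed rows.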
Crossbar Parasitics: While a sophisticated write circuit coupled with limited array size can help alleviate process variation and noise, IR drop along rows and columns can also reduce crossbar accuracy. When a crossbar is being written during initialization, the access transistors in unselected cells shut off the sneak current path, limiting current flow to just the selected cells. However, when a crossbar operates in compute mode with multiple rows active, the net current in the crossbar increases, and the current path becomes more complicated. With the access transistors of every cell in the selected rows in the ON state, a network of resistors is formed, with every cell conducting varying current based on its resistance. Since the wire links connecting these cells have non-zero resistance, the voltage drop along rows and columns impacts computation accuracy: a cell at the far end of the driver sees a relatively lower read voltage than a cell closer to the driver. This change in voltage is a function of both wire resistance and the current flowing through wordlines and bitlines, which in turn depends on the data pattern in the array. This problem can be addressed by limiting the DAC voltage range and encoding the data to compensate for the IR drop [14]. Since the matrix being programmed into a crossbar is known beforehand, it is possible during the initialization phase to account for voltage drops and adjust cell resistances appropriately. Hu et al. [14] have demonstrated successful operation of a 256×256 crossbar with 5-bit cells even in the presence of thermal noise in the memristor, shot noise in the circuits, and random telegraph noise in the crossbar. For this work, a conservative model with a 128×128 crossbar, 2-bit cells, and a 1-bit DAC emerges as the ideal design point in most experiments.
References
 [1] “The Tomorrow Show: Three New Technologies from Hewlett Packard Labs,” 2016, https://youtu.be/tABpRpBW6h0?t=18m38s.
 [2] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Jerger, and A. Moshovos, “Cnvlutin: ZeroNeuronFree Deep Convolutional Neural Network Computing,” in Proceedings of ISCA43, 2016.
 [3] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High Precision Tuning of State for Memristive Devices by Adaptable VariationTolerant Algorithm,” Nanotechnology, 2012.
 [4] M. N. Bojnordi and E. Ipek, “Memristive Boltzmann Machine: A Hardware Accelerator for Combinatorial Optimization and Deep Learning,” in Proceedings of HPCA-22, 2016.
 [5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning,” in Proceedings of ASPLOS, 2014.
 [6] Y.H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for EnergyEfficient Dataflow for Convolutional Neural Networks,” in Proceedings of ISCA43, 2016.
 [7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A MachineLearning Supercomputer,” in Proceedings of MICRO47, 2014.
 [8] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel ProcessingInMemory Architecture for Neural Network Computation in ReRAMbased Main Memory,” in Proceedings of ISCA43, 2016.
 [9] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proceedings of ISCA42, 2015.
 [10] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in Proceedings of ASPLOS-22, 2017.
 [11] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in Proceedings of ICML, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv preprint arXiv:1512.03385, 2015.
 [13] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing HumanLevel Performance on Imagenet Classification,” in Proceedings of ICCV, 2015.
 [14] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, R. S. Williams, and J. Yang, “DotProduct Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate MatrixVector Multiplication,” in Proceedings of DAC53, 2016.
 [15] A. Kahng, B. Li, L.S. Peh, and K. Samadi, “ORION 2.0: A Fast and Accurate NoC Power and Area Model for EarlyStage Design Space Exploration,” in Proceedings of DATE, 2009.
 [16] D. Kim, J. H. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A Programmable Digital Neuromorphic Architecture with HighDensity 3D Memory,” in Proceedings of ISCA43, 2016.
 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proceedings of NIPS, 2012.
 [18] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, “A 3.1 mW 8b 1.2 GS/s SingleChannel Asynchronous SAR ADC with Alternate Comparators for Enhanced Speed in 32 nm Digital SOI CMOS,” Journal of SolidState Circuits, 2013.
 [19] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, “PuDianNao: A Polyvalent Machine Learning Accelerator,” in Proceedings of ASPLOS20, 2015.
 [20] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An Instruction Set Architecture for Neural Networks,” in Proceedings of ISCA43, 2016.
 [21] E. J. MercedGrafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, “Repeatable, Accurate, and High Speed MultiLevel Programming of Memristor 1T1R Arrays for Power Efficient Analog Computing Applications,” Nanotechnology, 2016.
 [22] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0,” in Proceedings of MICRO, 2007.
 [23] B. Murmann, “ADC Performance Survey 19972015 (ISSCC & VLSI Symposium),” 2015, http://web.stanford.edu/~murmann/adcsurvey.html.
 [24] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators,” in Proceedings of ISCA-43, 2016.
 [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, 2014.
 [26] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. Strachan, M. Hu, R. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in Proceedings of ISCA, 2016.
 [27] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proceedings of ISCA, 2016.
 [28] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for LargeScale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [29] N. Verma and A. P. Chandrakasan, “An ultra low energy 12bit rateresolution scalable SAR ADC for wireless sensor nodes,” IEEE Journal of SolidState Circuits, 2007.
 [30] Y. Li, Y. Zhang, S. Li, P. Chi, C. Jiang, P. Qu, Y. Xie, and W. Chen, “NEUTRAMS: Neural Network Transformation and Co-design under Neuromorphic Hardware Constraints,” in Proceedings of MICRO, 2016.
 [31] M. Zangeneh and A. Joshi, “Design and Optimization of Nonvolatile Multibit 1T1R Resistive RAM,” Proceedings of the Transactions on VLSI Systems, 2014.