Accelerators are in vogue today, primarily because it is evident that annual performance improvements can be sustained via specialization. There are also many emerging applications that demand high-throughput low-energy hardware, such as the machine learning tasks that are becoming commonplace in enterprise servers, self-driving cars, and mobile devices. The last two years have seen a flurry of activity in designing machine learning accelerators[5, 7, 9, 19, 8, 30, 27, 24, 16]. Similar to our work, most of these recent works have focused on inference in artificial neural networks, and specifically deep convolutional networks, that achieve state-of-the-art accuracies on challenging image classification workloads.
While most of these recent accelerators have used digital architectures [5, 7], a few have leveraged analog acceleration on memristor crossbars [26, 8, 4]. Such accelerators take advantage of in-situ computation to dramatically reduce data movement costs. Each crossbar is assigned to execute parts of the neural network computation and programmed with the corresponding weight values. Input neuron values are fed to the crossbar, and by leveraging Kirchoff’s Law, the crossbar outputs the corresponding dot product. The neuron output undergoes analog-to-digital conversion (ADC) before being sent to the next layer. Multiple small-scale prototypes of this approach have also been demonstrated [1, 21].
The design constraints for digital accelerators are very different from their analog designs. High communication overhead and the memory bottleneck are still first order design constraints in digital, whereas the computation overhead arising from analog-to-digital and digital-to-analog conversions, and balancing the extent of digital computation in an analog architecture are more critical in analog accelerators. In this work, we show that computation is a critical problem in analog and leverage numeric algorithms to reduce conversion overheads. Once we improve the efficiency of computation, the next major overhead comes from communication and storage. Towards this end, we discuss mapping techniques and buffer management strategies to further improve analog accelerator efficiency.
With these innovations in place, our new design, Newton, moves the analog architecture closer to the bare minimum energy required to process one neuron. We define an ideal neuron as one that keeps the weight in-place adjacent to a digital ALU, retrieves the input from an adjacent single-row eDRAM unit, and after performing one digital operation, writes the result to another adjacent single-row eDRAM unit. This energy is lower than that for a similarly ideal analog neuron because of the ADC cost. This ideal neuron operation consumes 0.33 pJ. An average DaDianNao operation consumes 3.5 pJ because it pays a high price in data movement for inputs and weights. ISAAC  is a state-of-the-art analog design that achieves an order of magnitude better performance than digital accelerators such as DaDianNao. An average ISAAC operation consumes 1.8 pJ because it pays a moderate price in data movement for inputs (weights are in-situ) and a high price for ADC. An average Eyeriss  operation consumes 1.67 pJ because of an improved dataflow to maximize reuse. The innovations in Newton push the analog architecture closer to the ideal neuron by consuming 0.85 pJ per operation. Relative to ISAAC, Newton achieves a 77% decrease in power, 51% decrease in energy, and increase in throughput/area.
We consider different CNNs presented in the ILSVRC challenge of image classification for the IMAGENET dataset. The suite of benchmarks considered in this paper is representative of the various dataflows in such image classification networks. For example, Alexnet is the simplest of CNNs with a reasonable accuracy, where a few convolution layers at the start extract features from the image, followed by fully connected layers that classify the image. The other networks were designed with a similar structure but made deeper and wider with more parameters. For example, MSRA Prelu-net  has 14 more layers than Alexnet  and has 330 million parameters, which is 5.5 higher than Alexnet. On the other hand, residual nets have forward connections with hops, i.e., output of a layer is passed on to not only the next layer but subsequent layers. Even though the number of parameters in Resnets  are much lower, these networks are much deeper and have a different dataflow, which changes the buffering requirements in accelerator pipelines.
Ii-B The Landscape of CNN Accelerators
Digital Accelerators. The DianNao  and DaDianNao  accelerators were among the first to target deep convolutional networks at scale. DianNao designs the digital circuits for a basic NFU (Neural Functional Unit). DaDianNao is a tiled architecture where each tile has an NFU and eDRAM banks that feed synaptic weights to that NFU. DaDianNao uses many tiles on many chips to parallelize the processing of a single network layer. Once that layer is processed, all the tiles then move on to processing the next layer in parallel. Recent papers, e.g., Cnvlutin , have modified DaDianNao so the NFU does not waste time and energy processing zero-valued inputs. EIE  and Minerva  address sparsity in the weights. Eyeriss  and ShiDianNao  improve the NFU dataflow to maximize operand reuse. A number of other digital designs [16, 20, 10] have also emerged in the past year.
Analog Accelerators. Two CNN accelerators introduced in the past year, ISAAC  and PRIME , have leveraged memristor crossbars to perform dot product operations in the analog domain. We will focus on ISAAC here because it out-performs PRIME in terms of throughput, accuracy, and ability to handle signed values. ISAAC is also able to achieve nearly 8 and 5 higher throughput than digital accelerators DaDianNao and Cnvlutin respectively.
Pipeline of Memristive Crossbars. In ISAAC, memristive crossbar arrays are used to perform analog dot-product operations. Neuron inputs are provided as voltages to wordlines; neuron weights are represented by pre-programmed cell conductances; neuron outputs are represented by the currents in each bitline. The neuron outputs are processed by an ADC and shift-and-add circuits. They are then sent as inputs to the next layer of neurons. As shown in Figure 1, ISAAC is a tiled architecture; one or more tiles are dedicated to process one layer of the neural network. To perform inference for one input image, neuron outputs are propagated from tile to tile until all network layers have been processed.
Tiles, IMAs, Crossbars. An ISAAC chip consists of many tiles connected in a mesh topology (Figure 1). Each tile includes an eDRAM buffer that supplies inputs to In-situ Multiply Accumulate (IMA) units. The IMA units consist of memristor crossbars that perform the dot-product computation, ADCs, and shift-and-add circuits that accumulate the digitized results. With a design space exploration, the tile is provisioned with an optimal number of IMAs, crossbars, ADCs, etc. Within a crossbar, a 16-bit weight is stored 2 bits per cell, across 8 columns. A 16-bit input is supplied as voltages over 16 cycles, 1 bit per cycle, using a trivial DAC array. The partial outputs are shifted and added across 8 columns, and across 16 cycles to give the output of MAC operations. Thus, there are two levels of pipelining in ISAAC: (i) the intra-tile pipeline, where inputs are read from eDRAM, processed by crossbars in 16 cycles, and aggregated, (ii) the inter-tile pipeline, where neuron outputs are transferred from one layer to the next. The intra-tile pipeline has a cycle time of 100 ns, matching the latency for a crossbar read. Inputs are sent to a crossbar in an IMA using an input h-tree network. The input h-tree has sufficient bandwidth to keep all crossbars active without bubbles. Each crossbar has a dedicated ADC operating at 1.28 shared across its 128 bitlines to convert the analog output to digital in 100 ns. An h-tree network is then used to collect digitized outputs from crossbars.
Crossbar Challenges. As with any new technology, a memristor crossbar has unique challenges, mainly in two respects. First, mapping a matrix onto a memristor crossbar array requires programming (or writing) cells with the highest precision possible. Second, real circuits deviate from ideal operation due to parasitics such as wire resistance, device variation, and write/read noise. All of these factors can cause the actual output to deviate from its ideal value. Recent work  has captured many of these details to show the viability of prototypes . The Appendix summarizes some of these details.
The design constraints for digital accelerators are very different from their analog counterparts. In any digital design, the overhead of communication, arising from the need to fetch both input feature and weights from memory, is the major limiting factor. As a result, most optimizations focus on improving memory bandwidth (e.g., HBM or GDDR), memory utilization (compression, zero value elimination, etc.) and scheduling (e.g., batching) to improve communication efficiency and performance. Techniques that primarily target improving digital computation should carefully consider their impact on additional on-chip storage and communication overheads, which can negatively affect overall efficiency.
In analog, because we do in-situ computation, only one of the operands needs to be transferred, and this reduces the communication overhead by at least 2. Furthermore, the transferred value (the input vector in the form of crossbar row voltage) is streamed across the entire crossbar (matrix values) guaranteeing high reuse and locality. The compute density of analog in-situ units is also better than digital accelerators. As analog crossbars store neural network weights, even if they are not performing computation, they still act as on-chip storage. Whereas, digital computational units need to have high utilization to maximize performance, otherwise their area is better utilized for more on-chip storage. Both these factors provide more flexibility for analog accelerators to explore computational optimizations at the expense of either more communication or crossbar storage.
In a digital design, the datapath size and its overhead are pre-determined. A 16-bit datapath operated with 12-bit values will achieve only marginal reduction in overhead as pipeline buffers and wire repeaters switch every cycle. However, as analog computation is being performed at bit level (1 or 2 bit computations in each bitline), reducing the operand size, say, from 16-bits to 12-bits will correspondingly reduce ADC and DAC usage, leading to better efficiency. Note that even though an analog architecture consists of both digital and analog computations, the overhead of analog dominates - 61% of the total power .
We will first take a closer look at a simple dot-product being performed using crossbars. Consider a 1128 vector being multiplied with a 128128 matrix (all values are 16 bits). Figure 2 shows the energy breakdown of the vector-matrix multiplication pipeline compared against digital designs for various architectures. To model the analog overhead, we consider 2-bit cells, 1-bit DAC, and 16-bit values interleaved across eight crossbars. In a single iteration, a crossbar column is performing a dot-product involving 128 rows, 1-bit inputs, and 2-bit cells; it therefore produces a 9-bit result requiring a 9-bit ADC111 Prior work (ISAAC) has shown that simple data encoding schemes can reduce the ADC resolution by 1 bit . .
We must shift and add the results of eight such columns, yielding a 23-bit result. These results must also be shifted and added across 16 iterations, finally yielding a 39-bit output. Finally, the scaling factor is applied to convert the 39-bit result to a 16-bit output. As the figure shows, communication and memory accesses are the major limiting factor for digital architectures, whereas for analog, computation overhead, primarily arising from ADC dominates.
Based on these observations, we present optimizations that exploit high compute density and flexible datapath enabled by analog to improve computation efficiency. These optimizations are applicable to any accelerator that uses analog in-situ crossbars as the techniques primarily target high ADC overhead. Once we improve the efficiency of the computation, the next major overhead comes from communication of values. As communication (on-chip and off-chip) and storage overheads (SRAM or eDRAM buffers) depend on the overall accelerator architecture, we choose the ISAAC architecture as the baseline when discussing our optimizations.
Iii-a Reducing Computational Overhead
Iii-A1 Karatsuba’s Divide and Conquer Multiplication Technique
With ADC being the major contributor to the total power, we discuss a divide and conquer strategy at the bit level, that reduces pressure on ADC usage and hence ADC power. A classic multiplication approach for two n-bit numbers has a complexity of where each bit of a number is multiplied with n-bits of the other number, and the partial results are shifted and added to get the final 2n-bit result.
Karatsuba’s divide and conquer algorithm manages to reduce the complexity from to . As shown in Figure 3, it divides the numbers into two halves of n/2 bits, MSB bits and LSB bits, and instead of performing four smaller n/2-bit multiplications, it calculates the result with two n/2-bit multiplications and one (n/2 + 1)-bit multiplication.
To illustrate the benefit of this technique, consider the same example discussed earlier using 128x128 crossbars, 2-bit cells, and 1-bit DAC. The product of input and weight is performed on 8 crossbars in 16 cycles (since each weight is spread across 8 cells in 8 different crossbars and the input is spread across 16 iterations). In the example in Figure 3, is performed on four crossbars in 8 iterations (since we are dealing with fewer bits for weights and inputs). The same is true for . A third set of crossbars stores the weights and receives the pre-computed inputs . This computation is spread across 5 crossbars and 9 iterations. We see that the total amount of work has reduced by 15%.
There are a few drawbacks as well. A computation now takes 17 iterations instead of 16. The net area increases because the network must send inputs and in parallel, an additional crossbar is needed, the output buffer is larger to store subproducts, and 128 1-bit full adders are required to compute . Again, given that the ADC is the primary bottleneck, these other overheads are relatively minor.
Iii-A2 Strassen’s Algorithm
A divide and conquer approach can also be applied to matrix-matrix multiplication. By partitioning each matrix and into 4 sub-matrices, we can express matrix-matrix multiplication in terms of multiplications of sub-matrices. A typical algorithm would require 8 sub-matrix multiplications, followed by an aggregation step. But as shown in Figure 4, linear algebra manipulations can perform the same computation with 7 sub-matrix multiplications, with appropriate pre- and post- processing. Similar to Karatsuba’s algorithm, this has the advantage of reducing ADC usage and power.
The above two optimizations reduce the computation energy by 20.6% while incurring a storage overhead of 4.3%. While both divide and conquer algorithms (Karatsuba’s and Strassen’s algorithms) are highly effective for a crossbar-based architecture, they have very little impact on other digital accelerators. For example, these algorithms may impact the efficiency of the NFUs in DaDianNao, but DaDianNao area is dominated by eDRAM banks and not NFUs. In fact, Strassen’s algorithm can lower DaDianNao efficiency because buffering requirements may increase. On the other hand, analog computations are dominated by ADCs, so efficient computation does noticeably impact overall efficiency. Further, some of the pre-processing for these algorithms is performed when installing weights on analog crossbars, but has to be performed on-the-fly for digital accelerators.
Iii-A3 Adaptive ADCs.
A simple dot-product operation on 16-bit values performed using crossbars typically result in an output of more than 16-bits. In the example discussed earlier, using 2-bit cells in crossbars and 1-bit DACs yielded 39-bit output. Once the scaling factor is applied, the least significant 10 bits are dropped. The most significant 13 bits represent an overflow that cannot be captured in the 16-bit result, so they are effectively used to clamp the result to a maximum value.
What is of note here is that the output from every crossbar column in every iteration is being resolved with a high-precision 9-bit ADC, but many of these bits contribute to either the 10 least significant bits or the 13 most significant bits that are eventually going to be ignored. This is an opportunity to lower the ADC precision and ignore some bits, depending on the column and the iteration being processed. Figure 5 shows the number of relevant bits emerging from every column in every iteration. Note that before dropping the highest ignored least significant bit, we use rounding modes to generate carries, similar to .
The ADC accounts for a significant fraction of IMA power. When the ADC is operating at a lower resolution, it has less work to do. In every 100 ns iteration, we tune the resolution of a SAR ADC to match the requirement in Figure 5. Thus, the use of adaptive ADCs helps reduce IMA power while having no impact on performance. We are also ignoring bits that do not show up in a 16-bit fixed-point result, so we are not impacting the functional behavior of the algorithm, thus having zero impact on algorithm accuracy.
A SAR ADC does a binary search over the input voltage to find the digital value, starting from the MSB. A bit is set to 1, and the resulting digital value is converted to analog and compared with the input voltage. If the input voltage is higher, the bit is set to one, the next bit is changed, and the process repeats. If the number of bits to be sampled is reduced, the circuit can ignore the latter stages. The ADC simply gates off its circuits until the next sample is provided. It is important to note that the ADC starts the binary search from the MSB, thus it is not possible to sample just the lower significant bits of an output without knowing the MSBs. But in this case, we have a unique advantage: if any of the MSBs to be truncated is 1, then the output neuron value is clamped to the highest value in the fixed point range. Thus, in order to sample a set of LSBs, the ADC starts the binary search with the LSB+1 bit. If that comparison yields true, it means at least one of the MSB bits is one. This signal is sent across the inter-crossbar network (e.g. HTree) and the output is clamped.
In conventional SAR ADCs , a third of the power is dissipated in the capacitive DAC (CDAC), a third in digital circuits, and a third in other analog circuits. The MSB decision in general consumes more power because it involves charging up the CDAC at the end of every sampling iteration. Recent trends show CDAC power diminishing due to use of tiny unit capacitances (about 2fF) and innovative reference buffer designs, leading to ADCs consuming more power in analog and digital circuits [18, 23]. The Adaptive ADC technique is able to reduce energy consumption irrespective of the ADC design since it eliminates both LSB and MSB tests across the 16 iterations.
Iii-B Communication and Storage Optimizations
So far, we have discussed optimization techniques that are applicable to any analog crossbar architectures. To further improve analog accelerator efficiency, it is critical to also reduce communication and storage overhead. As the effectiveness of optimizing communication varies based on the overall architecture, we describe our proposals in the context of the ISAAC architecture. Similar to ISAAC, we employ a tiled architecture, where every tile is composed of several IMAs. A set of IMAs along with digital computational units and eDRAM storage form a tile.
The previous sub-section focused on techniques to improve an IMA; we now shift our focus to the design of a tile. We first reduce the size of the buffer in each tile that feeds all its IMAs. We then create heterogeneous tiles that suit convolutional and fully-connected layers.
Iii-B1 Reducing Buffer Sizes.
Because ISAAC did not place constraints on how layers are mapped to crossbars and tiles, the eDRAM buffer was sized to 64KB to accommodate the worst-case requirements of workloads. Here, we design mapping techniques that reduce storage requirements per tile and move that requirement closer to the average-case.
To explain the impact of mapping on buffering requirements, first consider the convolutional layer shown in Figure 6a. Once a certain number of inputs are buffered (shown in green and pink), the layer enters steady state; every new input pixel allows the convolution to advance by another step. The buffer size is a constant as the convolution advances (each new input evicts an old input that is no longer required). In every step, a subset of the input buffer is fed as input to the crossbar to produce one pixel in each of many output feature maps. If the crossbar is large, it is split across 2 tiles, as shown in Figure 6a. The split is done so that Tile 1 manages the green buffer and green inputs, and Tile 2 manages the pink buffer and pink inputs. Such a split means that inputs do not have to be replicated on both tiles, and buffering requirements are low.
Now, consider an early convolutional layer. Early convolutional layers have more work to do than later layers since they deal with larger feature maps. In ISAAC, to make the pipeline balanced, early convolutional layers are replicated so their throughput matches those of later layers. Figure 6
b replicates the crossbar; one is responsible for every odd pixel in the output feature maps, while the other is responsible for every even pixel. In any step, both crossbars receive very similar inputs. So the same input buffer can feed both crossbars.
If a replicated layer is large enough that it must be spread across (say) 4 tiles, we have two options. Figure 6 c and d show these two options. If the odd computation is spread across two tiles (1 and 2) and the even computation is spread across two different tiles (3 and 4), the same green inputs have to be sent to Tile 1 and Tile 3, i.e., the input buffers are replicated. Instead, as shown in Figure 6d, if we co-locate the top quadrant of the odd computation and the top quadrant of the even computation in Tile 1, the green inputs are consumed entirely within Tile 1 and do not have to be replicated. This partitioning leads to the minimum buffer requirement.
The bottomline from this mapping is that when a layer is replicated, the buffering requirements per neuron and per tile are reduced. This is because multiple neurons that receive similar inputs can reuse the contents of the input buffer. Therefore, heavily replicated (early) layers have lower buffer requirements per tile than lightly replicated (later) layers. If we mapped these layers to tiles as shown in Figure 7a, the worst-case buffering requirement goes up (64 KB for the last layer), and early layers end up under-utilizing their 64 KB buffer. To reduce the worst-case requirement and the under-utilization, we instead map layers to tiles as shown in Figure 7b. Every layer is finely partitioned and spread across 10 tiles, and every tile maps part of a layer. By spreading each layer across many tiles, every tile can enjoy the buffering efficiency of early layers. By moving every tile’s buffer requirement closer to the average-case (21 KB in this example), we can design a tile with a smaller eDRAM buffer (21 KB instead of 64 KB) that achieves higher overall computational efficiency. This has minimal impact on inter-tile neuron communication because adjacent layers are mapped to the same tile and hence, even though a single layer is distributed across multiple tiles, the neurons being communicated across layers have to typically travel short distances.
Iii-B2 Different Tiles for Convolutions and Classifiers.
While ISAAC uses the same homogeneous tile for the entire chip, we observe that convolutional layers have very different resource demands than fully-connected classifier layers. The classifier (or FC) layer has to aggregate a set of inputs required by a set of crossbars; the crossbars then perform their computation; the inputs are discarded and a new set of inputs is aggregated. This results in the following properties for the classifier layer:
The classifier layer has a high communication-to-compute ratio, so the router bandwidth puts a limit on how often the crossbars can be busy.
The classifier also has the highest synaptic weight requirement because every neuron has private weights.
The classifier has low buffering requirements – an input is seen by several neurons in parallel, and the input can be discarded right after.
We therefore design special tiles customized for classifier layers that:
have a higher crossbar-to-ADC ratio (4:1 instead of 1:1),
operate the ADC at a lower rate (10 Msamples/sec instead of 1.2 Gsamples/sec),
have a smaller eDRAM buffer size (4 KB instead of 16 KB).
For small-scale workloads that are trying to fit on a single chip, we would design a chip where many of the tiles are conv-tiles and some are classifier-tiles (a ratio of 1:1 is a good fit for most of our workloads). For large-scale workloads that use multiple chips, each chip can be homogeneous; we use roughly an equal number of conv-chips and classifier-chips. The results consider both cases.
Iii-C Putting the Pieces Together
We use ISAAC as the baseline architecture and evaluate the proposed optimizations by enhancing it. We already presented a general overview of ISAAC in section 1. We make two key enhancements to ISAAC to improve both area and compute efficiencies. Note that these two optimizations are specific to ISAAC architecture, and following this, we present implementation details of numerical algorithms discussed in previous sub-sections.
First, ISAAC did not place any constraints on how a neural network can be mapped to its many tiles and IMAs. As a result, its resources, notably the HTree and buffers within an IMA, are provisioned to handle the worst case. This has a negative impact on power and area efficiency. Instead, we place constraints on how the workload is mapped to IMAs. While this inflexibility can waste a few resources, we observe that it also significantly reduces the HTree size and hence area per IMA. The architecture is still general-purpose, i.e., arbitrary CNNs can be mapped to it.
Second, within an IMA, we co-locate an ADC with each crossbar. The digitized outputs are then sent to the IMA’s output register via an HTree network. While ISAAC was agnostic to how a single synaptic weight was scattered across multiple bitlines, we adopt the following approach to boost efficiency. A 16-bit weight is scattered across 8 2-bit cells; each cell is placed in a different crossbar. Therefore, crossbars 0 and 8 are responsible for the least significant bits of every weight, and crossbars 7 and 15 are responsible for the most significant bits of every weight. We also embed the shift-and-add units in the HTree. So the shift-and-add unit at the leaf of the HTree adds the digitized 9-bit dot-product results emerging from two neighboring crossbars. Because the operation is a shift-and-add, it produces a 11-bit result. The next shift-and-add unit takes 2 11-bit inputs to produce a 13-bit input, and so on. We further modify mapping by placing the constraint that an IMA cannot be shared by multiple network layers.
To implement Karatsuba’s algorithm, we modify the In-situ Multiply Accumulate units (IMA) as shown in Figure 9. The changes are localized to a single mat. Each mat now has two crossbars that share the DAC and ADC. Given the size of the ADC, the extra crossbar per mat has a minimal impact on area. The left crossbars in four of the mats now store (Figure 3); the left crossbars in the other four mats store ; the right crossbars in five of the mats store ; the right crossbars in three of the mats are unused. In the first 8 iterations, the 8 ADCs are used by the left crossbars. In the next 9 iterations, 5 ADCs are used by the right crossbars. As discussed earlier, the main objective here is to lower power by reducing use of the ADC. Divide & Conquer can be recursively applied further. When applied again, the computation keeps 8 ADCs busy in the first 4 iterations, and 6 ADCs in the next 10 iterations. This is a 28% reduction in ADC use, and a 13% reduction in execution time. But, we pay an area penalty because 20 crossbars are needed per IMA. Figure 8 shows the mapping of computations within IMA to implement Strassen’s algorithm. The computations () in Strassen’s algorithm (Figure 4) are mapped to 7 IMAs in the tile. The 8th IMA can be allocated to another layer’s computation.
With all these changes targeting high compute efficiency and low communication and storage overhead, we refer to the updated analog design as the Newton architecture.
Modeling Area and Energy
For modeling the energy and area of the eDRAM buffers and on-chip interconnect like the HTree and tile bus, we use CACTI 6.5  at 32 nm. The area and energy model of a memristor crossbar is based on . We adapt the area and energy of shift-and-add circuits, max/average pooling block and sigmoid operation similar to the analysis in DaDianNao  and ISAAC . We avail the same HyperTransport serial link model for off-chip interconnects as used by DaDianNao  and ISAAC . The router area and energy is modeled using Orion 2.0 . While our buffers can also be implemented with SRAM, we use eDRAM to make an apples-to-apples comparison with the ISAAC baseline. Newton is only used for inference, with a delay of 16.4 ms to pre-load weights in a chip.
In order to model the ADC energy and area, we use a recent survey  of ADC circuits published in different circuit conferences. The Newton architecture uses the same 8-bit ADC  at 32 nm as used in ISAAC, partly because it yields the best configuration in terms of area/power and meets the sampling frequency requirement, and partly because it can be reconfigured for different resolutions. This is at the cost of minimal increase in area of the ADC. We scale the ADC power with respect to different sampling frequency according to another work by Kull et al. . The SAR ADC has six different components: comparators, asynchronous clock logic, sampling clock logic, data memory and state logic, reference buffer, and capacitive DAC. The ADC power for different sampling resolution is modeled by gating off the other components except the sampling clock.
We consider a 1-bit DAC as used in ISAAC because it is relatively small and has high SNR value. Since DAC is used in every row of the crossbar, a 1-bit DAC improves the area efficiency.
The key parameters in the architecture that largely contribute to our analysis are reported in Table I.
This work considers recent workloads with state-of-the-art accuracy in image classification tasks (summarized in Table II
). We create an analytic model for a Newton pipeline within an IMA and within a tile and map the suite of benchmarks, making sure that there are no structural hazards in any of these pipelines. We consider network bandwidth limitations in our simulation model to estimate throughput. Since ISAAC is a throughput architecture, we do an iso-throughput comparison of the Newton architecture with ISAAC for the different intra-IMA or intra-tile optimizations. Since the dataflow in the architecture is bounded by the router bandwidth, in each case, we allocate enough resources till the network saturates to create our baseline model. For subsequent optimizations, we retain the same throughput. Similar to ISAAC, data transfers between tiles on-chip, and on the HT link across chips have been statically routed to make it conflict free. Like ISAAC, the latency and throughput of Newton for the given benchmarks can be calculated analytically using a deterministic execution model. Since there aren’t any run-time dependencies on the control flow or data flow of the deep networks, analytical estimates are enough to capture the behavior of cycle-accurate simulations.
We create a similar model for ISAAC, taking into considerations all the parameters mentioned in their paper.
The Newton architecture can be designed by optimizing one of the following two metrics:
CE: Computational Efficiency which is the number of fixed point operations performed per second per unit area, .
PE: Power Efficiency which is the number of fixed point operations performed per second per unit power, .
For every considered innovation, we model Newton for a variety of design points that vary crossbar size, number of crossbars per IMA, and number of IMAs per tile. In most cases, the same configurations emerged as the best. We therefore focus most of our analysis on this optimal configuration that has 16 IMAs per tile, where each IMA uses 16 crossbars to process 128 inputs for 256 neurons. We report the area, power, and energy improvement for all the deep neural networks in our benchmark suite.
|Router||32 flits, 8 ports||168 mW||0.604|
|ADC||8-bit resolution||3.1 mW||0.0015|
|1.2 GSps frequency|
|Hyper Tr||4 links @ 1.6GHz||10.4 W||22.88|
|6.4 GB/s link bw|
|DAC array||128 1-bit resolution||0.5 mW||0.00002|
|Memristor crossbar||0.3 mW||0.0001|
|224||11x11, 96 (4)||3x3,64 (1)||3x3,64 (2)||3x3,64 (2)||3x3,64 (2)||7x7,96/2(1)||7x7,96/2(1)||7x7,96/2(1)||7x7,64/2|
|3x3 pool/2||2x2 pool/2||3x3 pool/2|
|112||3x3,128 (1)||3x3,128 (2)||3x3,128 (2)||3x3,128 (2)|
|56||3x3,256 (2)||3x3,256 (2)||3x3,256 (3)||3x3,256 (4)||3x3,256 (5)||3x3,256 (6)||3x3,384 (6)||3x3,64 (6)|
|28||5x5,256 (1)||3x3,512 (2)||3x3,512 (2)||3x3,512 (3)||3x3,512 (4)||3x3,512 (5)||3x3,512 (6)||3x3,768 (6)||3x3,128 (7)|
|3x3 pool/2||2x2 pool/2||3x3,256/2 (1)|
|14||3x3,384 (2)||3x3,512 (2)||3x3,512 (2)||3x3,512 (3)||3x3,512 (4)||3x3,512 (5)||3x3,512 (6)||3x3,896 (6)||3x3,256 (11)|
|3x3,256 (1)||1x1,512 (1)|
|3x3 pool/2||2x2 pool/2||spp,7,3,2,1||3x3,512/2 (1)|
/stride (t), where t is the number of such layers. Stride is 1 unless explicitly mentioned. Layer* denotes convolution layer with private kernels.
The Newton architecture takes the baseline analog accelerator ISAAC and incrementally applies innovations discussed earlier. We begin by describing results for optimizations targeting global components such as h-tree followed by tile and IMA level techniques. As mentioned earlier, while we build on ISAAC and use it for evaluation, the proposed enhancements to crossbar are applicable to any analog architecture.
Constrained Mapping for Compact HTree
We first observe that the ISAAC IMA is designed with an over-provisioned HTree that can handle a worst-case mapping of the workload. We imposed the constraint that an IMA can only handle a single layer, and a maximum of 128 inputs. This restricts the width of the HTree, promotes input sharing, and enables reduction of partial neuron values at the junctions of the HTree. While this helps shrink the size of an IMA, it suffers from crossbar under-utilization within an IMA. We consider different IMA sizes, ranging from which supplies the same 128 neurons to 4 crossbars to get 64 output neurons, to . Figure 10 plots the average under-utilization of crossbars across the different workloads in the benchmark suite. For larger IMA sizes, the under-utilization is quite significant. Larger IMA sizes also result in complex HTrees. Therefore, a moderately sized IMA that processes 128 inputs for 256 neurons has high computational efficiency and low crossbar under-utilization. For this design, the under-utilization is only 9%. Figure 11 quantifies how our constrained mapping and compact HTree improve area, power, and energy per workload. In short, our constraints have improved area efficiency by 37% and power/energy efficiency by 18%, while leaving only 9% of crossbars under-utilized.
Heterogeneous ADC Sampling
The heteregenous sampling of outputs using adaptive ADCs has a big impact on reducing the power profile of the analog accelerator. In one iteration of 100 ns, at max 4 ADCs work at the max resolution of 8-bits. Power supply to the rest of the ADCs can be reduced. We measure the reduction of area, power, and energy with respect to the new IMA design with the compact HTree. Since ADC contributed to 49% of the chip power in ISAAC, reducing the oversampling of ADC reduces power requirement by 15% on average. The area efficiency improves as well since the output-HTree now carries 16-bits instead of unnecessarily carrying 39-bits of final output. The improvements are shown in Figure 12.
We further try to reduce the power profile with divide-and-conquer within an IMA. Figure 13 shows the impact of recursively applying the divide-and-conquer technique multiple times. Applying it once is nearly as good as applying it twice, and much less complex. Therefore, we focus on a single divide-and-conquer step. Improvements are reported in Figure 14. Energy efficiency improves by almost 25% over the previous design point, because ADCs end up being used 75% of the times in the 1700 ns window. However, this comes at a cost of 6.4% reduction in area efficiency because of the need for more crossbars and increase in HTree bandwidth to send the sum of inputs.
eDRAM Buffer Requirements
In Figure 15, we report the buffer requirement per tile when the layers are spread across many tiles. We consider this for a variety of tile/IMA configurations. Image size has a linear impact on the buffering requirement. For 256256 images, the buffer reduction technique leads to the choice of a 16 KB buffer instead of the 64 KB used in ISAAC, a 75% reduction. Figure 16 shows 6.5% average improvement in area efficiency because of this technique.
Conv-Tiles and Classifier-Tiles
Figure 17 plots the decrease in power requirement when FC tiles are operated at , and slower than the conv tiles. None of these configurations lower the throughput as the FC layer is not on the critical path. Since ADC power scales linearly with sampling resolution, the power profile is lowest when the ADCs work slower. This leads to 50% lower peak power on average. In Figure 18, we plot the increase in area efficiency when multiple crossbars share the same ADC in FC tiles. The underutilization of FC tiles provides room for making them storage efficient, saving on average 38% of chip area. We do not increase the ratio beyond 4 because the multiplexer connecting the crossbars to the ADC becomes complex. Resnet does not gain much from the heterogeneous tiles because it needs relatively fewer FC tiles.
Strassen’s optimization is especially useful when large matrix multiplication can be performed in the conv layers without much wastage of crossbars. This provides room for decomposition of these large matrices, which is the key part of Strassen’s technique. We note that Resnet has high wastage when using larger IMAs, and thus does not benefit at all from this technique. Overall, Strassen’s algorithm increases the energy efficiency by 4.5% as seen in Figure 19.
Putting it all together.
Figure 20 plots the incremental effect of each of our techniques on peak computational and power efficiency of DaDianNao, ISAAC, and Newton. We do not include the heterogeneous FC tile in this plot because it is forcibly operated slowly because it is non-critical; as a result, it’s peak throughput is lower by definition. We see that both adaptive ADC and divide & conquer play a significant role in increasing the PE. While the impact of Strassen’s technique is not visible in this graph, it manages to free up resources (1 every 8 IMA) in a tile, thus providing room for more compact mapping of networks, and reducing ADC utilization.
Figure 21 shows a per-benchmark improvement in area efficiency and the contribution of each of our techniques. The compact HTree and the FC tiles are the biggest contributors. Figure 22 similarly shows a breakdown for decrease in power envelope, and Figure 23 does the same of improvement in energy efficiency. Multiple innovations (HTree, adaptive ADC, Karatsuba, FC tiles) contribute equally to the improvements. We also observed that the Adaptive ADC technique’s improvement is not very sensitive to the ADC design. We evaluated ADCs where the CDAC power dissipates 10% and 27% of ADC power; the corresponding improvements with the Adaptive ADC were 13% and 12% respectively.
Figure 24 compares the 8-bit version of Newtonwith Google’s TPU architecture. Note that while Google has already announced second generation TPU with 16-bit support, its architectural details are not public yet. Hence, we limit our analysis to TPU-1. Also, we scale the area such that the die area is same for both the architectures, i.e. an iso-area comparison. For TPU, we perform batch processing enough to not exceed the latency target of 7ms as demanded by most application developers. Since Newtonpipeline is deterministic and as its crossbars are statically mapped to different layers, the latency of images is always the same irrespective of batch size, which is comfortably less than 7ms for all the evaluated benchmarks. We also model TPU-1 with GDDR5 memory to allocate sufficient bandwidth.
Figure 24 shows throughput and energy improvement of Newtonover TPU for various benchmarks. Newtonhas an average improvement of 10.3 in throughput and 3.4 in energy over TPU. In terms of computational efficiency (CE) calculated using peak throughput, Newtonis 12.3 better than TPU. However, when operating on FC layer, due to idle crossbars in Newton, this advantage reduces to 10.3 for actual workloads.
When considering power efficiency (PE) calculated using peak throughput and area, although Newtonis only 1.6 better than TPU, the actual benefit goes up for real workloads, increasing it to 3.4. This is because of TPU’s low memory bandwidth coupled with reduced batch size for some workloads. As we discussed earlier, the batch size in TPU is adjusted to meet the latency target. Since large batch size alleviates memory bandwidth problem, reducing it to meet latency target directly impacts power efficiency due to more GDDR fetches and idle processing units.
From the figure, it can also be noted that the throughput improvement of Alexnet and Resnet aren’t as high as the other benchmarks because of their relatively small networks. This increases the batch size, improving the data locality for FC layer weights. On the other hand, the MSRA3 benchmark has higher energy consumption than other workloads because for MSRA3, TPU can process only one image per batch. This dramatically increases TPU’s idle time while fetching a large number of weights for the FC layers. In short, Newton’s in-situ computation achieves superior energy and performance values over TPU as the proposed design limits data movement while reducing analog computation overhead.
In this work, we target resource provisioning and efficiency in a crossbar-based deep network accelerator. Starting with the ISAAC architecture, we show that three approaches – heterogeneity, mapping constraints, and divide & conquer – can be applied within a tile and within an IMA. This results in smaller eDRAM buffers, smaller HTree, energy-efficient ADCs with varying resolution, energy- and area-efficiency in classifier layers, and fewer computations. Many of these ideas would also apply to a general accelerator for matrix-matrix multiplication, as well as to other neural networks such as RNN, LSTM, etc. The Newton architecture cuts the current gap between ISAAC and an ideal neuron in half.
Appendix: Crossbar Implementations
This appendix discusses how crossbars can be designed to withstand noise effects in analog circuits.
Process Variation and Noise: Since an analog crossbar uses actual conductance of individual cells to perform computation, it is critical to do writes at maximum precision. We make two design choices to improve write precision. First, we equip each cell with an access transistor (1T1R cell) to precisely control the amount of write current going through it. While this increases area overhead, it eliminates sneak currents and their negative impact on write voltage variation . Second, we use a closed loop write circuit with current compliance that does many iterations of program-and-verify operations. Prior work has shown that such an approach can provide more precise states at the cost of increased write time even with high process variation in cells .
In spite of a robust write process, a cell’s resistance will still deviate from its normal value within a tolerable range. This range will ultimately limit either the number of levels in a cell or the number of simultaneously active rows in a crossbar. For example, if a cell write can achieve a resistance within r (r is a function of noise and parasitic), if is the number of levels in a cell, and is the max range of resistance of a cell, then we set the number of active rows to /(.r) to ensure there are no corrupted bits at the ADC.
Crossbar Parasitic: While a sophisticated write circuit coupled with limited array size can help alleviate process variation and noise, IR drop along rows and columns can also reduce crossbar accuracy. When a crossbar is being written during initialization, the access transistors in unselected cells shut off the sneak current path, limiting the current flow to just the selected cells. However, when a crossbar operates in compute mode in which multiple rows are active, the net current in the crossbar increases, and the current path becomes more complicated. With access transistors in every cell in the selected rows in ON state, a network of resistors is formed with every cell conducting varying current based on its resistance. As wire links connecting these cells have non-zero resistance, the voltage drop along rows and columns will impact the computation accuracy. Thus, a cell at the far end of the driver will see relatively lower read voltage compared to a cell closer to the driver. This change in voltage is a function of both wire resistance and the current flowing through wordlines and bitlines, which in turn is a function of the data pattern in the array. This problem can be addressed by limiting the DAC voltage range and doing data encoding to compensate for the IR drop . Since the matrix being programmed into a crossbar is known beforehand, during the initialization phase of a crossbar, it is possible to account for voltage drops and adjust the cell resistance appropriately. Hu et al.  have demonstrated successful operation of a 256256 crossbar with 5-bit cells even in the presense of thermal noise in memristor, short noise in circuits, and random telegraphic noise in the crossbar. For this work, a conservative model with a 128128 crossbar with 2-bit cells and 1-bit DAC emerges as an ideal design point in most experiments.
bstctl:etal, bstctl:nodash, bstctl:simpurl
-  “The Tomorrow Show: Three New Technologies from Hewlett Packard Labs,” 2016, https://youtu.be/tABpRpBW6h0?t=18m38s.
-  J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Jerger, and A. Moshovos, “Cnvlutin: Zero-Neuron-Free Deep Convolutional Neural Network Computing,” in Proceedings of ISCA-43, 2016.
-  F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High Precision Tuning of State for Memristive Devices by Adaptable Variation-Tolerant Algorithm,” Nanotechnology, 2012.
-  T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning,” in Proceedings of ASPLOS, 2014.
-  Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in Proceedings of ISCA-43, 2016.
-  Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “DaDianNao: A Machine-Learning Supercomputer,” in Proceedings of MICRO-47, 2014.
-  P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory,” in Proceedings of ISCA-43, 2016.
-  Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “ShiDianNao: Shifting Vision Processing Closer to the Sensor,” in Proceedings of ISCA-42, 2015.
-  M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” in Proceedings of ASLPOS-22, 2017.
-  S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in Proceedings of ICML, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on Imagenet Classification,” in Proceedings of ICCV, 2015.
-  M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, R. S. Williams, and J. Yang, “Dot-Product Engine for Neuromorphic Computing: Programming 1T1M Crossbar to Accelerate Matrix-Vector Multiplication,” in Proceedings of DAC-53, 2016.
-  A. Kahng, B. Li, L.-S. Peh, and K. Samadi, “ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration,” in Proceedings of DATE, 2009.
-  D. Kim, J. H. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory,” in Proceedings of ISCA-43, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proceedings of NIPS, 2012.
-  L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Brandli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici, “A 3.1 mW 8b 1.2 GS/s Single-Channel Asynchronous SAR ADC with Alternate Comparators for Enhanced Speed in 32 nm Digital SOI CMOS,” Journal of Solid-State Circuits, 2013.
-  D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, “PuDianNao: A Polyvalent Machine Learning Accelerator,” in Proceedings of ASPLOS-20, 2015.
-  S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An Instruction Set Architecture for Neural Networks,” in Proceedings of ISCA-43, 2016.
-  E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, “Repeatable, Accurate, and High Speed Multi-Level Programming of Memristor 1T1R Arrays for Power Efficient Analog Computing Applications,” Nanotechnology, 2016.
-  N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0,” in Proceedings of MICRO, 2007.
-  B. Murmann, “ADC Performance Survey 1997-2015 (ISSCC & VLSI Symposium),” 2015, http://web.stanford.edu/~murmann/adcsurvey.html.
-  B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. M. Hernandez, Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators,” in Proceedings of ISCA-43, 2016.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet Large Scale
Visual Recognition Challenge,”
International Journal of Computer Vision, 2014.
-  A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. Strachan, M. Hu, R. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in Proceedings of ISCA, 2016.
-  S.Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proceedings of ISCA, 2016.
-  K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  N. Verma and A. P. Chandrakasan, “An ultra low energy 12-bit rate-resolution scalable SAR ADC for wireless sensor nodes,” IEEE Journal of Solid-State Circuits, 2007.
-  Y.Li, Y.Zhang, S.Li, P.Chi, C.Jiang, P.Qu, Y.Xie, and W.Chen, “NEUTRAMS: Neural Network Transformation and Co-design under Neuromorphic Hardware Constraints,” in Proceedings of MICRO, 2016.
-  M. Zangeneh and A. Joshi, “Design and Optimization of Nonvolatile Multibit 1T1R Resistive RAM,” Proceedings of the Transactions on VLSI Systems, 2014.