Deep Neural Networks (DNNs) are revolutionizing a wide range of services and applications such as language translation, transportation, intelligent search, e-commerce, and medical diagnosis. These benefits are predicated upon hardware platforms delivering performance and energy efficiency. With the diminishing benefits from general-purpose processors [6, 7, 8, 9], there is an explosion of digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42] is also gaining traction. Albeit low-power, mixed-signal circuitry suffers from a limited range of information encoding, is susceptible to noise, imposes Analog-to-Digital (A/D) and Digital-to-Analog (D/A) conversion overheads, and lacks fine-grained control mechanisms. Realizing the full potential of mixed-signal technology requires a balanced design that brings mathematics, architecture, and circuits together.
This paper sets out to explore this conjunction of areas by inspecting the mathematical foundation of deep neural networks. Across a wide range of models, the large majority of DNN operations belong to convolution and fully-connected layers [23, 28, 32]. Consequently, based on Amdahl’s Law, our architecture executes these two types of layers in the mixed-signal domain. Nevertheless, to maintain generality for the ever-expanding roster of other layers required by modern DNNs, the architecture handles the other layers digitally. Normally, the convolution and fully-connected layers are broken down into a series of vector dot-products, each of which generates a scalar and comprises a set of Multiply-Accumulate (MACC) operations. State-of-the-art digital [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and mixed-signal [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43] accelerators use a large array of stand-alone MACC units to perform the necessary computations. When moving to the mixed-signal domain, this stand-alone arrangement of MACC operations imposes significant overhead in the form of A/D and D/A conversions for each operation. The root cause is the high cost of converting the operands of each MACC to the analog domain and converting its output back to the digital domain.
This paper aims to address the aforementioned list of challenges by making the following three contributions.
(1) This work offers and leverages the insight that the set of MACC operations within a vector dot-product can be partitioned, rearranged, and interleaved at the bit level without affecting the mathematical integrity of the vector dot-product. Unlike prior work [33, 42, 44], this work does not rely on changing the mathematics of the computation to enable mixed-signal acceleration. Instead, it only rearranges the bit-wise arithmetic calculations to utilize lower bitwidth analog units for higher bitwidth operations. The key insight is that a binary value can be expressed as a sum of products, much like a dot-product, which is also a sum of multiplications. A value can be expressed as the sum of its individual bits, each multiplied by the power of two that marks its position, or equally as the sum of wider partitions (4-bit partitions, for instance), each multiplied by the power of two that marks the partition's position. Our interleaved bit-partitioned arithmetic effectively utilizes the distributive and associative properties of multiplication and addition at the bit granularity.
The proposed model first bit-partitions all elements of the two vectors, and then distributes the MACC operations of the dot-product over the bit partitions. Therefore, the lower bitwidth MACC becomes the basic operator that is applied to each bit-partition. Then, our mathematical formulation exploits the associative property of multiplication and addition to group bit-partitions that are at the same significance position. This significance-based rearrangement enables factoring out the power-of-two multiplicand that signifies the position of the bit-partitions. The factoring enables performing the wide group-based low-bitwidth MACC operations simultaneously as a spatially parallel operation in the analog domain, while the group shares a single A/D converter. The power-of-two multiplicand is applied later in the digital domain to the accumulated result of the group operation. To this end, we rearchitect vector dot-product as a series of wide (across multiple elements of the two vectors), interleaved, and bit-partitioned arithmetic and re-aggregation. Therefore, our reformulation significantly reduces the rate of costly A/D conversion by rearranging the bit-level operations across the elements of the vector dot-product. Using low-bitwidth operands for analog MACCs provides a larger headroom between the value encoding levels in the analog domain. The headroom tackles the limited range of encoding and offers higher robustness to noise, an inherent non-ideality in the analog mode. Additionally, using lower bitwidth operands reduces the energy/area overhead imposed by A/D and D/A converters, which scales roughly exponentially with the bitwidth of operands.
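The reformulation above can be written compactly as follows. The notation is illustrative and assumed rather than the paper's own: $x_i$ and $w_i$ are unsigned $B$-bit elements of two $N$-element vectors, $p$ is the partition width, $K = B/p$ is the number of partitions per element, and $x_i^{(j)}$ is the $j$-th $p$-bit partition of $x_i$ (with $j = 0$ the least significant).

```latex
\begin{align}
x_i &= \sum_{j=0}^{K-1} 2^{pj}\, x_i^{(j)}, \qquad
w_i = \sum_{k=0}^{K-1} 2^{pk}\, w_i^{(k)} \\
\sum_{i=1}^{N} x_i\, w_i
  &= \sum_{i=1}^{N} \sum_{j=0}^{K-1} \sum_{k=0}^{K-1} 2^{p(j+k)}\, x_i^{(j)}\, w_i^{(k)} \\
  &= \sum_{j=0}^{K-1} \sum_{k=0}^{K-1} 2^{p(j+k)}
     \underbrace{\left( \sum_{i=1}^{N} x_i^{(j)}\, w_i^{(k)} \right)}_{\text{wide low-bitwidth MACC}}
\end{align}
```

Under these assumptions, each inner sum is a group of low-bitwidth MACCs that can share one A/D conversion, and the factor $2^{p(j+k)}$ is a digital shift applied once per group.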
(2) At the circuit level, the accelerator is designed using switched-capacitor circuitry that stores the partial results as electric charge over time without conversion to the digital domain at each cycle. The low-bitwidth MACCs are performed in charge domain with a set of charge-sharing capacitors. This design choice lowers the rate of A/D conversion as it implements accumulation as a gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the result of a group of low-bitwidth MACCs, but also enable accumulating results over time. As such, the architecture enables dividing the longer vectors into shorter sub-vectors that are multiply-accumulated over time with a single group of low-bitwidth MACCs. The results are accumulated over multiple cycles in the group’s capacitors. Because the capacitors can hold the charge from cycle to cycle, the A/D conversion is not necessary in each cycle. This reduction in rate of A/D conversion is in addition to the amortized cost of A/D convertors across the bit-partitioned analog MACCs of the group.
(3) Based on these insights, we devise a clustered 3D-stacked microarchitecture, dubbed , that provides the capability to integrate copious number of low-bitwidth switched-capacitor MACC units that enables the interleaved bit-partitioned arithmetic. The lower energy of mixed-signal computations offers the possibility of integrating a larger number of these units compared to their digital counterpart. To efficiently utilize the more sizable number of compute units, a higher bandwidth memory subsystem is needed. Moreover, one of the large sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. Based on these insights, we devise a clustered architecture for that leverages 3D-stacking for its higher bandwidth and lower data transfer energy.
Evaluating the carefully balanced design of with ten DNN benchmarks shows that delivers speedup over the leading purely digital 3D-stacked DNN accelerator, , with only 0.5% loss in accuracy achieved after mitigating noise, computation error, and Process-Voltage-Temperature (PVT) variations. With 8-bit execution, offers and higher Performance-per-Watt compared to RTX 2080 TI and Titan Xp, respectively. With these benefits, this paper marks an initial effort that paves the way for a new shift in DNN acceleration.
2 Wide, Interleaved, and Bit-Partitioned Arithmetic
A key idea of this work is the mathematical insight that enables utilizing low bitwidth mixed-signal units in spatially parallel groups. This section demonstrates this insight.
Bit-Level partitioning and interleaving of MACCs. To further detail the proposed mathematical reformulation, Figure 1(a) delves into the bit-level operations of a dot-product on 2-element vectors containing 4-bit values. As illustrated with different colors, each 4-bit element can be written as a sum of 2-bit partitions multiplied by powers of 2 (shifts). As discussed, vector dot-product is also a sum of multiplications. Therefore, by utilizing the distributive property of addition and multiplication, we can rewrite the vector dot-product in terms of the bit partitions. We also leverage the associativity of addition and multiplication to group the bit-partitions in the same positions together. For instance, in Figure 1, the black partitions that represent the Most Significant Bits (MSBs) of one vector are multiplied in parallel with the teal partitions, which represent the MSBs of the other vector (teal in Figure 1 is the darkest gray in black-and-white prints). Because of the distributivity of multiplication, the shift amount of (2+2) can be postponed until after the bit-partitions are multiply-accumulated. The different colors of the boxes in Figure 1 illustrate the interleaved grouping of the bit-partitions. Each group is a set of spatially parallel bit-partitioned MACC operations that are drawn from different elements of the two vectors. The low-bitwidth nature of these operations enables execution in the analog domain without the need for A/D conversion for each individual bit-partitioned operation. As such, our proposed reformulation amortizes the cost of A/D conversion across the bit-partitions of different elements of the vectors, as elaborated below.
Wide, interleaved, and bit-partitioned vector dot-product. Figure 1(b) illustrates the proposed vector dot-product operation with 4-bit elements that are bit-partitioned into 2-bit sub-elements. For instance, as illustrated, each element of the first vector is first bit-partitioned into two sub-elements: one holds the two Least Significant Bits (LSBs) and the other holds the two Most Significant Bits (MSBs). Similarly, the elements of the second vector are also bit-partitioned into LSB and MSB sub-elements. Then, each vector is rearranged into two bit-partitioned sub-vectors, one for each significance position. In the current implementations of the architecture, the size of the bit-partition is fixed across the entire architecture. Therefore, the rearrangement is just a rewiring of the bits to the compute units, which imposes minimal overhead (less than 1%). Figure 1 is merely an illustration and there is no need for extra storage or movement of elements. As depicted with color coding, after the rewiring, one sub-vector holds all the least significant bit-partitions from the different elements of the first vector, while the MSBs are rewired into the other. The same rewiring is repeated for the second vector. This rearrangement puts all the bit-partitions with the same significance, from all the elements of a vector, in one group, yielding four groups in total. Therefore, when a pair of the groups (e.g., the pair shown in Figure 1(c)) are multiplied to generate the partial products, (1) the shift amount is the same for all the bit-partitions and (2) the shift can be done after the partial products from different sub-elements are accumulated together.
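The rearrangement described above can be checked numerically with a short sketch. It assumes unsigned operands, and the function names `bit_partitions` and `interleaved_dot` are hypothetical helpers introduced here for illustration; they model only the arithmetic, not the analog hardware.

```python
def bit_partitions(value, bitwidth, part_bits):
    """Split an unsigned value into part_bits-wide partitions, least significant first."""
    mask = (1 << part_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bitwidth, part_bits)]


def interleaved_dot(xs, ws, bitwidth=4, part_bits=2):
    """Dot product computed over groups of equally significant bit-partitions."""
    # "Rewire" each vector into sub-vectors that hold partitions of one significance.
    x_groups = list(zip(*(bit_partitions(x, bitwidth, part_bits) for x in xs)))
    w_groups = list(zip(*(bit_partitions(w, bitwidth, part_bits) for w in ws)))
    total = 0
    for j, x_group in enumerate(x_groups):
        for k, w_group in enumerate(w_groups):
            # Wide low-bitwidth multiply-accumulate across all vector elements.
            group_sum = sum(a * b for a, b in zip(x_group, w_group))
            # One shared shift per group, applied after accumulation.
            total += group_sum << (part_bits * (j + k))
    return total


xs = [13, 6]
ws = [11, 9]
print(interleaved_dot(xs, ws))               # 197
print(sum(x * w for x, w in zip(xs, ws)))    # 197
```

Both prints give the same value because the grouping only reorders additions and factors out common powers of two.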
As shown in Figure 1(c), the low-bitwidth elements are multiplied together and accumulated in the analog domain. Accumulation in the digital domain would require an adder tree which is costly compared to the analog accumulation that merely requires connectivity between the multiplier outputs. It is only after several analog multiply-accumulations that the results are converted back to digital for shift and aggregation with partial products from the other groups. The size of the vectors usually exceeds the number of parallel low-bitwidth MACCs, in which case the results need to be accumulated over multiple iterations. As will be discussed in the next section, the accumulations are performed in two steps. The first step accumulates the results in the analog domain through charge accumulation in capacitors before A/D convertors (see Figure 1(c)). In the second step, these converted accumulations will be added up in the digital domain using a register. For this pattern of computation, we are effectively utilizing the distributive and associative property of multiplication and addition for dot-product but at the bit granularity. This rearrangement and spatially parallel (i.e., wide) bit-partitioned computation is in contrast with temporally bit-serial digital [17, 13, 31, 45] and analog  DNN accelerators.
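The two-step accumulation can be sketched as an ideal (noise-free, unquantized) model that counts how many A/D conversions one significance group needs. The function name `two_step_dot` and the parameter `analog_cycles=4` are illustrative assumptions; the integer `charge` merely stands in for the charge held on the capacitors.

```python
def two_step_dot(xg, wg, macc_units=8, analog_cycles=4):
    """Ideal model of one group: returns (dot product, number of A/D conversions)."""
    register = 0      # step 2: digital accumulation after each conversion
    charge = 0        # step 1: stands in for charge accumulated on capacitors
    pending = 0       # analog cycles accumulated since the last conversion
    conversions = 0
    for start in range(0, len(xg), macc_units):
        stop = start + macc_units
        # One cycle: all units multiply-accumulate one sub-vector in parallel.
        charge += sum(a * b for a, b in zip(xg[start:stop], wg[start:stop]))
        pending += 1
        if pending == analog_cycles:
            register += charge    # a single conversion covers all pending cycles
            conversions += 1
            charge, pending = 0, 0
    if pending:                   # flush a partially filled accumulation window
        register += charge
        conversions += 1
    return register, conversions


xg = [i % 4 for i in range(64)]             # 2-bit partitions of one vector
wg = [(3 * i + 1) % 4 for i in range(64)]   # 2-bit partitions of the other
print(two_step_dot(xg, wg, analog_cycles=4))   # same sum, 2 conversions
print(two_step_dot(xg, wg, analog_cycles=1))   # same sum, 8 conversions
```

In this model, holding results in the analog domain for four cycles cuts the conversion count by four without changing the result.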
The next section describes the architecture of the mixed-signal accelerator that leverages our mathematical reformulation. This architecture is essentially a collection of the structure that is depicted in Figure 1(c). The structure is the Mixed-Signal Wide Aggregator, which spatially aggregates the results from its four units as illustrated. Each of these four units, which are also wide, is a Mixed-Signal Bit-Partitioned MACC. Note that the number of Mixed-Signal Bit-Partitioned MACC units in a Mixed-Signal Wide Aggregator is a function of the bitwidth of the vector elements and the size of the bit-partitions.
3 Mixed-Signal Architecture Design for Wide Bit-Partitioning
To exploit the aforementioned arithmetic, comes with a mixed-signal building block that performs wide bit-partitioned vector dot-product. then organizes these building blocks in a clustered hierarchical design to efficiently make use of its copious number of parallel low-bitwidth mixed-signal MACC units. The clustered design is crucial as mixed-signal paradigm enables integrating a larger number of parallel operators than the digital counterpart.
3.1 Wide Bit-Partitioned Mixed-Signal MACC
As Figure 2(a) shows, the building block of is a collection of low-bitwidth analog MACCs that operate in parallel on sub-elements from the two vectors under dot-product. This wide structure is dubbed . We design the low-bitwidth MACCs using switched-capacitor circuitry for the following reason. This design choice lowers the rate of A/D conversion as it implements accumulation as a gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the results of low-bitwidth MACCs, but also enable accumulating results over time. As such, longer vectors are divided into shorter sub-vectors that are multiply-accumulated over time without the need to convert the intermediate results back to the digital domain. It is only after processing multiple sub-vectors that the accumulated result is converted to digital, significantly reducing the rate of costly A/D conversions. As shown in Figure 2(a), each low-bitwidth MACC unit is equipped with its own pair of local capacitors, which perform the accumulation over time across multiple sub-vectors. As will be discussed in Section 4, the pair is used to handle positive and negative values by accumulating them separately on one or the other capacitor. After a pre-determined number of private accumulations in the analog domain, the partial results need to be accumulated across the low-bitwidth MACCs. In that cycle, the transmission gates between the capacitors (Figure 2(a)) connect them and a simple charge sharing between the capacitors yields the accumulated result for the . That is when a single A/D conversion is performed, the cost of which is not only amortized across the parallel MACC units but also over time across multiple sub-vectors.
3.2 Mixed-Signal Wide Aggregator
Mixed-Signal Bit-Partitioned MACC units only process low-bitwidth operands; on their own, they cannot combine these operations to enable higher-bitwidth dot-products. A collection of them can provide this capability, as discussed with Figure 1 in Section 2. This structure is named the Mixed-Signal Wide Aggregator. Figure 2(b) depicts a possible design, comprising the 16 bit-partitioned MACC units that are necessary to perform an 8-bit by 8-bit vector dot-product with 2-bit partitioning. In this case, the number 16 comes from the fact that each of the two 8-bit operands can be partitioned into four 2-bit values. Each of the four 2-bit partitions of the multiplicand needs to be multiply-accumulated with all four 2-bit partitions of the multiplier. As discussed in Section 2, each wide aggregator also performs the necessary shift operations to combine the low-bitwidth results from its 16 units. By aggregating the partial results of each unit, the wide aggregator generates a scalar output, which is stored in its output register. As illustrated in Figure 2, a collection of these wide aggregators constitutes an accelerator core, from which the clustered architecture is designed.
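As a small illustration of where the count of 16 comes from, the sketch below enumerates one low-bitwidth unit per pair of partition positions together with the shift its result would receive. The helper name `aggregator_layout` is hypothetical, and operands are assumed unsigned.

```python
def aggregator_layout(operand_bits=8, part_bits=2):
    """List (input partition index, weight partition index, shift) for each unit."""
    if operand_bits % part_bits:
        raise ValueError("operand_bits must be a multiple of part_bits")
    parts = operand_bits // part_bits
    return [(j, k, part_bits * (j + k)) for j in range(parts) for k in range(parts)]


layout = aggregator_layout()
print(len(layout))                            # 16 units for 8-bit operands, 2-bit partitions
print(max(shift for _, _, shift in layout))   # 12, the largest shift (MSB partition pair)
```

Halving the partition size quadruples the number of units in this model, which is the tradeoff the bit-partition size controls.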
3.3 Hierarchically Clustered Architecture
As discussed in Section 4, the proposed design consumes less energy for a single 8-bit MACC than digital logic (1 pJ, taken from the simulator, which is commensurate with other reports [47, 48]). As such, it is possible to integrate a larger number of mixed-signal compute units on a chip with a given power budget compared to a digital architecture. To efficiently utilize the larger number of available compute units, a high-bandwidth memory substrate is required. Moreover, one of the large sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. To maximize the benefits of the mixed-signal computation, 3D-stacked memory is an attractive option since it reduces the cost of data accesses and provides a higher bandwidth for data transfer between the on-chip compute and off-chip memory [12, 25]. Based on these insights, we devise a clustered architecture with a 3D-stacked memory substrate, as shown in Figure 2(c). The mixed-signal logic die is stacked over the DRAM dies with multiple vaults, each of which is connected to the logic die with several through-silicon vias (TSVs). The 3D memory substrate is modeled using Micron’s Hybrid Memory Cube (HMC) [49, 50], which has been shown to be a promising technology for DNN acceleration. As the results in Figure 15 (Section 8.2) show, a flat systolic design would result in significant underutilization of the compute resources and of the bandwidth from 3D stacking.
Therefore, is a hierarchically clustered architecture that allocates multiple accelerator cores as a cluster to each vault. Figure 2(b) depicts a single core. As shown in Figure 2(b), each core is self-sufficient and packs a mixed-signal systolic array of as well as the digital units that perform pooling, activation, normalization, etc. The mixed-signal array is responsible for the convolutional and fully connected layers. Generally, wide and interleaved bit-partitioned execution within is orthogonal to the organization of the accelerator architecture. This paper explores how to embed them and the proposed compute model, within a systolic design and enables end-to-end programmable mixed-signal acceleration for a variety of DNNs.
Accelerator core. As Figure 2(b) depicts, the first level of hierarchy is the accelerator core and its 2D systolic array of wide aggregators. As depicted, the Input Buffers and Output Buffers are shared across the columns and rows, respectively. Each wide aggregator has its own Weight Buffer. This organization is commensurate with other designs and reduces the cost of on-chip data accesses, as inputs are reused with multiple filters. What makes our design different is the fact that each buffer needs to supply a sub-vector, not a scalar, to the wide aggregator in each cycle. The wide aggregator, however, generates only a scalar, since a dot-product generates a scalar output. The rewiring of the inputs and weights is already done inside the wide aggregator since the size of the bit-partitions is fixed. As such, there is no need to reformat any of the inputs, activations, or weights. As the outputs of the wide aggregators flow down the columns, they get accumulated to generate the output activations that are fed to each column's dedicated Normalization/Activation/Pooling Units. To preserve the accuracy of the DNN model, the intermediate results are stored as 32-bit digital values and intra-column aggregations are performed in the digital mode.
On-chip data delivery for accelerator cores. To minimize data movement energy and maximally exploit the large degrees of data-reuse offered by DNNs, uses a statically-scheduled bus that is capable of multicasting/broadcasting data across accelerator cores. Compared to complex interconnections, the choice of statically-scheduled bus significantly simplifies the hardware by alleviating the need for complicated arbitration logic and FIFOs/buffers required for dynamic routing. Moreover, the static schedule enables the compiler stack to cut the DNN layers across cores while maximizing inter- and intra-core data-reuse. The static schedule is encoded in the form of data communication instructions (Section 7) that are responsible for (1) fetching data tiles from the 3D-stacked memory and distributing them across cores or (2) writing output tiles back from the cores to the memory.
Parallelizing computations across accelerator cores. Data-movement energy is a significant portion of the overall energy consumption both for digital designs [12, 23, 24, 28, 30, 51] and analog designs [33, 35]. As such, the clustered architecture (1) divides the computations into tiles that fit within the limited on-chip capacity of the scratchpads that are private for each accelerator core, and (2) cuts the tiles of computations across cores to minimize DRAM accesses by maximally utilizing the multicast/broadcasting capabilities of on-chip data delivery network. To simplify the design of the accelerator cores, the scratchpad buffers are private to each core and the shared data is replicated across multiple cores. Thus, a single tile of data can be read once from the 3D-stacked memory and then be broadcasted/multicasted across cores to reduce DRAM accesses. The cores use double-buffering to hide the latency for memory accesses for subsequent tiles. The accelerator cores use output-stationary dataflow that minimizes the number of ADC conversions by accumulating results in the charge-domain. Section 6 discusses the compiler stack that optimizes the cuts and tile sizes for individual DNN layers.
4 Switched-Capacitor Circuit Design for Bit-Partitioning
exploits switched-capacitor circuitry [36, 34, 43, 42, 41] for by implementing MACC operations in the charge-domain rather than using resistive-ladders to compute in current domain [32, 40, 44]. Compared to the current-domain approach, switched-capacitors (1) enable result accumulation in the analog domain by storing them as electric charge, eliminating the need for A/D conversion at every cycle, and (2) make multiplications depend only on the ratio of the capacitor sizes rather than their absolute capacitances. The second property enables reduction of capacitor sizes, improving the energy and area of MACC units as well as making them more resilient to process variation. The following discusses the details of the circuitry.
4.1 Low-Bitwidth Switched-Capacitor MACC
Figure 3 depicts the design of a single 3-bit sign-magnitude MACC. Its two operands are the input and weight bit-partitions. The result of each MACC operation is retained as electric charge in the accumulating capacitor. In addition to the accumulating capacitor, the MACC unit contains two capacitive Digital-to-Analog Converters (DACs), one for inputs and one for weights. The two DACs convert the 2-bit magnitudes of the input and weight to the analog domain as electric charges proportional to the respective magnitudes. Each DAC is composed of two capacitors, which operate in parallel and are combined to convert the operand to the analog domain. Each of these capacitors is controlled by a pair of transmission gates, which determine whether the capacitor is active or inactive. Another set of transmission gates connects the two DACs and shares charge when the input and weight partitions are multiplied. The resulting shared charge is stored on one of the two accumulating capacitors, depending on the “sign” control signal. During multiplication, the transmission gates are coordinated by a pair of complementary non-overlapping clock signals.
Charge-domain MACC. Figure 4 shows the phase-by-phase process of a MACC and its corresponding active circuits, the phases of which are described below.
: The first phase (Figure 4(a)) consists of the input capacitive DAC converting digital input () to a charge proportional to the magnitude of the input . As a result, the sampled charge () in in the first phase is equal to:
: In the second phase (Figure 4(b)), the multiplication happens via a charge-sharing process between and . converts the to the charge domain. At the same time, the redistributes its sampled charge () over all of its capacitors () as well as the equivalent capacitor of . The voltage () at the junction of and is as follows:
Because the sampled charge is shared with the weight capacitors, the stored charge () on is equal to:
Equation 3 shows that the stored charge on is proportional to , but includes a non-linearity due to the term in the denominator. To suppress this non-linearity, and must be chosen such that . Although this design choice does not completely suppress this non-linearity, it can be mitigated as discussed in Section 5. With this choice, becomes .
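The non-linearity discussed above can be illustrated with a generic, idealized charge-sharing model. This is a sketch under stated assumptions, not the paper's exact equations: the input DAC is assumed to sample a charge of `vdd * x * c_x`, to have `dac_units = 3` unit capacitors in total (a binary-weighted 2-bit DAC), and to share its charge with an active weight capacitance of `w * c_w`. All function and parameter names are hypothetical.

```python
def shared_charge(x, w, c_x, c_w, vdd=1.0, dac_units=3):
    """Charge left on the weight DAC after charge sharing (assumed ideal model)."""
    q_sampled = vdd * x * c_x                               # phase 1: sample the input
    v_junction = q_sampled / (dac_units * c_x + w * c_w)    # phase 2: share the charge
    return v_junction * w * c_w


def ideal_charge(x, w, c_w, vdd=1.0, dac_units=3):
    """Perfectly linear product, i.e. the limit where c_x is much larger than c_w."""
    return vdd * x * w * c_w / dac_units


def worst_relative_error(c_x, c_w):
    """Largest relative deviation from the linear product over non-zero 2-bit operands."""
    worst = 0.0
    for x in range(1, 4):
        for w in range(1, 4):
            ideal = ideal_charge(x, w, c_w)
            worst = max(worst, abs(shared_charge(x, w, c_x, c_w) - ideal) / ideal)
    return worst


for ratio in (1.0, 10.0, 100.0):
    print(ratio, worst_relative_error(c_x=ratio, c_w=1.0))
```

In this model the deviation depends only on the ratio of the capacitor sizes and shrinks as the input capacitance grows relative to the weight capacitance, though it never vanishes entirely.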
In the last phase (Figure 4(c)), the charge from multiplication is shared with the accumulating capacitor for accumulation. The sign bits of the two operands determine which of the two accumulating capacitors is selected for accumulation. The charge sampled by the weight DAC is then redistributed over the selected accumulating capacitor as well as all the capacitors of the weight DAC. Theoretically, the accumulating capacitor must be infinitely larger than the weight capacitors to completely absorb the charge from multiplication. However, in reality, some charge remains unabsorbed, leading to a pattern of computational error, which is mitigated as discussed in Section 5. Ideally, the voltage on the accumulating capacitor is:
While the charge sharing and accumulation happens on , a new input is fed into , starting a new MACC process in a pipelined fashion. This process repeats for all low-bitwidth MACC units over multiple cycles before one A/D conversion.
4.2 Wide Mixed-Signal Bit-Partitioned MACC
Figure 5(a) depicts an array of switched-capacitor MACCs, constituting the unit, which perform operations for cycles in the analog domain and store the results locally on their . Figure 5(b) depicts the control signals and cycles of operations. For the microarchitecture, and are selected to and based on design space exploration (see Figure 14). Over cycles, the results of low-bitwidth MACC operations get accumulated in , private to each MACC unit. In cycle , the private results get aggregated across all the MACC units within the . The single A/D converter in the is responsible for converting the aggregated result, which also starts at cycle .
In the first phase of cycle , all the accumulating capacitors which store the positive values () are connected together through a set of transmission gates to share their charge. Simultaneously, the same process happens for the . in Figure 5 is the control signal which connects the . The accumulating capacitors (), are also connected to a Successive Approximation Register (SAR) ADC and share their stored charge with the Sample and Hold block (S&H) of the ADC. This (S&H) block has differential inputs which samples the positive and negative results separately, subtracts them and holds them for the process of A/D conversion. In the second phase of the cycle , connects all the to ground to clear them for the next iteration of wide, bit-interleaved calculations.
There is a tradeoff between resolution and sampling rate of ADC, which also defines its topology. SAR ADC is a better choice when it comes to medium resolution (8-12 bits) and sampling rate (1-500 Mega-Samples/sec). We choose a 10-bit, 15 Mega-Samples/sec SAR ADC as it strikes the better balance between speed and resolution for . The design space exploration in Figure 14 shows that this choice makes the grouping of 8 low-bitwidth MACCs optimal for cycles of operation. The process of A/D conversion takes cycles, pipelined with the sub-vector dot-product. Table 1 shows the energy breakdown within a that uses 2-bit partitioning. As shown, performing an 8-bit MACC using the interleaved bit-partitioned arithmetic requires less energy than a digital MACC which consumes around 1 pJ .
| Units | Energy (fJ) |
|---|---|
| 1 MACC | 5.1 |
| 256 MACCs | 1,305.6 |
| SAR ADC (for 256 MACCs) | 1,660.0 |
| Total Energy | 2,965.6 |
| Total Energy per 2b-2b MACC | 11.6 |
| Total Energy per 8b-8b MACC | 185.3 |
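The derived rows of Table 1 can be recomputed from the per-MACC and ADC rows. The sketch below is plain arithmetic on the listed values; the variable names are illustrative, and the count of 16 low-bitwidth operations per 8-bit MACC follows from partitioning each 8-bit operand into four 2-bit values.

```python
MACC_FJ = 5.1                  # energy of one 2b-2b MACC (Table 1)
MACCS_PER_CONVERSION = 256     # MACCs that share one A/D conversion (Table 1)
ADC_FJ = 1660.0                # SAR ADC energy for those 256 MACCs (Table 1)
DIGITAL_8B_MACC_FJ = 1000.0    # roughly 1 pJ for a digital 8-bit MACC (from the text)

macc_total_fj = MACC_FJ * MACCS_PER_CONVERSION
total_fj = macc_total_fj + ADC_FJ
per_2b_macc_fj = total_fj / MACCS_PER_CONVERSION     # ADC cost amortized per MACC
units_per_8b_macc = (8 // 2) ** 2                    # four partitions per operand
per_8b_macc_fj = per_2b_macc_fj * units_per_8b_macc

print(round(macc_total_fj, 1), round(total_fj, 1))
print(round(per_2b_macc_fj, 2), round(per_8b_macc_fj, 2))
print(round(DIGITAL_8B_MACC_FJ / per_8b_macc_fj, 2))  # ratio to the digital MACC
```

The amortized ADC energy is larger than the MACC energy itself in this breakdown, which is why sharing one conversion across many MACCs and cycles matters.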
5 Mixed-Signal Non-Idealities and Their Mitigation
Although analog circuitry offers a significant reduction in energy, it might lead to accuracy degradation.
Thus, their error needs to be properly modeled and accounted for.
Specifically, , the main analog component, can be susceptible to (1) thermal noise, (2) computational error caused by incomplete charge transfer, and (3) PVT variations.
Traditionally, analog circuit designers mitigate sources of error by just configuring hardware parameters to values which are robust to non-idealities.
Such hardware parameter adjustments require rather significant energy/area overheads that scale linearly with number of modules.
The overheads are acceptable in conventional analog designs since modules are few in numbers.
However, due to the repetitive and scaled-up nature of our design, we need to mitigate these non-idealities at a higher, algorithmic level.
We leverage the training algorithm’s inherent mechanism to reduce error (loss) and use mathematical models to represent these non-idealities.
We then apply these models during the forward pass to adjust and fine-tune pre-trained neural models with just a few more epochs across the chips within a technology node. The rest of this section details the non-idealities and their modeling. It then elaborates on how PVT variations are considered in the formulations.
5.1 Thermal Noise
Thermal noise is an inherent perturbation in analog circuits caused by the thermal agitation of electrons, which distorts the main signal. This noise can be modeled with a normal distribution, where the deviation of the ideal voltage is determined by the working temperature (T), the Boltzmann constant (k), and the capacitor size (C). The switched-capacitor MACC units are mainly affected by the combined thermal noise resulting from the weight and accumulator capacitors. The noise from these capacitors gets accumulated during the cycles of computation for each individual MACC unit and then gets aggregated across the MACC units. By applying the thermal noise equation used for similar MACC units, the standard deviation at the output is described by Equation 5:
In the above equation, is equal to . We apply the effect of thermal noise in the forward propagation of DNN by adding an error tensor to the output of convolutional and fully connected layers. Having computed the standard deviation of noise for a single (), each element of the error tensor is sampled from a normal distribution as follows:
In the above equation, is scaled by which is the amount of operations required to generate one element in the output feature map as well as the amount of total bit-shifts applied to each result by unit, .
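A minimal sketch of the noise-injection step follows. It uses the standard kT/C expression for the RMS thermal noise voltage on a capacitor; the scaling from that voltage to a layer's output units is left as the hypothetical parameter `sigma_out`, since it depends on the equations above. Function names are illustrative.

```python
import math
import random

BOLTZMANN = 1.380649e-23  # J/K


def ktc_sigma(temperature_k, capacitance_f):
    """RMS thermal noise voltage (in volts) sampled onto a capacitor: sqrt(kT/C)."""
    return math.sqrt(BOLTZMANN * temperature_k / capacitance_f)


def add_noise(outputs, sigma_out, rng):
    """Add an independent zero-mean Gaussian error to each layer output."""
    return [y + rng.gauss(0.0, sigma_out) for y in outputs]


rng = random.Random(0)                     # seeded for repeatability
clean = [0.5, -1.25, 3.0]                  # stand-in for one layer's outputs
noisy = add_noise(clean, sigma_out=0.01, rng=rng)
print(ktc_sigma(300.0, 1e-15))             # about 2 mV for a 1 fF capacitor at 300 K
print(noisy)
```

In a fine-tuning loop, a fresh error tensor of this kind would be drawn on every forward pass so that the weights adapt to the noise statistics rather than to one noise sample.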
5.2 Computational Error
Another source of error in ’s charge-domain computations arises when charge is shared between capacitors during the multiplication and accumulation. Within each MACC unit, the input capacitors () transfer a sampled charge to the weight capacitors () to produce charge proportional to the multiplication result. But the resulting charge is subject to error dependent on the ratio of weight and input capacitor sizes () as shown in Equation 3. This shared charge in the weight capacitors introduces more error when it is redistributed to the accumulating capacitor () which cannot absorb all of the charge, leaving a small portion remaining on the weight capacitors in subsequent cycles. The ideal voltage () produced after cycles of multiplication can be derived from Equation 4 as follows:
By considering the computational error from incomplete charge sharing, the actual voltage at the accumulating capacitor after cycles of MACC operations () becomes:
Computational error is accounted for in the fine-tuning pass by including the multiplicative factors shown in Equation 8 in weights. During the forward pass, the fine-tuning algorithm decomposes weight tensors in convolutional and fully-connected layers into groups corresponding to configuration and updates the individual weight values () to new values () with the computational error in Equation 9:
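The idea of folding incomplete charge transfer into the weights can be sketched with a generic model. It assumes, for illustration only, that in every cycle the accumulating capacitor `c_acc` shares charge with a weight DAC of total capacitance `c_w_total`, so that a fixed fraction of the combined charge is retained. The function names and the model itself are assumptions, not the paper's Equations 8 and 9.

```python
def accumulate_with_sharing(products, c_acc, c_w_total):
    """Charge on the accumulator after sharing each product's charge in turn."""
    retained = c_acc / (c_acc + c_w_total)   # fraction kept by the accumulator
    charge = 0.0
    for p in products:
        charge = retained * (charge + p)     # share, then keep the accumulator's part
    return charge


def attenuation_factors(n, c_acc, c_w_total):
    """Multiplicative factor seen by the product of cycle i (0-based) after n cycles."""
    retained = c_acc / (c_acc + c_w_total)
    return [retained ** (n - i) for i in range(n)]


def error_aware_dot(xs, ws, c_acc, c_w_total):
    """Dot product with the attenuation folded into the weights beforehand."""
    factors = attenuation_factors(len(ws), c_acc, c_w_total)
    adjusted = [w * f for w, f in zip(ws, factors)]
    return sum(x * w for x, w in zip(xs, adjusted))


xs, ws = [1, 2, 3], [3, 1, 2]
products = [x * w for x, w in zip(xs, ws)]
print(accumulate_with_sharing(products, c_acc=99.0, c_w_total=1.0))
print(error_aware_dot(xs, ws, c_acc=99.0, c_w_total=1.0))
```

Both prints agree up to floating-point rounding, which is why a forward pass with adjusted weights can stand in for the circuit's behavior during fine-tuning under this model.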
5.3 Process-Voltage-Temperature Variations
Process variations. Switched-capacitor circuits are generally robust to process variations, and we size the capacitors to provision for and mitigate the variations that remain. The robustness and the mitigation are effective because the capacitors are implemented as a number of smaller unit capacitors laid out with the common-centroid technique . Specifically, we use metal-fringe capacitors for the MACCs, with a mismatch of just 1% standard deviation and a maximum variation of 6% (), which is well below the error margins considered for the computational correctness of .
Temperature variations. We model the temperature variations by adding a perturbation term to in Equation 5. We take 358 K as the maximum temperature, which is consistent with existing practice , and 300 K as the minimum (this is the peak-to-peak range of the Gaussian distribution ()).
Voltage variations. We also model voltage variations by adding a Gaussian-distributed perturbation to term in Equation 9. Our experiments show that voltage variations of up to 20% can be mitigated. Because DNNs perform a very large number of vector dot-product operations, the minimum and maximum values of the distributions are sampled a sufficient number of times, which leads to coverage of the corner cases.
Atop all these considerations, we use differential signaling for ADCs which attenuates the common-mode fluctuations such as PVT variations. To show the effectiveness of our techniques, Figure 6 plots the result of fine-tuning process of two benchmarks, ResNet-50 and VGG-16 for ten epochs. Table 4 reports the summary of accuracy trends for all the benchmarks, which achieve less than 0.5% loss. As Figure 6 shows, the fine-tuning pass compensates the initial loss (0.73% for top-1 and 2.41% for top-5) to only 0.04% for top-1 and 0.02% for top-5. VGG-16 is slightly different and reduces the initial loss (1.16% for top-1 and 2.24% for top-5) to less than 0.18% for top-1 and 0.13% for top-5 validation accuracy. The trends are similar for other benchmarks and omitted due to space constraints.
6 Compiler Stack
DNN specification file. The high-level specification provided in the Caffe2 file is translated to a layer DataFlow Graph (DFG) that preserves the structure of the network. The DFG goes through an algorithm that cuts the DFG and tiles the data to map the DNN computations to the accelerator clusters and cores. The tiling also aims to minimize the transfer of model parameters from the 3D-stacked DRAM to the limited on-chip scratchpads on the logic die, while maximizing the utilization of the compute resources. In addition to the DFG, the cutting/tiling algorithm takes in the architectural specification of the . These specifications include the organizations and configurations (# rows, # columns) of the clusters, vaults, and cores as well as details of the . To identify the best cuts and tilings, the cutting/tiling algorithm exhaustively searches the space of possibilities, which is enabled through an estimation tool. The tool estimates the total energy consumption and runtime for each cut/tile pair, which represents the data movement and resource utilization in . Estimation is viable because the DFG does not change, there is no hardware-managed cache, and the accelerator architecture is fixed during execution. Thus, there are no irregularities that can hinder estimation. Algorithm 1 depicts the cutting/tiling procedure. Once the cuts and tiles are determined, the compiler generates the binary code that contains the communication and computation instruction blocks. Consistent with state-of-the-art accelerators [12, 28, 23, 25, 18], all the instructions are statically scheduled. We extend the static scheduling to cluster coordination, data communication, and transfer.
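As a rough illustration of the exhaustive search, the sketch below enumerates every cut/tile pair and keeps the one with the lowest estimated cost. It is not Algorithm 1: the estimator `toy_estimate`, its constants, and the energy-delay-product objective are assumptions made purely for the example, since the text does not state how energy and runtime are combined.

```python
from itertools import product
from math import ceil


def search_cuts_and_tiles(cuts, tiles, estimate,
                          objective=lambda energy, runtime: energy * runtime):
    """Exhaustively score every (cut, tile) pair and return the best one.

    estimate(cut, tile) returns (energy, runtime), or None if the pair is
    infeasible. The default objective (energy-delay product) is an assumption.
    """
    best = None
    for cut, tile in product(cuts, tiles):
        result = estimate(cut, tile)
        if result is None:
            continue  # e.g. the tile does not fit in the scratchpad
        score = objective(*result)
        if best is None or score < best[0]:
            best = (score, cut, tile)
    return best


def toy_estimate(cut, tile, data_kb=1024, scratchpad_kb=256):
    """Hypothetical cost model: fewer, larger transfers cost less energy,
    and spreading work over more clusters shortens runtime."""
    if tile > scratchpad_kb:
        return None
    transfers = ceil(data_kb / tile)
    energy = transfers * 10 + cut * 5
    runtime = data_kb / cut + transfers * 2
    return energy, runtime


best = search_cuts_and_tiles(cuts=[1, 2, 4], tiles=[64, 128, 256, 512],
                             estimate=toy_estimate)
```

Exhaustive enumeration is only practical because each estimate is cheap and deterministic, which matches the observation that the DFG, the memory hierarchy, and the architecture are all fixed at compile time.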
7 Instruction Set
The ISA exposes the following unique properties of its architecture to the software: (1) efficient mixed-signal execution using bit-partitioned and capacitive accumulation, and (2) a clustered architecture that takes advantage of the power efficiency of mixed-signal acceleration to scale up the number of in . As such, uses a block-structured ISA that segregates the execution of the DNN into (1) data communication instruction blocks that access tiles of data from the 3D-stacked memory and populate the on-chip scratchpads (Input Buffer/Weight Buffer/Output Buffer in Figure 2), and (2) compute instruction blocks, each of which consumes the tile of data produced by a corresponding communication instruction block and produces an output tile. The compiler stack statically assigns communication and compute instruction blocks to accelerator clusters, shifting the complexity from hardware to the compiler. By splitting the data transfer and on-chip data processing into separate instructions, the ISA enables software pipelining between clusters and allows the memory accesses to run ahead and fetch data for the next tile while processing the current tile.
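The benefit of letting memory accesses run ahead can be seen with a simple timing model. The sketch below is a simplification under stated assumptions: it treats execution as a lockstep two-stage pipeline in which the communication block for the next tile overlaps with the compute block for the current tile, and the per-tile latencies are made-up numbers.

```python
def serial_time(comm, comp):
    """Total time when every tile is fetched and then processed in turn."""
    return sum(comm) + sum(comp)


def pipelined_time(comm, comp):
    """Total time when the fetch of tile i+1 overlaps the compute of tile i.

    comm[i] and comp[i] are the latencies of the communication and compute
    instruction blocks of tile i (hypothetical units).
    """
    if len(comm) != len(comp) or not comm:
        raise ValueError("need one communication and one compute block per tile")
    total = comm[0]  # the first tile cannot be overlapped with anything
    for i, compute_latency in enumerate(comp):
        next_fetch = comm[i + 1] if i + 1 < len(comm) else 0
        # Each step ends when both the compute and the next fetch are done.
        total += max(compute_latency, next_fetch)
    return total


comm = [4, 4, 4]
comp = [6, 6, 6]
print(serial_time(comm, comp))     # 30
print(pipelined_time(comm, comp))  # 22
```

In this example the fetches after the first are fully hidden behind computation, so only the first communication block contributes to the total.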
Compute instruction block. A block of compute instructions expresses the entire computation to produce a single tile in an accelerator core. Further, the compute block governs how the input data for a DNN layer is bit-partitioned and distributed across wide aggregators within a single core. As such, the compiler has complete control over the read/write accesses to on-chip scratchpads, A/D and D/A conversion, and execution using the and digital blocks in an accelerator core. The granularity of bit-partitioning and charge-based accumulation is determined for each microarchitectural implementation based on the technology node and circuit design paradigm. As such, to support different technology nodes and design styles and allow extensions to the architecture, the ISA encodes the bit-partitioning and accumulation cycles. However, we need to explore the design space to find the optimal design choice for each combination of technology node and circuits (Section 8).
Communication instruction block. The key challenge when scaling up the design is to minimize data-movement while parallelizing the execution of the DNN across the on-chip compute resources. To simplify the hardware, instruction set captures the static schedule of data movement as a series of communication instruction blocks. Static scheduling is possible as the topology of the DNN does not change during inference and the order of layers and neurons is known statically. The compiler stack assigns the communication blocks to the cores according to the order of the layers. This static ordering enables to use a simple statically scheduled bus instead of a more complex interconnection.
To maximize energy efficiency, it is imperative to exploit the high degree of data-reuse offered by DNNs. To exploit data-reuse when parallelizing computations across cores of the architecture, the communication instructions support broadcasting/multicasting to distribute the same data across multiple cores, minimizing off-chip memory accesses. Once a communication block writes a tile of data to the on-chip scratchpads, it can be reused over multiple compute blocks to exploit temporal data locality within a single accelerator core.
|DNN Model||Type||Domain||Dataset||Multiply-Add Operations||Model Size|
|AlexNet ||CNN||Image Classification||ImageNet ||2,678 MOps||56.1 MBytes|
|CIFAR-10 [58, 59]||CNN||Image Classification||CIFAR-10 ||617 MOps||13.4 MBytes|
|GoogLeNet ||CNN||Image Classification||ImageNet||1,502 MOps||13.5 MBytes|
|ResNet-18 ||CNN||Image Classification||ImageNet||4,269 MOps||11.1 MBytes|
|ResNet-50 ||CNN||Image Classification||ImageNet||8,030 MOps||24.4 MBytes|
|VGG-16 ||CNN||Object Recognition||ImageNet||31 GOps||131.6 MBytes|
|VGG-19 ||CNN||Object Recognition||ImageNet||39 GOps||137.3 MBytes|
|YOLOv3 ||CNN||Object Recognition||ImageNet||19 GOps||39.8 MBytes|
|PTB-RNN ||RNN||Language Modeling||Penn TreeBank ||17 MOps||16 MBytes|
|PTB-LSTM ||RNN||Language Modeling||Penn TreeBank||13 MOps||12.3 MBytes|
|Chip||Chip||RTX 2080 TI||Titan Xp|
|On-chip Memory||9216 KB||3698 KB||Memory||11 GB (GDDR6)||12 GB (GDDR5X)|
|Chip Area (mm²)||122.3||56||Chip Area (mm²)||754||471|
|Total Dissipation Power||250 W||250 W|
|Frequency||500 MHz||500 MHz||Frequency||1545 MHz||1531 MHz|
|Technology||45 nm||45 nm||Technology||12 nm||16 nm|
Benchmarks. We use ten diverse CNN and RNN models, described in Table 2, to evaluate . These models perform image classification, real-time object detection (YOLOv3), and character-level (PTB-RNN) and word-level (PTB-LSTM) language modeling. This set of benchmarks includes medium- to large-scale models (from 11.1 MBytes to 137.3 MBytes) and a wide range of multiply-add operation counts (from 13 million to 39 billion).
Simulation infrastructure. We develop a cycle-accurate simulator and a compiler for that takes in a Caffe2 specification of the DNN, finds the optimum tiling and cutting for each layer, and maps it to architecture. The simulator executes each optimized network using the architecture model and reports the total runtime and energy.
comparison. We compare with , a state-of-the-art fully-digital 3D-stacked dataflow accelerator. We match the on-chip power dissipation of and and compare the total runtime and energy, including energy for DRAM accesses. We also perform an iso-area comparison and scale up original with 16 vaults to 36 vaults to match its area to ’s. The baseline supports 16-bit execution while supports 8-bit. For fairness, we modify the open-source simulator  and proportionally scale its runtime and energy. supports 8-bit operands since this representation has virtually no impact by itself on the final accuracy of the DNNs [59, 66, 67, 68, 69].
GPU comparison. We also compare to two Nvidia GPUs (i.e., RTX 2080 TI and Titan Xp) based on the Turing and Pascal architectures, respectively, listed in Table 3. RTX 2080 TI’s Turing architecture provides tensor cores, which are specialized hardware for deep learning inference. We use 8-bit execution on GPUs using Nvidia’s own TensorRT 5.1 library compiled with the optimized cuDNN 7.5 and CUDA 10.1. For each DNN benchmark, we perform 1,000 warmup iterations and report the average runtime across 10,000 iterations.
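The measurement protocol (warm up first, then average over many timed iterations) can be sketched as a small harness. This is a generic illustration, not the scripts used for the reported numbers: `run_once` is a hypothetical callable standing in for one inference, and a real GPU measurement would also need to synchronize the device before each clock reading, which is omitted here.

```python
import time


def average_runtime(run_once, warmup_iters=1000, timed_iters=10000,
                    clock=time.perf_counter):
    """Return the mean runtime of run_once() in seconds.

    Warmup iterations are executed but not timed, so one-off costs such as
    lazy initialization do not skew the average.
    """
    for _ in range(warmup_iters):
        run_once()
    start = clock()
    for _ in range(timed_iters):
        run_once()
    return (clock() - start) / timed_iters


def dummy_inference():
    # Placeholder workload standing in for one DNN inference.
    return sum(i * i for i in range(100))


mean_seconds = average_runtime(dummy_inference, warmup_iters=10, timed_iters=100)
```

Timing the whole loop once, rather than each iteration separately, keeps the overhead of reading the clock out of the reported average.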
Comparison with other recent accelerators. We also compare to Google TPU , mixed-signal CMOS RedEye , and two analog memristive accelerators. All the comparisons are in 8-bits. The original designs [32, 71] use 16-bits. Scaling from 16-bit to 8-bit execution for memristive designs would optimistically provide a increase in efficiency.
Energy and area measurement. All hardware modeling is performed using the FreePDK 45-nm standard cell library . We implement the switched-capacitor MACCs in Cadence Analog Design Environment V6.1.3 and use Spectre SPICE V6.1.3 to model the system. We then use Layout XL of Cadence to lay out the MACC units and extract the energy/area. The ADC’s energy and area are obtained from . Based on the configuration, we use the ADC architecture from .
We implement all digital blocks of , including adders, shifters, interconnection, and accumulators, in Verilog RTL and use Synopsys Design Compiler (L-2016.03-SP5) to synthesize them and measure their energy and area. For on-chip SRAM buffers, we use CACTI-P to measure the energy and area of the memory blocks. The 3D-stacked DRAM architecture is based on the HMC stack [49, 50], the same as , and the bandwidth and access energy are adopted from that work.
Error modeling. For error modeling, we use Spectre SPICE V6.1.3 to extract the noise behavior of MACCs via circuit simulations. Thermal noise, computational error, and PVT variations are considered based on the details in Section 5. We implement the extracted hardware error models and the corresponding mathematical models using PyTorch v1.0.1 and integrate them into the Neural Network Distiller v0.3 framework for a fine-tuning pass over the evaluated benchmarks.
8.2 Experimental Results
8.2.1 Comparison with
Iso-power performance and energy comparison. Figure 8 shows the performance and energy reduction of over under the same on-chip power budget. On average, delivers a speedup over . This significant improvement is attributed to the use of wide mixed-signal in as opposed to PEs in . The wide bit-partitioned mixed-signal design of in enables us to fit 5× more compute units within the same power budget as . The highest speedup is observed in YOLOv3 and PTB-RNN, whose network configurations favor the wide vectorized execution in by better utilizing compute resources. The lowest speedup is observed in ResNet-18, since its relatively small size leads to under-utilization of compute resources in .
Figure 8 demonstrates the total energy reduction for across the evaluated benchmarks as compared to . On average, yields energy reduction over , including energy for DRAM accesses, while consuming the same on-chip power as . CIFAR-10 enjoys the highest energy reduction, since is able to take advantage of CIFAR-10’s smaller memory footprint to maximize on-chip data reuse and reduce DRAM accesses. The lowest energy reduction is observed in RNN benchmarks, PTB-RNN and PTB-LSTM since the matrix-vector operations in these benchmarks require a significant number of memory accesses, diminishing the benefits from mixed-signal computations.
shows the energy breakdown normalized to . Energy breakdown is reported across four major architectural components: (1) on-chip compute units, (2) on-chip memory (buffers and register file), (3) interconnect, and (4) 3D-stacked DRAM. DRAM accesses account for the highest portion of the energy in , since significantly reduces the on-chip compute energy. While has a significantly larger number of compute resources compared to , the number of DRAM accesses remains almost the same. This is because the statically-scheduled bus allows data to be multicast/broadcast across multiple cores in without significantly increasing the number of DRAM accesses. Furthermore, the statically-scheduled bus offers the compiler stack the freedom to optimize the partitioning of computations across cores. Most layers in the benchmarks benefit from partitioning the different inputs in a single batch (batch size is 16) across cores and broadcasting weights, which is not explored in . As a result, these networks have fewer DRAM accesses. The breakdown of energy consumption varies with the type of computations required by the DNN as well as the degree of data reuse. Benchmarks PTB-RNN and PTB-LSTM are recurrent neural networks that perform large matrix-vector operations and require significant DRAM accesses for weights. Therefore, PTB-RNN and PTB-LSTM use more energy for DRAM accesses compared to other benchmarks.
Unlike the fully-digital PEs in that perform a single operation in a cycle, uses which perform wide vectorized operations, crucial in to amortize the high cost of ADCs. As shown in Table 1, each MACC operation in consumes 5.4× less energy compared to . The output-stationary dataflow enabled by capacitive accumulation, together with the systolic organization of in each core of , eliminates the need for register files, unlike , and leads to a 4.4× reduction for on-chip data movement on average.
Iso-area comparison with . We compare the total runtime and energy of with a scaled-up version of that matches ’s area. Figure 10 shows the results for the workloads. Scaling up the compute resources in by 2.25× to match the chip area of results in a sub-linear increase in performance by . This improvement in performance comes at the cost of reduced energy efficiency due to an increase in memory accesses to feed the additional compute resources. The trends in speedup and energy reduction remain the same as in the iso-power comparison, with the exception of ResNet-18, which now sees resource underutilization in after scaling up the number of compute resources.
8.2.2 Comparison to GPUs
Figure 11 compares the performance of with Titan Xp and RTX 2080 TI. RTX 2080 TI is based on Nvidia’s latest architecture, Turing. For a fair comparison, we enable vectorized 8-bit operations and optimized GPU compilations. The results are normalized to Titan Xp. , on average, yields a 70% speedup over the Titan Xp GPU and is 15% slower than RTX 2080 TI. Convolutional networks require a large number of matrix-matrix multiplications that are well-suited for tensor cores, which is why RTX 2080 TI outperforms both and Titan Xp. VGG-16 and VGG-19 see the maximum benefits. However, outperforms the RTX 2080 TI GPU in PTB-RNN and PTB-LSTM by 11.2× and 10.6×, respectively. These RNN networks require matrix-vector multiplications, which are particularly suitable for the wide vectorized operations supported in ’s and are not the best case for tensor cores. In terms of performance-per-Watt, outperforms both Titan Xp and RTX 2080 TI GPUs by a large margin, and , respectively.
8.2.3 Comparison with Other Accelerators
We also compare the power efficiency (GOPS/s/Watt) and area efficiency (GOPS/s/mm²) of with other recent digital and analog accelerators. Due to the lack of available raw performance/energy numbers for specific workloads, we use these metrics, which is consistent with comparisons for recent designs [21, 71, 78]. Figure 12 depicts the peak power and area efficiency results.
On average for the evaluated benchmarks, achieves 72% of its peak efficiency. This information is not available in the publications for the other designs.
Digital systolic: Google TPU . In comparison with TPU, which also uses a systolic design, delivers 4.5× more peak power efficiency and almost the same area efficiency. Leveraging the wide, interleaved, and bit-partitioned arithmetic with its switched-capacitor implementation in architecture, reduces the cost of MACC operations significantly compared with TPU, which uses 8-bit digital logic, leading to a significant improvement in power efficiency.
Mixed-signal CMOS: RedEye . RedEye is an in-sensor CNN accelerator based on mixed-signal CMOS technology which also uses switched-capacitor circuitry for MACC operations. Compared to RedEye, offers 5.5× better power efficiency and 167× better area efficiency. Utilizing the proposed wide, interleaved, and bit-partitioned arithmetic amortizes the cost of the ADC in by reducing its required resolution and sampling rate, leading to a significant reduction in ADC power and area, in contrast to RedEye.
Analog Memristive designs [32, 71]. Prior works, ISAAC and PipeLayer , have explored analog memristive technology for DNN acceleration, which integrates both compute and storage within the same die and offers higher compute density compared to traditional analog CMOS technology. However, this increase in compute density comes at the cost of reduced power efficiency. Generally, memristive designs perform computations in the current domain, requiring the costly ADCs to sample the current-domain signals at the same rate as the compute/storage for memristors. PipeLayer significantly reduces this cost. Overall, compared to ISAAC and PipeLayer, improves the power efficiency by 3.6× and 9.6×, respectively.
8.2.4 Design Space Explorations
Design space exploration for bit-partitioning.
To evaluate the effectiveness of bit-partitioning, we perform a design space exploration with various bit-partitioned options. Figure 13 shows the reduction in energy and area compared to an 8-bit×8-bit design when two vectors with 32 elements undergo a dot-product. The other design points also perform 8-bit×8-bit MACC operations while utilizing our wide and interleaved bit-partitioned arithmetic. As depicted, the design with 2-bit partitioning strikes the best balance in energy and area with the switched-capacitor design of MACC units at the 45 nm CMOS node. The difference between 2-bit and 1-bit is that single-bit partitioning quadratically increases the number of low-bitwidth MACCs from 16 (2-bit partitioning) to 64 (1-bit partitioning) to support 8-bit operations. This imposes a disproportionate overhead that outweighs the benefit of decreasing each MACC unit’s area and energy.
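The quadratic growth in low-bitwidth operations follows directly from the arithmetic, as the following sketch illustrates. It assumes unsigned 8-bit operands for simplicity (sign handling is not covered), and the function names are chosen for this example only. Each operand is split into chunks of `part_bits` bits, every pair of chunk positions yields one wide low-bitwidth dot-product across all vector elements, and the partial results are shifted and summed.

```python
import random


def partition(value, part_bits, total_bits=8):
    """Split an unsigned integer into little-endian chunks of part_bits bits."""
    mask = (1 << part_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, part_bits)]


def bit_partitioned_dot(xs, ws, part_bits, total_bits=8):
    """Dot-product computed from low-bitwidth partial dot-products.

    Returns (result, number_of_low_bitwidth_groups).
    """
    n_parts = total_bits // part_bits
    x_parts = [partition(x, part_bits, total_bits) for x in xs]
    w_parts = [partition(w, part_bits, total_bits) for w in ws]
    total, groups = 0, 0
    for j in range(n_parts):
        for k in range(n_parts):
            # One wide dot-product over all elements at chunk positions (j, k).
            partial = sum(xp[j] * wp[k] for xp, wp in zip(x_parts, w_parts))
            total += partial << (part_bits * (j + k))
            groups += 1
    return total, groups


rng = random.Random(1)
xs = [rng.randrange(256) for _ in range(32)]
ws = [rng.randrange(256) for _ in range(32)]
direct = sum(x * w for x, w in zip(xs, ws))
for bits in (1, 2, 4):
    result, groups = bit_partitioned_dot(xs, ws, bits)
    assert result == direct
    print(bits, groups)  # 1 -> 64 groups, 2 -> 16 groups, 4 -> 4 groups
```

The result is identical for every partition width; only the number of low-bitwidth groups changes, which is the quantity the design space exploration trades off against the cost of each individual unit.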
Design space exploration for configuration.
The number of accumulation cycles () before the A/D conversion and the number of MACC units () are the two main parameters of that define the resolution and sample rate of the ADC, and thereby its power. Figure 14 shows the design space exploration for different configurations of the . Under a fixed power budget of W for compute units, we measure the total runtime and energy of over the evaluated workloads, normalized to . As shown in Figure 14, increasing the number of MACCs limits the number of accumulation cycles, which in turn requires ADCs with high sample rates. High-sample-rate ADCs significantly increase power, making the design less efficient. On the other hand, increasing the number of accumulation cycles limits the number of MACCs, which restricts the number of that can be integrated into the design under the given power budget. Overall, the optimal design point that delivers the best performance and energy uses eight MACC units and 32 accumulation cycles.
Design space exploration for clustered architecture.
uses a hierarchical architecture with multiple cores in each vault. Having a larger number of small cores for each vault yields increased utilization of compute resources, but requires data transfer across cores. We explore the design space with 1, 2, 4, and 8 cores per cluster. As Figure 15 shows, with four cores per vault (default configuration in ) strikes the best balance between speedup and energy reduction. Performance increases as we increase the number of cores per vault from 1 to 8. However, the 8-core configuration results in a higher number of data accesses. Therefore, the 4-core design point provides the optimal balance.
8.2.5 Evaluation of Circuitry Non-Idealities
Table 4 shows the Top-1 accuracy when non-idealities are considered, the accuracy after fine-tuning, the ideal accuracy, and the final loss in accuracy.
|PTB-RNN||Penn TreeBank||1.1 BPC||1.6 BPC||1.1 BPC||0.0 BPC|
|PTB-LSTM||Penn TreeBank||97 PPW||170 PPW||97 PPW||0.0 PPW|
As shown in Table 4, some of the networks, namely AlexNet and ResNet-18, are more sensitive to the non-idealities, leading to a higher initial accuracy degradation. To recover the accuracy loss due to the circuitry non-idealities, we perform a fine-tuning step for a few epochs. With this fine-tuning step, the accuracy loss of the CIFAR-10, ResNet-18, and ResNet-50 networks is fully recovered (the loss is less than 0.04%); among these networks, CIFAR-10 and ResNet-50 are more robust to non-idealities. The accuracy loss for the other networks is below 0.5%, with AlexNet showing the maximum loss. The final two networks, namely PTB-RNN and PTB-LSTM, perform character-level and word-level language modeling, respectively. The accuracy for these two networks is measured in Bits-Per-Character (BPC) and Perplexity-per-Word (PPW), respectively. Both PTB-RNN and PTB-LSTM recover all the loss after fine-tuning. The final results after the fine-tuning step show the effectiveness of this approach in recovering the accuracy loss due to the non-idealities pertinent to analog computation.
9 Related Work
There is a large body of work on digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration has also been explored previously for neural networks [34, 40] and is gaining traction again for deep models [32, 33, 35, 36, 37, 38, 39, 41, 42]. This paper fundamentally differs from these inspiring efforts as it delves into the mathematics of basic operations in DNNs, and reformulates and defines the wide, interleaved, and bit-partitioned approach to overcome the challenges of mixed-signal acceleration. By partitioning and re-aggregating the low-bitwidth MACC operations, this paper addresses the limited range of encoding and reduces the cost of cross-domain conversions. Additionally, it combines the proposed mathematical reformulation with switched-capacitor circuitry to share and delay A/D conversions, which amortizes their cost and reduces their rate, respectively. Below, we discuss the most related works.
Switched-capacitor design. Switched-capacitor circuits have a long history, having been mainly used for designing amplifiers, A/D and D/A converters, and filters. Similar to resistive circuits, they have been used even for the previous generation of neural networks . More recently, they have also been used for matrix multiplication [82, 42], which can benefit DNNs. This work takes inspiration from these efforts but differs from them in that it defines and leverages a wide, interleaved, and bit-partitioned reformulation of DNN operations. Additionally, it offers a comprehensive architecture that can accelerate a wide variety of DNNs.
Programmable mixed-signal accelerators. PROMISE  offers a mixed-signal architecture that integrates analog units within the SRAM memory blocks. RedEye is a low-power near-sensor mixed-signal accelerator that uses charge-domain computations. These works do not offer wide interleavings of bit-partitioned basic operations as described in this paper.
or binarized mixed-signal acceleration of CIFAR-10 images. Another work focuses on spiking neural networks’ acceleration . In contrast, our design is programmable and supports interleaved bit-partitioning.
Resistive memory accelerators. There is a large body of work using resistive memory [32, 71, 78, 84, 85, 86, 87, 88]. We provided a direct comparison to ISAAC and PipeLayer . ISAAC most notably introduces the concept of temporally bit-serial operations, also explored in PRIME , and is augmented with the concept of a spike-based data scheme in PipeLayer . , in contrast, formulates a partitioning that spatially groups lower-bitwidth MACCs across different vector elements and performs them in parallel. PRIME does not provide absolute measurements and its simulated baseline is not available for a head-to-head comparison. PRIME also uses multiple truncations that change the mathematics. Conversely, our formulation does not induce truncation or mathematical changes.
This work proposes wide, interleaved, and bit-partitioned arithmetic to overcome two key challenges in mixed-signal acceleration of DNNs: limited encoding range, and costly A/D conversions. This bit-partitioned arithmetic enables rearranging the highly parallel MACC operations in modern DNNs into wide low-bitwidth computations that are mapped efficiently to mixed-signal units. Further, these units operate in charge domain using switched-capacitor circuitry and reduce the rate of A/D conversions by accumulating partial results in the charge domain. The resulting microarchitecture, named , offers significant benefits over its state-of-the-art analog and digital counterparts. These encouraging results suggest that the combination of mathematical insights with architectural innovations can enable new avenues in DNN acceleration.
- Niehues et al.  J. Niehues, N.-Q. Pham, T.-L. Ha, M. Sperber, and A. Waibel. Low-Latency Neural Speech Translation. ArXiv e-prints, August 2018.
- Mo and Sattar  J. Mo and J. Sattar. SafeDrive: Enhancing Lane Appearance for Autonomous and Assisted Driving Under Limited Visibility. ArXiv e-prints, July 2018.
- Li et al.  R. Li, Y. Shu, J. Su, H. Feng, and J. Wang. Using deep Residual Network to search for galaxy-Lyalpha emitter lens candidates based on spectroscopic-selection. ArXiv e-prints, July 2018.
- Rohde et al.  D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. ArXiv e-prints, August 2018.
- Grabec et al.  I. Grabec, E. Švegl, and M. Sok. Development of a sensory-neural network for medical diagnosing. ArXiv e-prints, July 2018.
- Esmaeilzadeh et al.  Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In ISCA, 2011.
- Hardavellas et al.  N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July–Aug. 2011.
- Venkatesh et al.  Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.
- Zhang et al. [2015a] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing fpga-based accelerator design for deep convolutional neural networks. In FPGA, 2015a.
- Esmaeilzadeh et al.  Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. To appear in Commun. ACM, 2013.
- Chen et al. [2014a] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machine-learning supercomputer. In MICRO, 2014a.
- Gao et al. [2017a] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. In ASPLOS, 2017a.
- Delmas et al.  Alberto Delmas, Sayeh Sharify, Patrick Judd, and Andreas Moshovos. Tartan: Accelerating fully-connected and convolutional layers in deep learning networks by exploiting numerical precision variability. arXiv, 2017.
- Mahajan et al.  Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh. TABLA: A unified template-based framework for accelerating statistical machine learning. In HPCA, 2016.
- Zhang et al.  Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-x: An accelerator for sparse neural networks. In MICRO, 2016.
- Albericio et al.  Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: ineffectual-neuron-free deep neural network computing. In ISCA, 2016.
- Judd et al.  Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. Stripes: Bit-serial deep neural network computing. In MICRO, 2016.
- Sharma et al.  Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kim, Chenkai Shao, Asit Misra, and Hadi Esmaeilzadeh. From high-level deep neural models to fpgas. In MICRO, 2016.
- Chung et al.  Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Christian Boehn, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Steve Reinhardt, Adam Sapek, Raja Seera, Balaji Sridharan, Lisa Woods, Phillip Yi-Xiao, Ritchie Zhao, and Doug Burger. Accelerating persistent neural networks at datacenter scale. In HotChips, 2017.
- Parashar et al.  Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In ISCA, 2017.
- Andri et al.  Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An ultra-low power convolutional neural network accelerator based on binary weights. arXiv, 2016.
- Han et al.  Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on compressed deep neural network. In ISCA, 2016.
- Chen et al.  Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ISCA, 2016.
- Chen et al. [2017a] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2017a.
- Kim et al.  Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 380–392. IEEE, 2016.
- Jouppi et al.  Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
- Chen et al. [2014b] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, 2014b.
-  Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks.
- Akhlaghi et al.  Vahideh Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi, Hadi Esmaeilzadeh, and Rajesh K. Gupta. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In ISCA, 2018.
- Hegde et al.  Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition. arXiv preprint arXiv:1804.06508, 2018.
- Lee et al.  Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo. Unpu: A 50.6 tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision. In ISSCC, 2018.
- Shafiee et al.  Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In ISCA, 2016.
- Srivastava et al.  Prakalp Srivastava, Mingu Kang, Sujan K Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam Sung Kim, and Naresh Shanbhag. Promise: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018.
- Tsividis and Anastassiou  YP Tsividis and D Anastassiou. Switched-capacitor neural networks. Electronics Letters, 23(18):958–959, 1987.
- LiKamWa et al.  Robert LiKamWa, Yunhui Hou, Julian Gao, Mia Polansky, and Lin Zhong. Redeye: analog convnet image sensor architecture for continuous mobile vision. In ACM SIGARCH Computer Architecture News, volume 44, pages 255–266. IEEE Press, 2016.
- Bankman and Murmann  Daniel Bankman and Boris Murmann. Passive charge redistribution digital-to-analogue multiplier. Electronics Letters, 51(5):386–388, 2015.
- Lee and Wong [2017a] E. H. Lee and S. S. Wong. Analysis and design of a passive switched-capacitor matrix multiplier for approximate computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, Jan 2017a. ISSN 0018-9200. doi: 10.1109/JSSC.2016.2599536.
- Bankman et al.  Daniel Bankman, Lita Yang, Bert Moons, Marian Verhelst, and Boris Murmann. An always-on 3.8 μJ/86% cifar-10 mixed-signal binary cnn processor with all memory on chip in 28nm cmos. In Solid-State Circuits Conference-(ISSCC), 2018 IEEE International, pages 222–224. IEEE, 2018.
- Buhler et al.  Fred N Buhler, Peter Brown, Jiabo Li, Thomas Chen, Zhengya Zhang, and Michael P Flynn. A 3.43 tops/w 48.9 pj/pixel 50.1 nj/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40nm cmos. In VLSI Circuits, 2017 Symposium on, pages C30–C31. IEEE, 2017.
- St. Amant et al.  Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In ISCA, 2014.
- Zhang et al. [2015b] Jintao Zhang, Zhuo Wang, and Naveen Verma. 18.4 a matrix-multiplying adc implementing a machine-learning classifier directly with data conversion. In Solid-State Circuits Conference-(ISSCC), 2015 IEEE International, pages 1–3. IEEE, 2015b.
- Lee and Wong [2017b] Edward H Lee and S Simon Wong. Analysis and Design of a Passive Switched-Capacitor Matrix Multiplier for Approximate Computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, 2017b.
- Gray et al.  Paul R Gray, Paul Hurst, Robert G Meyer, and Stephen Lewis. Analysis and design of analog integrated circuits. Wiley, 2001.
- Chi et al. [2016a] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ISCA, 2016a.
- Sharify et al.  Sayeh Sharify, Alberto Delmas Lascorz, Patrick Judd, and Andreas Moshovos. Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks. arXiv, 2017.
- Gao et al. [2017b] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. https://github.com/stanford-mast/nn_dataflow, 2017b.
- Li and Pedram  Yuanfang Li and Ardavan Pedram. Caterpillar: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks. In Application-specific Systems, Architectures and Processors (ASAP), 2017 IEEE 28th International Conference on, pages 1–10. IEEE, 2017.
- Upadhyay and Roy Chowdhury  Himani Upadhyay and Shubhajit Roy Chowdhury. A high speed and low power 8 bit x 8 bit multiplier design using novel two transistor (2t) xor gates. Journal of Low Power Electronics, 01 2015. doi: 10.1166/jolpe.2015.1362.
- Consortium et al.  Hybrid Memory Cube Consortium et al. Hybrid memory cube specification 1.0. Last Revision Jan, 2013.
- Jeddeloh and Keeth  Joe Jeddeloh and Brent Keeth. Hybrid memory cube new dram architecture increases density and performance. In VLSI Technology (VLSIT), 2012 Symposium on, pages 87–88. IEEE, 2012.
- Yazdanbakhsh et al.  Amir Yazdanbakhsh, Hajar Falahati, Philip J. Wolfe, Kambiz Samadi, Hadi Esmaeilzadeh, and Nam Sung Kim. GANAX: A Unified SIMD-MIMD Acceleration for Generative Adversarial Network. In ISCA, 2018.
- Ismail and Fiez  Mohammed Ismail and Terri Fiez. Analog VLSI: signal and information processing, volume 166. McGraw-Hill New York, 1994.
- Tripathi and Murmann  Vaibhav Tripathi and Boris Murmann. Mismatch characterization of small metal fringe capacitors. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(8):2236–2242, 2014.
- Eckert et al.  Yasuko Eckert, Nuwan Jayasena, and Gabriel H Loh. Thermal feasibility of die-stacked processing in memory. 2014.
-  Facebook AI Research. Caffe2. https://caffe2.ai/.
- Krizhevsky  Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv, 2014.
- Deng et al.  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. URL http://image-net.org/.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
- Hubara et al.  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv, 2016.
- Krizhevsky and Hinton  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Redmon and Farhadi  Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
- Marcus et al.  Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 1993.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
- Zhou et al.  Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016.
- Mishra et al.  Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. WRPN: wide reduced-precision networks. arXiv, 2017.
- Li et al.  Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv, 2016.
- Zhang et al.  Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. arXiv preprint arXiv:1807.10029, 2018.
-  Nvidia tensor rt 5.1. https://developer.nvidia.com/tensorrt.
- Song et al.  Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. Pipelayer: A pipelined reram-based accelerator for deep learning. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pages 541–552. IEEE, 2017.
- NCSU  NCSU. Freepdk45, 2018. URL https://www.eda.ncsu.edu/wiki/FreePDK45.
-  B. Murmann. ADC Performance Survey 1997-2016. [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html.
- Harpe  Pieter Harpe. A 0.0013 mm² 10b 10ms/s sar adc with a 0.0048 mm² 42db-rejection passive fir filter. In 2018 IEEE Custom Integrated Circuits Conference, CICC 2018. Institute of Electrical and Electronics Engineers Inc., 2018.
- Li et al.  S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi. CACTI-P: Architecture-level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques. In ICCAD, 2011.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Zmora et al.  Neta Zmora, Guy Jacob, and Gal Novik. Neural network distiller, June 2018. URL https://doi.org/10.5281/zenodo.1297430.
- Long et al.  Yun Long, Taesik Na, and Saibal Mukhopadhyay. Reram-based processing-in-memory architecture for recurrent neural network acceleration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (99):1–14, 2018.
- Crols and Steyaert  Jan Crols and Michel Steyaert. Switched-opamp: An approach to realize full cmos switched-capacitor circuits at very low power supply voltages. IEEE Journal of Solid-State Circuits, 29(8):936–942, 1994.
- Fiorenza et al.  John K Fiorenza, Todd Sepke, Peter Holloway, Charles G Sodini, and Hae-Seung Lee. Comparator-based switched-capacitor circuits for scaled cmos technologies. IEEE Journal of Solid-State Circuits, 41(12):2658–2668, 2006.
- Brodersen et al.  Robert W Brodersen, Paul R Gray, and David A Hodges. Mos switched-capacitor filters. Proceedings of the IEEE, 67(1):61–75, 1979.
- Bankman and Murmann  Daniel Bankman and Boris Murmann. An 8-bit, 16 input, 3.2 pj/op switched-capacitor dot product circuit in 28-nm fdsoi cmos. In Solid-State Circuits Conference (A-SSCC), 2016 IEEE Asian, pages 21–24. IEEE, 2016.
- Miyashita et al.  Daisuke Miyashita, Shouhei Kousai, Tomoya Suzuki, and Jun Deguchi. A neuromorphic chip optimized for deep learning and cmos technology with time-domain analog and digital mixed-signal processing. IEEE Journal of Solid-State Circuits, 52(10):2679–2689, 2017.
- Qiao et al.  Ximing Qiao, Xiong Cao, Huanrui Yang, Linghao Song, and Hai Li. Atomlayer: a universal reram-based cnn accelerator with atomic layer computation. In Proceedings of the 55th Annual Design Automation Conference, page 103. ACM, 2018.
- Ji et al.  Houxiang Ji, Linghao Song, Li Jiang, Hai Halen Li, and Yiran Chen. Recom: An efficient resistive accelerator for compressed deep neural networks. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pages 237–240. IEEE, 2018.
- Li et al.  Bing Li, Linghao Song, Fan Chen, Xuehai Qian, Yiran Chen, and Hai Helen Li. Reram-based accelerator for deep learning. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pages 815–820. IEEE, 2018.
- Chen et al. [2017b] Lerong Chen, Jiawen Li, Yiran Chen, Qiuping Deng, Jiyuan Shen, Xiaoyao Liang, and Li Jiang. Accelerator-friendly neural-network training: learning variations and defects in rram crossbar. In Proceedings of the Conference on Design, Automation & Test in Europe, pages 19–24. European Design and Automation Association, 2017b.
- Chi et al. [2016b] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH Computer Architecture News, volume 44, pages 27–39. IEEE Press, 2016b.