1 Introduction
Deep Neural Networks (DNNs) are revolutionizing a wide range of services and applications such as language translation [1], transportation [2], intelligent search [3], e-commerce [4], and medical diagnosis [5]. These benefits are predicated on hardware platforms delivering performance and energy efficiency. With the diminishing benefits from general-purpose processors [6, 7, 8, 9], there is an explosion of digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42] is also gaining traction. Albeit low-power, mixed-signal circuitry suffers from a limited range of information encoding, is susceptible to noise, imposes Analog-to-Digital (A/D) and Digital-to-Analog (D/A) conversion overheads, and lacks fine-grained control mechanisms. Realizing the full potential of mixed-signal technology requires a balanced design that brings mathematics, architecture, and circuits together.
This paper sets out to explore this conjunction of areas by inspecting the mathematical foundation of deep neural networks. Across a wide range of models, the large majority of DNN operations belong to convolution and fully-connected layers [23, 28, 32]. Consequently, based on Amdahl's Law, our architecture executes these two types of layers in the mixed-signal domain. Nevertheless, to maintain generality for the ever-expanding roster of other layers required by modern DNNs, the architecture handles those layers digitally. Normally, the convolution and fully-connected layers are broken down into a series of vector dot-products, each of which generates a scalar and comprises a set of Multiply-Accumulate (MACC) operations. State-of-the-art digital [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and mixed-signal [32, 33, 35, 36, 37, 38, 39, 40, 34, 43, 41, 42] accelerators use a large array of standalone MACC units to perform the necessary computations. When moving to the mixed-signal domain, this standalone arrangement of MACC operations imposes significant overhead in the form of A/D and D/A conversions for each operation. The root cause is the high cost of converting the operands and outputs of each MACC to and from the analog domain, respectively.
This paper aims to address these challenges by making the following three contributions.
(1) This work offers and leverages the insight that the set of MACC operations within a vector dot-product can be partitioned, rearranged, and interleaved at the bit level without affecting the mathematical integrity of the vector dot-product. Unlike prior work [33, 42, 44], this work does not rely on changing the mathematics of the computation to enable mixed-signal acceleration. Instead, it only rearranges the bitwise arithmetic calculations to utilize lower-bitwidth analog units for higher-bitwidth operations. The key insight is that a binary value can be expressed as a sum of products, similar to a dot-product, which is also a sum of multiplications (Σ_i x_i · w_i). A value v can be expressed as Σ_i 2^i · b_i, where the b_i are its individual bits, or as Σ_i 2^(4i) · p_i, where the p_i are 4-bit partitions, for instance. Our interleaved bit-partitioned arithmetic effectively utilizes the distributive and associative properties of multiplication and addition at the bit granularity.
The proposed model first bit-partitions all the elements of the two vectors, and then distributes the MACC operations of the dot-product over the bit-partitions. Therefore, the lower-bitwidth MACC becomes the basic operator that is applied to each bit-partition. Then, our mathematical formulation exploits the associative property of multiplication and addition to group bit-partitions that are at the same significance position. This significance-based rearrangement enables factoring out the power-of-two multiplicand that signifies the position of the bit-partitions. The factoring enables performing the wide, group-based, low-bitwidth MACC operations simultaneously as a spatially parallel operation in the analog domain, while the group shares a single A/D converter. The power-of-two multiplicand is applied later, in the digital domain, to the accumulated result of the group operation. To this end, we re-architect vector dot-product as a series of wide (across multiple elements of the two vectors), interleaved, and bit-partitioned arithmetic and re-aggregation. Therefore, our reformulation significantly reduces the rate of costly A/D conversion by rearranging the bit-level operations across the elements of the vector dot-product. Using low-bitwidth operands for the analog MACCs provides a larger headroom between the value-encoding levels in the analog domain. This headroom tackles the limited range of encoding and offers higher robustness to noise, an inherent non-ideality of the analog mode. Additionally, using lower-bitwidth operands reduces the energy/area overhead imposed by the A/D and D/A converters, which roughly scales exponentially with the bitwidth of the operands.
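The reformulation above can be checked with a short sketch (Python used purely as executable notation, not the hardware; the 8-bit elements and 2-bit partitions are one illustrative configuration):

```python
# Illustrative check: a dot-product computed with low-bitwidth partition
# MACCs, grouped by significance, with the power-of-two shift factored out
# and applied once after each group's accumulation.

def partitions(v, part_bits=2, total_bits=8):
    """Little-endian bit-partitions: v == sum(p << (part_bits * i))."""
    mask = (1 << part_bits) - 1
    return [(v >> (part_bits * i)) & mask for i in range(total_bits // part_bits)]

def bit_partitioned_dot(xs, ws, part_bits=2, total_bits=8):
    n = total_bits // part_bits
    px = [partitions(x, part_bits, total_bits) for x in xs]
    pw = [partitions(w, part_bits, total_bits) for w in ws]
    acc = 0
    for i in range(n):            # significance position within x elements
        for j in range(n):        # significance position within w elements
            # Wide group of low-bitwidth MACCs across all vector elements:
            group_sum = sum(a[i] * b[j] for a, b in zip(px, pw))
            acc += group_sum << (part_bits * (i + j))   # one shift per group
    return acc

xs, ws = [200, 17, 99, 3], [45, 255, 128, 7]
assert bit_partitioned_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

Because the shift depends only on the pair of significance positions (i, j), it is applied once per group rather than once per multiplication, which is what permits a shared A/D conversion per group in the analog implementation.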
(2) At the circuit level, the accelerator is designed using switched-capacitor circuitry that stores the partial results as electric charge over time, without conversion to the digital domain in each cycle. The low-bitwidth MACCs are performed in the charge domain with a set of charge-sharing capacitors. This design choice lowers the rate of A/D conversion, as it implements accumulation as gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the result of a group of low-bitwidth MACCs, but also enable accumulating results over time. As such, the architecture enables dividing longer vectors into shorter sub-vectors that are multiply-accumulated over time with a single group of low-bitwidth MACCs. The results are accumulated over multiple cycles in the group's capacitors. Because the capacitors can hold the charge from cycle to cycle, A/D conversion is not necessary in each cycle. This reduction in the rate of A/D conversion is in addition to the amortized cost of the A/D converter across the bit-partitioned analog MACCs of the group.
(3) Based on these insights, we devise a clustered 3D-stacked microarchitecture that integrates a copious number of low-bitwidth switched-capacitor MACC units, enabling the interleaved bit-partitioned arithmetic. The lower energy of mixed-signal computation offers the possibility of integrating a larger number of these units compared to their digital counterparts. To efficiently utilize this more sizable number of compute units, a higher-bandwidth memory subsystem is needed. Moreover, one of the largest sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. Based on these insights, the clustered architecture leverages 3D-stacking for its higher bandwidth and lower data-transfer energy.
Evaluating this carefully balanced design with ten DNN benchmarks shows that it delivers a significant speedup over the leading purely digital 3D-stacked DNN accelerator [12], with only a 0.5% loss in accuracy, achieved after mitigating noise, computation error, and Process-Voltage-Temperature (PVT) variations. With 8-bit execution, it offers higher Performance-per-Watt than the RTX 2080 Ti and Titan Xp GPUs. With these benefits, this paper marks an initial effort that paves the way for a new shift in DNN acceleration.
2 Wide, Interleaved, and Bit-Partitioned Arithmetic
A key idea of this work is the mathematical insight that enables utilizing low-bitwidth mixed-signal units in spatially parallel groups. This section demonstrates this insight.
Bit-level partitioning and interleaving of MACCs. To further detail the proposed mathematical reformulation, Figure 1(a) delves into the bit-level operations of dot-product on two-element vectors containing 4-bit values. As illustrated with different colors, each 4-bit element can be written as a sum of 2-bit partitions multiplied by powers of two (shifts). As discussed, vector dot-product is also a sum of multiplications. Therefore, by utilizing the distributive property of addition and multiplication, we can rewrite the vector dot-product in terms of the bit-partitions. However, we also leverage the associativity of addition and multiplication to group the bit-partitions at the same positions together. For instance, in Figure 1, the black partitions, which represent the Most Significant Bits (MSBs) of the first vector, are multiplied in parallel with the teal partitions (teal is the darkest gray in black-and-white prints), which represent the MSBs of the second vector. Because of the distributivity of multiplication over addition, the shift amount of (2+2) can be applied after the bit-partitions are multiply-accumulated. The different colors of the boxes in Figure 1 illustrate the interleaved grouping of the bit-partitions. Each group is a set of spatially parallel bit-partitioned MACC operations drawn from different elements of the two vectors. The low-bitwidth nature of these operations enables execution in the analog domain without the need for A/D conversion for each individual bit-partitioned operation. As such, our proposed reformulation amortizes the cost of A/D conversion across the bit-partitions of different elements of the vectors, as elaborated below.
Wide, interleaved, and bit-partitioned vector dot-product. Figure 1(b) illustrates the proposed vector dot-product operation with 4-bit elements that are bit-partitioned into 2-bit sub-elements. For instance, as illustrated, the elements of vector x are first bit-partitioned into a low and a high part: the former holds the two Least Significant Bits (LSBs) and the latter the two Most Significant Bits (MSBs). Similarly, the elements of vector w are bit-partitioned into low and high sub-elements. Then, each vector (e.g., x) is rearranged into two bit-partitioned sub-vectors, one gathering the low partitions and one the high partitions. In the current implementation of the architecture, the size of the bit-partitions is fixed across the entire architecture. Therefore, the rearrangement is merely a rewiring of the bits to the compute units, which imposes minimal overhead (less than 1%). Figure 1 is merely an illustration and there is no need for extra storage or movement of elements. As depicted with color coding, after the rewiring, the low sub-vector holds all the least-significant bit-partitions from the different elements of vector x, while the MSBs are rewired into the high sub-vector. The same rewiring is repeated for vector w. This rearrangement puts all the bit-partitions with the same significance from all the elements of a vector into one group. Therefore, when a pair of groups (e.g., the pair shown in Figure 1(c)) is multiplied to generate the partial products, (1) the shift amount is the same for all the bit-partitions, and (2) the shift can be applied after the partial products from different sub-elements are accumulated together.
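The rewiring into significance groups can be sketched as follows (the names xL/xH/wL/wH for the sub-vectors are our own shorthand; the 4-bit elements and 2-bit partitions follow the Figure 1 example):

```python
# Sketch of the Figure 1 arrangement: 4-bit elements, 2-bit partitions.
def split(v):                          # value -> (LSB pair, MSB pair)
    return v & 0b11, (v >> 2) & 0b11

x = [13, 6, 9, 2]                      # 4-bit elements of vector x
w = [7, 15, 4, 11]                     # 4-bit elements of vector w
xL, xH = zip(*[split(v) for v in x])   # rewired significance sub-vectors
wL, wH = zip(*[split(v) for v in w])

def group_macc(a, b):                  # one wide, spatially parallel group
    return sum(ai * bi for ai, bi in zip(a, b))

# One shift per group pair, applied after the group is accumulated:
dot = (group_macc(xL, wL) << 0) + (group_macc(xH, wL) << 2) \
    + (group_macc(xL, wH) << 2) + (group_macc(xH, wH) << 4)

assert dot == sum(a * b for a, b in zip(x, w))
```

Note that each of the four group products needs only one shared shift (0, 2, 2, and 4 bits respectively), which is what lets a single A/D conversion serve an entire group in the analog implementation.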
As shown in Figure 1(c), the low-bitwidth elements are multiplied together and accumulated in the analog domain. Accumulation in the digital domain would require an adder tree, which is costly compared to analog accumulation that merely requires connectivity between the multiplier outputs. It is only after several analog multiply-accumulations that the results are converted back to digital for shifting and aggregation with the partial products from the other groups. The size of the vectors usually exceeds the number of parallel low-bitwidth MACCs, in which case the results need to be accumulated over multiple iterations. As will be discussed in the next section, the accumulations are performed in two steps. The first step accumulates the results in the analog domain through charge accumulation in capacitors before the A/D converters (see Figure 1(c)). In the second step, these converted accumulations are added up in the digital domain using a register. For this pattern of computation, we effectively utilize the distributive and associative properties of multiplication and addition for dot-product, but at the bit granularity. This rearranged and spatially parallel (i.e., wide) bit-partitioned computation is in contrast with temporally bit-serial digital [17, 13, 31, 45] and analog [32] DNN accelerators.
The next section describes the architecture of the mixed-signal accelerator that leverages our mathematical reformulation. This architecture is essentially a collection of the structure depicted in Figure 1(c): a Mixed-Signal Wide Aggregator that spatially aggregates the results from its four units, as illustrated. Each of these four units, which are also wide, is a Mixed-Signal Bit-Partitioned MACC unit. Note that the number of bit-partitioned MACC units in an aggregator is a function of the bitwidth of the vector elements and the size of the bit-partitioning.
3 Mixed-Signal Architecture Design for Wide Bit-Partitioning
To exploit the aforementioned arithmetic, the design comes with a mixed-signal building block that performs wide bit-partitioned vector dot-product. It then organizes these building blocks in a clustered, hierarchical design to efficiently use its copious number of parallel low-bitwidth mixed-signal MACC units. The clustered design is crucial, as the mixed-signal paradigm enables integrating a larger number of parallel operators than the digital counterpart.
3.1 Wide Bit-Partitioned Mixed-Signal MACC
As Figure 2(a) shows, the building block of the architecture is a collection of low-bitwidth analog MACCs that operate in parallel on sub-elements from the two vectors under dot-product. This wide structure is the bit-partitioned MACC unit. We design the low-bitwidth MACCs using switched-capacitor circuitry because it lowers the rate of A/D conversion: accumulation is implemented as gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the results of the low-bitwidth MACCs, but also enable accumulating results over time. As such, longer vectors are divided into shorter sub-vectors that are multiply-accumulated over time without the need to convert intermediate results back to the digital domain. It is only after processing multiple sub-vectors that the accumulated result is converted to digital, significantly reducing the rate of costly A/D conversions. As shown in Figure 2(a), each low-bitwidth MACC unit is equipped with its own pair of local capacitors, which perform the accumulation over time across multiple sub-vectors. As discussed in Section 4, the pair handles positive and negative values by accumulating them separately on one or the other capacitor. After a predetermined number of private accumulations in the analog domain, the partial results need to be accumulated across the low-bitwidth MACCs. In that cycle, the transmission gates between the capacitors (Figure 2(a)) connect them, and a simple charge sharing between the capacitors yields the accumulated result for the unit. That is when a single A/D conversion is performed, the cost of which is amortized not only across the parallel MACC units but also over time across multiple sub-vectors.
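A toy model of this two-level amortization (our own sketch, not the circuit; the choice of 8 parallel units is a hypothetical size for illustration):

```python
# Toy model: per-unit capacitors accumulate partial products across
# sub-vectors over multiple cycles; a single shared A/D conversion happens
# only after the whole vector has been processed.
def wide_macc_over_time(x_parts, w_parts, parallel=8):
    caps = [0] * parallel                    # per-unit accumulating capacitors
    for start in range(0, len(x_parts), parallel):
        for k in range(parallel):            # one cycle: a wide group of MACCs
            if start + k < len(x_parts):
                caps[k] += x_parts[start + k] * w_parts[start + k]
    conversions = 1                          # charge sharing + one A/D at the end
    return sum(caps), conversions

total, convs = wide_macc_over_time(list(range(32)), list(range(32)))
assert total == sum(i * i for i in range(32)) and convs == 1
```

With 8 units accumulating over 4 cycles, one conversion serves 32 MACC operations, whereas a standalone arrangement would require 32 conversions.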
3.2 Mixed-Signal Wide Aggregator
The bit-partitioned MACC units only process low-bitwidth operands; on their own, they cannot combine these operations into higher-bitwidth dot-products. A collection of them can provide this capability, as discussed with Figure 1 in Section 2. This structure is the Mixed-Signal Wide Aggregator. Figure 2(b) depicts a 2D array of a possible design, comprising the 16 bit-partitioned MACC units necessary to perform an 8-bit by 8-bit vector dot-product with 2-bit partitioning. In this case, the number 16 comes from the fact that each of the two 8-bit operands can be partitioned into four 2-bit values. Each of the four 2-bit partitions of the multiplicand needs to be multiply-accumulated with all four of the multiplier's 2-bit partitions. As discussed in Section 2, each aggregator also performs the necessary shift operations to combine the low-bitwidth results from its 16 units. By aggregating these partial results, the aggregator generates a scalar output, which is stored in its output register. As illustrated in Figure 2, a collection of these aggregators constitutes an accelerator core, from which the clustered architecture is built.
3.3 Hierarchically Clustered Architecture
As discussed in Section 4, the proposed mixed-signal design consumes less energy for a single 8-bit MACC than digital logic (around 1 pJ, taken from the simulator of [46], which is commensurate with other reports [47, 48]). As such, it is possible to integrate a larger number of mixed-signal compute units on a chip within a given power budget compared to a digital architecture. To efficiently utilize the larger number of available compute units, a high-bandwidth memory substrate is required. Moreover, one of the largest sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. To maximize the benefits of mixed-signal computation, 3D-stacked memory is an attractive option, since it reduces the cost of data accesses and provides higher bandwidth for data transfer between on-chip compute and off-chip memory [12, 25]. Based on these insights, we devise a clustered architecture with a 3D-stacked memory substrate, as shown in Figure 2(c). The mixed-signal logic die is stacked over the DRAM dies with multiple vaults, each of which is connected to the logic die with several through-silicon vias (TSVs). The 3D memory substrate is modeled using Micron's Hybrid Memory Cube (HMC) [49, 50], which has been shown to be a promising technology for DNN acceleration [12]. As the results in Figure 15 of Section 8.2 show, a flat systolic design would result in significant underutilization of the compute resources and bandwidth of 3D stacking.
Therefore, the accelerator is a hierarchically clustered architecture that allocates multiple accelerator cores as a cluster to each vault. Figure 2(b) depicts a single core. As shown, each core is self-sufficient and packs a mixed-signal systolic array of aggregator units as well as the digital units that perform pooling, activation, normalization, etc. The mixed-signal array is responsible for the convolutional and fully-connected layers. Generally, wide and interleaved bit-partitioned execution within the aggregators is orthogonal to the organization of the accelerator architecture. This paper explores how to embed them, and the proposed compute model, within a systolic design that enables end-to-end programmable mixed-signal acceleration for a variety of DNNs.
Accelerator core. As Figure 2(b) depicts, the first level of the hierarchy is the accelerator core and its 2D systolic array of aggregator units. As depicted, the Input Buffers and Output Buffers are shared across the columns and rows, respectively. Each aggregator has its own Weight Buffer. This organization is commensurate with other designs and reduces the cost of on-chip data accesses, as inputs are reused with multiple filters [26]. What makes our design different is that each buffer needs to supply a sub-vector, not a scalar, to the aggregator in each cycle, while the aggregator generates only a scalar, since a dot-product produces a scalar output. The rewiring of the inputs and weights is already done inside the aggregator, since the size of the bit-partitions is fixed. As such, there is no need to reformat any of the inputs, activations, or weights. As the outputs of the aggregators flow down the columns, they are accumulated to generate the output activations that are fed to each column's dedicated Normalization/Activation/Pooling unit. To preserve the accuracy of the DNN model, the intermediate results are stored as 32-bit digital values, and intra-column aggregations are performed in the digital domain.
On-chip data delivery for accelerator cores. To minimize data-movement energy and maximally exploit the large degree of data-reuse offered by DNNs, the architecture uses a statically-scheduled bus that is capable of multicasting/broadcasting data across accelerator cores. Compared to complex interconnections, the choice of a statically-scheduled bus significantly simplifies the hardware by alleviating the need for the complicated arbitration logic and FIFOs/buffers required for dynamic routing. Moreover, the static schedule enables the compiler stack to cut the DNN layers across cores while maximizing inter- and intra-core data-reuse. The static schedule is encoded in the form of data-communication instructions (Section 7) that are responsible for (1) fetching data tiles from the 3D-stacked memory and distributing them across cores or (2) writing output tiles back from the cores to memory.
Parallelizing computations across accelerator cores. Data-movement energy is a significant portion of the overall energy consumption both for digital designs [12, 23, 24, 28, 30, 51] and analog designs [33, 35]. As such, the clustered architecture (1) divides the computations into tiles that fit within the limited on-chip capacity of the scratchpads, which are private to each accelerator core, and (2) cuts the tiles of computation across cores to minimize DRAM accesses by maximally utilizing the multicast/broadcast capabilities of the on-chip data-delivery network. To simplify the design of the accelerator cores, the scratchpad buffers are private to each core and the shared data is replicated across multiple cores. Thus, a single tile of data can be read once from the 3D-stacked memory and then be broadcast/multicast across cores to reduce DRAM accesses. The cores use double-buffering to hide the latency of memory accesses for subsequent tiles. The accelerator cores use an output-stationary dataflow that minimizes the number of A/D conversions by accumulating results in the charge domain. Section 6 discusses the compiler stack that optimizes the cuts and tile sizes for individual DNN layers.
4 Switched-Capacitor Circuit Design for Bit-Partitioning
The design exploits switched-capacitor circuitry [36, 34, 43, 42, 41], implementing MACC operations in the charge domain rather than using resistive ladders that compute in the current domain [32, 40, 44]. Compared to the current-domain approach, switched capacitors (1) enable result accumulation in the analog domain by storing results as electric charge, eliminating the need for A/D conversion in every cycle, and (2) make multiplications depend only on the ratio of the capacitor sizes rather than their absolute capacitances. The second property enables a reduction of capacitor sizes, improving the energy and area of the MACC units as well as making them more resilient to process variation. The following discusses the details of the circuitry.
4.1 Low-Bitwidth Switched-Capacitor MACC
Figure 3 depicts the design of a single 3-bit sign-magnitude MACC. The operands x and w denote the bit-partitions. The result of each MACC operation is retained as electric charge in the accumulating capacitor (C_acc). In addition to C_acc, the MACC unit contains two capacitive Digital-to-Analog Converters, one for inputs (i-DAC) and one for weights (w-DAC). The i-DAC and w-DAC convert the 2-bit magnitudes of the input and weight to the analog domain as electric charges proportional to x and w, respectively. The i-DAC and w-DAC are each composed of two capacitors, which operate in parallel and are combined to convert the operands to the analog domain. Each of these capacitors is controlled by a pair of transmission gates which determine whether the capacitor is active or inactive. Another set of transmission gates connects the two DACs and shares charge when the partitions of x and w are multiplied. The resulting shared charge is stored on either the positive or the negative accumulating capacitor, depending on the "sign" control signal derived from the operands' sign bits. During multiplication, the transmission gates are coordinated by a pair of complementary non-overlapping clock signals.
Charge-domain MACC. Figure 4 shows the phase-by-phase process of a MACC and its corresponding active circuits, the phases of which are described below.
φ1: The first phase (Figure 4(a)) consists of the input capacitive DAC (i-DAC) converting the digital input x to a charge proportional to the magnitude of the input. As a result, the sampled charge Q_s in the i-DAC in the first phase is equal to:

Q_s = x · C_i0 · V_DD    (1)

where C_i0 is the unit capacitance of the i-DAC and V_DD is the supply voltage.
φ2: In the second phase (Figure 4(b)), the multiplication happens via a charge-sharing process between the i-DAC and the w-DAC. The w-DAC converts the weight w to the charge domain. At the same time, the i-DAC redistributes its sampled charge Q_s over all of its capacitors (total capacitance C_i,tot) as well as the equivalent capacitance of the w-DAC (w · C_w0). The voltage V_share at the junction of the i-DAC and w-DAC is as follows:

V_share = Q_s / (C_i,tot + w · C_w0)    (2)
Because the sampled charge is shared with the weight capacitors, the stored charge Q_w on the w-DAC is equal to:

Q_w = (w · C_w0) · V_share = (x · w · C_i0 · C_w0 · V_DD) / (C_i,tot + w · C_w0)    (3)
Equation 3 shows that the stored charge on the w-DAC is proportional to the product x · w, but includes a nonlinearity due to the w · C_w0 term in the denominator. To suppress this nonlinearity, the capacitances must be chosen such that C_i,tot ≫ C_w,tot. Although this design choice does not completely suppress the nonlinearity, the residual error can be mitigated as discussed in Section 5. With this choice, Q_w becomes approximately (C_i0 · C_w0 / C_i,tot) · x · w · V_DD.
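A quick numerical illustration of this suppression, using a first-order charge-sharing model with made-up capacitor values (the real circuit behavior is more involved):

```python
# Relative deviation of the shared charge from an ideal linear x*w product,
# caused by the weight-dependent term in the denominator (x cancels out).
# C_i_tot: total input-DAC capacitance; C_wu: unit weight capacitance.
def rel_error(w, C_i_tot, C_wu):
    return (w * C_wu) / (C_i_tot + w * C_wu)

# Comparable capacitors -> large nonlinearity; a large ratio suppresses it.
assert rel_error(3, C_i_tot=1.0, C_wu=1.0) == 0.75
assert rel_error(3, C_i_tot=100.0, C_wu=1.0) < 0.03
```

The residual error shrinks roughly in proportion to the capacitance ratio, which is why the sizing constraint alone cannot fully eliminate it and algorithmic mitigation is still needed.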
φ3: In the last phase (Figure 4(c)), the charge from the multiplication is shared with C_acc for accumulation. The sign bits of the operands determine which of the two accumulating capacitors (positive or negative) is selected for accumulation. The charge sampled by the w-DAC is then redistributed over the selected C_acc as well as all the capacitors of the w-DAC (C_w,tot). Theoretically, C_acc must be infinitely larger than C_w,tot to completely absorb the charge from the multiplication. In reality, however, some charge remains unabsorbed, leading to a systematic computational error, which is mitigated as discussed in Section 5. Ideally, the voltage on C_acc after n MACC cycles is:

V_acc(n) = (1/C_acc) · Σ_{t=1}^{n} Q_w(t)    (4)
While the charge sharing and accumulation happen on C_acc, a new input is fed into the i-DAC, starting a new MACC process in a pipelined fashion. This process repeats for all the low-bitwidth MACC units over multiple cycles before a single A/D conversion.
4.2 Wide Mixed-Signal Bit-Partitioned MACC
Figure 5(a) depicts an array of switched-capacitor MACCs, constituting the wide bit-partitioned unit, which perform operations for multiple cycles in the analog domain and store the results locally on their accumulating capacitors. Figure 5(b) depicts the control signals and cycles of operation. For the microarchitecture, the number of parallel MACCs and the number of accumulation cycles are selected based on a design space exploration (see Figure 14). Over these cycles, the results of the low-bitwidth MACC operations accumulate in the C_acc private to each MACC unit. In the final cycle, the private results are aggregated across all the MACC units within the wide unit. The single A/D converter in the unit is responsible for converting the aggregated result, a process which also starts in that cycle.
In the first phase of the final cycle, all the accumulating capacitors which store the positive values are connected together through a set of transmission gates to share their charge. Simultaneously, the same process happens for the negative ones. A dedicated control signal shown in Figure 5 connects the accumulating capacitors. They are also connected to a Successive Approximation Register (SAR) ADC and share their stored charge with the Sample-and-Hold (S&H) block of the ADC. The S&H block has differential inputs which sample the positive and negative results separately, subtract them, and hold the difference for the A/D conversion. In the second phase of that cycle, all the accumulating capacitors are connected to ground to clear them for the next iteration of wide, bit-interleaved calculations.
There is a tradeoff between the resolution and sampling rate of an ADC, which also determines its topology. A SAR ADC is the better choice for medium resolutions (8-12 bits) and sampling rates (1-500 MegaSamples/sec). We choose a 10-bit, 15 MegaSamples/sec SAR ADC, as it strikes the best balance between speed and resolution for this design. The design space exploration in Figure 14 shows that this choice makes a grouping of 8 low-bitwidth MACCs optimal over multiple cycles of operation. The A/D conversion takes several cycles, pipelined with the sub-vector dot-product. Table 1 shows the energy breakdown within a unit that uses 2-bit partitioning. As shown, performing an 8-bit MACC using the interleaved bit-partitioned arithmetic requires significantly less energy than a digital MACC, which consumes around 1 pJ [12].
Table 1: Energy breakdown within the unit (2-bit partitioning).

Units                          Energy
1 MACC                         5.1 fJ
256 MACCs                      1,305.6 fJ
SAR ADC (for 256 MACCs)        1,660.0 fJ
Total Energy                   2,965.6 fJ
Total Energy per 2b×2b MACC    11.6 fJ
Total Energy per 8b×8b MACC    185.3 fJ
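The per-MACC figures follow from amortizing the shared ADC over the 256 low-bitwidth MACCs and composing an 8-bit MACC from 16 2-bit partition pairs; the arithmetic (our own cross-check, energies taken from the table) is consistent:

```python
# Cross-check of the energy table (all values in femtojoules).
macc_array = 256 * 5.1        # 256 low-bitwidth MACCs at 5.1 fJ each
adc = 1660.0                  # one SAR ADC shared by the 256 MACCs
per_2b = (macc_array + adc) / 256
per_8b = per_2b * 16          # 16 2b x 2b partition pairs per 8b x 8b MACC

assert abs(per_2b - 11.6) < 0.05      # matches "per 2b x 2b MACC"
assert abs(per_8b - 185.3) < 0.1      # matches "per 8b x 8b MACC"
```

Even with the ADC dominating the breakdown, the amortized 8-bit MACC energy stays well below the roughly 1 pJ of a digital MACC.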
5 Mixed-Signal Non-Idealities and Their Mitigation
Although analog circuitry offers a significant reduction in energy, it can degrade accuracy. Thus, its error needs to be properly modeled and accounted for. Specifically, the wide bit-partitioned MACC unit, the main analog component, can be susceptible to (1) thermal noise, (2) computational error caused by incomplete charge transfer, and (3) PVT variations. Traditionally, analog circuit designers mitigate sources of error by configuring hardware parameters to values that are robust to non-idealities. Such hardware parameter adjustments impose rather significant energy/area overheads that scale linearly with the number of modules. The overheads are acceptable in conventional analog designs, since the modules are few in number. However, due to the repetitive and scaled-up nature of our design, we need to mitigate these non-idealities at a higher, algorithmic level. We leverage the training algorithm's inherent mechanism to reduce error (loss) and use mathematical models to represent these non-idealities. We then apply these models during the forward pass to adjust and fine-tune pre-trained neural models with just a few more epochs, across the chips within a technology node. The rest of this section details the non-idealities and their modeling; it then elaborates on how PVT variations are considered in the formulations.

5.1 Thermal Noise
Thermal noise is an inherent perturbation in analog circuits caused by the thermal agitation of electrons, which distorts the main signal. It can be modeled as a normal distribution in which the ideal voltage deviates with a standard deviation set by the working temperature (T), the Boltzmann constant (k), and the capacitor size (C): σ = √(kT/C). Within a MACC unit, the switched-capacitor circuits are mainly affected by the combined thermal noise of the weight and accumulator capacitors (C_w,tot and C_acc, respectively). The noise from these capacitors accumulates during the n cycles of computation in each individual MACC unit and is then aggregated across the N MACC units of the wide unit. Applying the thermal noise equation used for similar MACC units [42] to the wide unit, the standard deviation at the output is described by Equation 5:

σ_out = √( n · N · ( kT/C_w,tot + kT/C_acc ) )    (5)

where n is the number of accumulation cycles and N is the number of low-bitwidth MACC units in the wide unit.
We apply the effect of thermal noise in the forward propagation of the DNN by adding an error tensor to the output of the convolutional and fully-connected layers. Having computed the standard deviation of noise for a single wide unit (σ_out), each element e of the error tensor is sampled from a normal distribution as follows:

e ∼ N(0, M · σ_out²)    (6)
In the above equation, σ_out is scaled by M, which accounts for the number of wide dot-product operations required to generate one element of the output feature map, as well as for the total bit-shift applied to each result by the aggregator unit.
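A sketch of how such an error tensor can be injected during the forward pass (our own toy model; the capacitances, cycle count, and unit count are illustrative values, not the paper's configuration):

```python
import math, random

K_B = 1.380649e-23                     # Boltzmann constant (J/K)

def output_sigma(C_w, C_acc, n_cycles, n_maccs, T=300.0):
    """kT/C noise, accumulated over n_cycles and aggregated over n_maccs units."""
    per_cycle_var = K_B * T / C_w + K_B * T / C_acc
    return math.sqrt(n_maccs * n_cycles * per_cycle_var)

random.seed(0)
sigma = output_sigma(C_w=1e-15, C_acc=4e-15, n_cycles=32, n_maccs=8)
layer_out = [0.50, -0.20, 0.90]        # pretend pre-activation outputs (volts)
noisy_out = [y + random.gauss(0.0, sigma) for y in layer_out]
assert sigma > 0 and len(noisy_out) == len(layer_out)
```

Sampling such a tensor anew in every forward pass lets the fine-tuning loss absorb the noise statistics rather than any one noise realization.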
5.2 Computational Error
Another source of error in the charge-domain computation arises when charge is shared between capacitors during multiplication and accumulation. Within each MACC unit, the input capacitors transfer a sampled charge to the weight capacitors to produce a charge proportional to the multiplication result, but the resulting charge is subject to an error dependent on the ratio of the weight and input capacitor sizes, as shown in Equation 3. This shared charge in the weight capacitors introduces further error when it is redistributed to the accumulating capacitor (C_acc), which cannot absorb all of the charge, leaving a small portion on the weight capacitors in subsequent cycles. The ideal voltage produced after n cycles of multiplication can be derived from Equation 4 as follows:

V_ideal(n) = (1/C_acc) · Σ_{i=1}^{n} Q_w(i)    (7)
By considering the computational error from incomplete charge sharing, the actual voltage at the accumulating capacitor after n cycles of MACC operations becomes:

V_actual(n) = (1/C_acc) · Σ_{i=1}^{n} (C_acc / (C_acc + C_w,tot))^(n-i+1) · Q_w(i)    (8)
Computational error is accounted for in the finetuning pass by including the multiplicative factors shown in Equation 8 in weights. During the forward pass, the finetuning algorithm decomposes weight tensors in convolutional and fullyconnected layers into groups corresponding to configuration and updates the individual weight values () to new values () with the computational error in Equation 9:
(9) 
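A minimal sketch of folding the error into the weights, assuming a hypothetical per-cycle attenuation factor in place of Equation 8's exact terms:

```python
def fold_error_into_weights(w_group, factors):
    """Fold per-cycle multiplicative error factors into one weight group.

    w_group holds the n weights one MACC unit applies over n accumulation
    cycles; factors[k] is the attenuation the k-th cycle's product suffers
    by the end of accumulation (an assumed stand-in for Equation 8's terms).
    """
    return [w * f for w, f in zip(w_group, factors)]

# Hypothetical geometric attenuation: earlier cycles lose more charge,
# since their residue is redistributed more times.
alpha, n = 0.99, 8
factors = [alpha ** (n - k) for k in range(1, n + 1)]
w_new = fold_error_into_weights([1.0] * n, factors)
```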
5.3 ProcessVoltageTemperature Variations
Process variations. Switched-capacitor circuits are generally robust to process variations, and we size the capacitors to provision for and mitigate them. The robustness and mitigation are effective because each capacitor is implemented as a number of smaller unit capacitors laid out with the common-centroid technique [52]. Specifically, we use metal-fringe capacitors for the MACCs, which exhibit a mismatch of just 1% standard deviation [53] with a maximum variation of 6% (), well below the error margins considered for the computational correctness of .
Temperature variations. We model temperature variations by adding a perturbation term to in Equation 5. We take the maximum temperature to be 358 K, commensurate with existing practices [54], and the minimum to be 300 K; this range is treated as the peak-to-peak span of a Gaussian distribution ().

Voltage variations. We also model voltage variations by adding a Gaussian perturbation to the term in Equation 9. Our experiments show that voltage variations of up to 20% can be mitigated. The extensive number of vector dot-product operations in DNNs allows the minimum and maximum values of these distributions to be sampled a sufficient number of times, covering the corner cases.
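The temperature sampling can be sketched as follows; treating the stated peak-to-peak range as a +/-3-sigma span is our assumption, as is the effective capacitance:

```python
import math
import random

K_B = 1.380649e-23  # Boltzmann constant, J/K
C_EFF = 1e-15       # F, assumed effective capacitance

def sample_temperature(rng, t_min=300.0, t_max=358.0):
    """Sample a die temperature whose Gaussian peak-to-peak span covers
    [t_min, t_max]; the +/-3-sigma interpretation and the clamp are our
    assumptions, since the paper only states the peak-to-peak range."""
    mu = 0.5 * (t_min + t_max)
    sigma = (t_max - t_min) / 6.0
    return min(max(rng.gauss(mu, sigma), t_min), t_max)

def noise_std_at(temp_k):
    # Thermal noise std scales with sqrt(T), so a hotter die is noisier.
    return math.sqrt(K_B * temp_k / C_EFF)

rng = random.Random(0)
sigma = noise_std_at(sample_temperature(rng))
```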
On top of these considerations, we use differential signaling for the ADCs, which attenuates common-mode fluctuations such as PVT variations. To show the effectiveness of our techniques, Figure 6 plots the fine-tuning process for two benchmarks, ResNet50 and VGG16, over ten epochs. Table 4 summarizes the accuracy trends for all the benchmarks, which incur less than 0.5% loss. As Figure 6 shows, the fine-tuning pass reduces ResNet50's initial loss (0.73% for top-1 and 2.41% for top-5) to only 0.04% for top-1 and 0.02% for top-5. VGG16 behaves slightly differently, reducing the initial loss (1.16% for top-1 and 2.24% for top-5) to less than 0.18% for top-1 and 0.13% for top-5 validation accuracy. The trends are similar for the other benchmarks and are omitted due to space constraints.
6 Compiler Stack
As Figure 7 shows, DNNs are compiled to through a multi-stage process that begins with a Caffe2 [55] DNN specification file. The high-level specification in the Caffe2 file is translated to a layer DataFlow Graph (DFG) that preserves the structure of the network. The DFG then goes through an algorithm that cuts the DFG and tiles the data to map the DNN computations to the accelerator clusters and cores. The tiling also aims to minimize the transfer of model parameters from the 3D-stacked DRAM to the limited on-chip scratchpads on the logic die, while maximizing the utilization of the compute resources. In addition to the DFG, the cutting/tiling algorithm takes in the architectural specification of the . These specifications include the organizations and configurations (# rows, # columns) of the clusters, vaults, and cores, as well as details of the . To identify the best cuts and tilings, the cutting/tiling algorithm exhaustively searches the space of possibilities, enabled by an estimation tool. The tool estimates the total energy consumption and runtime for each cuts/tiles pair, which capture the data movement and resource utilization in . Estimation is viable because the DFG does not change, there is no hardware-managed cache, and the accelerator architecture is fixed during execution; thus, there are no irregularities that can hinder estimation. Algorithm 1 depicts the cutting/tiling procedure. Once cuts and tiles are determined, the compiler generates the binary code that contains the communication and computation instruction blocks. Commensurate with state-of-the-art accelerators
[12, 28, 23, 25, 18], all the instructions are statically scheduled. We extend static scheduling to cluster coordination, data communication, and transfer.

7 Instruction Set
The ISA exposes the following unique properties of the architecture to software: (1) efficient mixed-signal execution using bit-partitioning and capacitive accumulation, and (2) a clustered organization that leverages the power efficiency of mixed-signal acceleration to scale up the number of in . As such, uses a block-structured ISA that segregates the execution of the DNN into (1) data communication instruction blocks, which fetch tiles of data from the 3D-stacked memory and populate the on-chip scratchpads (Input Buffer/Weight Buffer/Output Buffer in Figure 2), and (2) compute instruction blocks, each of which consumes the tile of data produced by a corresponding communication instruction block and produces an output tile. The compiler stack statically assigns communication and compute instruction blocks to accelerator clusters, shifting complexity from hardware to the compiler. By splitting data transfer and on-chip data processing into separate instructions, the ISA enables software pipelining between clusters and allows memory accesses to run ahead, fetching data for the next tile while the current tile is processed.
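The double-buffered interplay of communication and compute blocks can be sketched as follows; `fetch` and `compute` are stand-ins for the hardware instruction blocks:

```python
def run_pipelined(tiles, fetch, compute):
    """Double-buffered schedule: fetch tile i+1 while computing tile i.

    A minimal sketch of the ISA's split between communication and compute
    instruction blocks; the real hardware overlaps these in time, whereas
    this sequential model only shows the ordering.
    """
    results = []
    buf = fetch(tiles[0])                 # communication block fills the scratchpad
    for i in range(len(tiles)):
        # Run-ahead fetch of the next tile (would overlap with compute).
        nxt = fetch(tiles[i + 1]) if i + 1 < len(tiles) else None
        results.append(compute(buf))      # compute block consumes the current tile
        buf = nxt
    return results

out = run_pipelined([1, 2, 3], fetch=lambda t: t * 10, compute=lambda b: b + 1)
```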
Compute instruction block. A block of compute instructions expresses the entire computation needed to produce a single tile in an accelerator core. Further, the compute block governs how the input data for a DNN layer is bit-partitioned and distributed across the wide aggregators within a single core. As such, the compiler has complete control over the read/write accesses to on-chip scratchpads, A/D and D/A conversion, and execution using the and digital blocks in an accelerator core. The granularity of bit-partitioning and charge-based accumulation is determined for each microarchitectural implementation based on the technology node and circuit design paradigm. To support different technology nodes and design styles, and to allow extensions to the architecture, the ISA therefore encodes the bit-partitioning and accumulation cycles. However, the design space must be explored to find the optimal choice for each combination of technology node and circuits (Section 8).
Communication instruction block. The key challenge in scaling up the design is minimizing data movement while parallelizing the execution of the DNN across the on-chip compute resources. To simplify the hardware, the instruction set captures the static schedule of data movement as a series of communication instruction blocks. Static scheduling is possible because the topology of the DNN does not change during inference, and the order of layers and neurons is known statically. The compiler stack assigns the communication blocks to the cores according to the order of the layers. This static ordering enables the use of a simple, statically scheduled bus instead of a more complex interconnect.
To maximize energy efficiency, it is imperative to exploit the high degree of data reuse offered by DNNs. To exploit this reuse when parallelizing computations across the cores of the architecture, the communication instructions support broadcasting/multicasting to distribute the same data to multiple cores, minimizing off-chip memory accesses. Once a communication block writes a tile of data to the on-chip scratchpads, that tile can be reused over multiple compute blocks to exploit temporal locality within a single accelerator core.
8 Evaluation
8.1 Methodology
DNN  Type  Domain  Dataset  Multiply-Adds  Model Weights 

AlexNet [56]  CNN  Image Classification  Imagenet [57]  2,678 MOps  56.1 MBytes 
CIFAR10 [58, 59]  CNN  Image Classification  CIFAR10 [60]  617 MOps  13.4 MBytes 
GoogLeNet [61]  CNN  Image Classification  Imagenet  1,502 MOps  13.5 MBytes 
ResNet18 [62]  CNN  Image Classification  Imagenet  4,269 MOps  11.1 MBytes 
ResNet50 [62]  CNN  Image Classification  Imagenet  8,030 MOps  24.4 MBytes 
VGG16 [58]  CNN  Object Recognition  Imagenet  31 GOps  131.6 MBytes 
VGG19 [58]  CNN  Object Recognition  Imagenet  39 GOps  137.3 MBytes 
YOLOv3 [63]  CNN  Object Recognition  Imagenet  19 GOps  39.8 MBytes 
PTBRNN [59]  RNN  Language Modeling  Penn TreeBank [64]  17 MOps  16 MBytes 
PTBLSTM [65]  RNN  Language Modeling  Penn TreeBank  13 MOps  12.3 MBytes 
Parameters  ASIC  Parameters  GPU  

Chip  Chip  RTX 2080 TI  Titan Xp  
MACCs  16,384  3,136  Tensor Cores  544  — 
Onchip Memory  9216 KB  3698 KB  Memory  11 GB (GDDR6)  12 GB (GDDR5X) 
Chip Area ()  122.3  56  Chip Area ()  754  471 
Total Dissipation Power  250 W  250 W  
Frequency  500 MHz  500 MHz  Frequency  1545 MHz  1531 MHz 
Technology  45 nm  45 nm  Technology  12 nm  16 nm 
Benchmarks. We use ten diverse CNN and RNN models, described in Table 2, to evaluate ; they perform image classification, real-time object detection (YOLOv3), and character-level (PTBRNN) and word-level (PTBLSTM) language modeling. This set of benchmarks spans medium- to large-scale models (from 11.1 MBytes to 137.3 MBytes) and a wide range of multiply-add counts (from 13 million to 39 billion).
Simulation infrastructure. We develop a cycle-accurate simulator and a compiler for that takes in a Caffe2 specification of the DNN, finds the optimal tiling and cutting for each layer, and maps it to the architecture. The simulator executes each optimized network on the architecture model and reports the total runtime and energy.
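The compiler's exhaustive cut/tile search can be sketched as below. The energy-delay-product objective and the toy estimator are our assumptions for illustration; the paper only states that the estimation tool reports total energy and runtime for each cuts/tiles pair:

```python
from itertools import product

def best_cut_and_tile(layer, cut_options, tile_options, estimate):
    """Exhaustively search (cut, tile) pairs, keeping the cheapest by the
    estimator's energy x runtime cost (a hypothetical figure of merit)."""
    best, best_cost = None, float("inf")
    for cut, tile in product(cut_options, tile_options):
        energy, runtime = estimate(layer, cut, tile)
        cost = energy * runtime  # assumed energy-delay-product objective
        if cost < best_cost:
            best, best_cost = (cut, tile), cost
    return best, best_cost

# Toy estimator: smaller tiles raise data movement, larger cuts add
# parallelism; purely illustrative, not the paper's model.
def toy_estimate(layer, cut, tile):
    energy = layer / tile + cut   # more, smaller tiles -> more transfers
    runtime = tile / cut          # bigger cuts -> more parallelism
    return energy, runtime

choice, cost = best_cut_and_tile(
    1024, cut_options=[1, 2, 4], tile_options=[64, 128, 256],
    estimate=toy_estimate)
```

Exhaustive search is tractable here precisely because, as noted above, the cost of each candidate can be estimated deterministically without simulation.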
comparison. We compare with , a state-of-the-art fully digital 3D-stacked dataflow accelerator. We match the on-chip power dissipation of and and compare the total runtime and energy, including the energy for DRAM accesses. We also perform an iso-area comparison, scaling the original from 16 vaults to 36 vaults to match its area to 's. The baseline supports 16-bit execution while supports 8-bit; for fairness, we modify the open-source simulator [46] and proportionally scale its runtime and energy. supports 8-bit operands since this representation, by itself, has virtually no impact on the final accuracy of the DNNs [59, 66, 67, 68, 69].
GPU comparison. We also compare to two Nvidia GPUs (RTX 2080 TI and Titan Xp), based on the Turing and Pascal architectures respectively, listed in Table 3. The RTX 2080 TI's Turing architecture provides tensor cores, specialized hardware for deep learning inference. We use 8-bit execution on the GPUs via Nvidia's own TensorRT 5.1 [70] library, compiled with the optimized cuDNN 7.5 and CUDA 10.1. For each DNN benchmark, we perform 1,000 warmup iterations and report the average runtime across 10,000 iterations.

Comparison with other recent accelerators. We also compare to the Google TPU [26], the mixed-signal CMOS RedEye [35], and two analog memristive accelerators. All the comparisons are at 8 bits. The original designs [32, 71] use 16 bits; scaling from 16-bit to 8-bit execution for the memristive designs would optimistically provide a increase in efficiency.
Energy and area measurement. All hardware modeling is performed using the FreePDK 45 nm standard cell library [72]. We implement the switched-capacitor MACCs in Cadence Analog Design Environment V6.1.3 and use Spectre SPICE V6.1.3 to model the system. We then use Cadence Layout XL to lay out the MACC units and extract their energy/area. The ADC's energy/area numbers are obtained from [73]. Based on the configuration, we use the ADC architecture from [74].
We implement all digital blocks of , including adders, shifters, interconnection, and accumulators, in Verilog RTL and use Synopsys Design Compiler (L-2016.03-SP5) to synthesize them and measure their energy and area. For the on-chip SRAM buffers, we use CACTI-P [75] to measure the energy and area of the memory blocks. The 3D-stacked DRAM architecture is based on the HMC stack [49, 50], the same as , and the bandwidth and access energy are adopted from that work.
Error modeling. We use Spectre SPICE V6.1.3 to extract the noise behavior of the MACCs via circuit simulations. Thermal noise, computational error, and PVT variations are modeled as detailed in Section 5. We implement the extracted hardware error models and the corresponding mathematical models in PyTorch v1.0.1 [76] and integrate them into the Neural Network Distiller v0.3 framework [77] for a fine-tuning pass over the evaluated benchmarks.

8.2 Experimental Results
8.2.1 Comparison with
Iso-power performance and energy comparison. Figure 8 shows the performance and energy benefits of over under the same on-chip power budget. On average, delivers a speedup over . This significant improvement is attributed to the use of wide mixed-signal in , as opposed to the PEs in . The wide bit-partitioned mixed-signal design of enables us to fit 5 more compute units within the same power budget as . The highest speedups are observed for YOLOv3 and PTBRNN, whose network configurations favor the wide vectorized execution in by better utilizing the compute resources. The lowest speedup is observed for ResNet18, since its relatively small size leads to under-utilization of the compute resources in .
Figure 8 also shows the total energy reduction of across the evaluated benchmarks as compared to . On average, yields a energy reduction over , including the energy for DRAM accesses, while consuming the same on-chip power. CIFAR10 enjoys the highest energy reduction, since is able to exploit CIFAR10's smaller memory footprint to maximize on-chip data reuse and reduce DRAM accesses. The lowest energy reduction is observed for the RNN benchmarks, PTBRNN and PTBLSTM, since the matrix-vector operations in these benchmarks require a significant number of memory accesses, diminishing the benefits of mixed-signal computation.
Energy breakdown. Figure 9 shows the energy breakdown normalized to , reported across four major architectural components: (1) on-chip compute units, (2) on-chip memory (buffers and register file), (3) interconnect, and (4) 3D-stacked DRAM. DRAM accesses account for the largest portion of the energy in , since significantly reduces the on-chip compute energy. While has a significantly larger number of compute resources than , the number of DRAM accesses remains almost the same. This is because the statically scheduled bus allows data to be multicast/broadcast across multiple cores in without significantly increasing the number of DRAM accesses. Furthermore, the statically scheduled bus gives the compiler stack the freedom to optimize the partitioning of computations across cores. Most layers in the benchmarks benefit from partitioning the different inputs of a single batch (batch size 16) across cores and broadcasting the weights, an option not explored in ; as a result, these networks incur fewer DRAM accesses. The breakdown of energy consumption varies with the type of computations required by the DNN as well as its degree of data reuse. PTBRNN and PTBLSTM are recurrent neural networks that perform large matrix-vector operations and require significant DRAM accesses for weights; they therefore spend more energy on DRAM accesses than the other benchmarks.
Unlike the fully digital PEs in , which perform a single operation per cycle, uses , which perform wide vectorized operations, crucial in for amortizing the high cost of the ADCs. As shown in Table 1, each MACC operation in consumes 5.4 less energy than in . The output-stationary dataflow enabled by capacitive accumulation, together with the systolic organization of in each core, eliminates the need for register files (unlike ) and leads to a 4.4 reduction in on-chip data movement on average.
Iso-area comparison with . We compare the total runtime and energy of with a scaled-up version of that matches 's area. Figure 10 shows the results for the workloads. Scaling up the compute resources in by 2.25 to match the chip area of results in a sub-linear performance increase of . This improvement in performance comes at the cost of reduced energy efficiency, due to the increase in memory accesses needed to feed the additional compute resources. The trends in speedup and energy reduction remain the same as in the iso-power comparison, with the exception of ResNet18, which now sees resource under-utilization in after the compute resources are scaled up.
8.2.2 Comparison to GPUs
Figure 11 compares the performance of with the Titan Xp and RTX 2080 TI. The RTX 2080 TI is based on Nvidia's latest architecture, Turing. For a fair comparison, we enable vectorized 8-bit operations and optimized GPU compilation. The results are normalized to the Titan Xp. On average, yields a 70% speedup over the Titan Xp and performs 15% slower than the RTX 2080 TI. Convolutional networks require a large number of matrix-matrix multiplications that are well suited to tensor cores, so the RTX 2080 TI outperforms both and the Titan Xp; VGG16 and VGG19 see the largest benefits. However, outperforms the RTX 2080 TI on PTBRNN and PTBLSTM, by 11.2 and 10.6, respectively. These RNNs require matrix-vector multiplications, which are particularly suitable for the wide vectorized operations supported in 's, and not the best case for tensor cores. In terms of performance-per-Watt, outperforms both the Titan Xp and the RTX 2080 TI by a large margin, and , respectively.
8.2.3 Comparison with Other Accelerators
We also compare the power efficiency (GOPS/s/Watt) and area efficiency (GOPS/s/) of with other recent digital and analog accelerators. Due to the lack of raw performance/energy numbers for specific workloads, we use these metrics, which are commensurate with the comparisons made for recent designs [21, 71, 78]. Figure 12 depicts the peak power and area efficiency results.
On average for the evaluated benchmarks, achieves 72% of its peak efficiency. This information is not available in the publications for the other designs.
Digital systolic: Google TPU [26]. Compared with the TPU, which also uses a systolic design, delivers 4.5 higher peak power efficiency and almost the same area efficiency. By leveraging wide, interleaved, and bit-partitioned arithmetic with a switched-capacitor implementation, reduces the cost of MACC operations significantly relative to the TPU's 8-bit digital logic, yielding the large improvement in power efficiency.
Mixed-signal CMOS: RedEye [35]. RedEye is an in-sensor CNN accelerator based on mixed-signal CMOS technology that also uses switched-capacitor circuitry for its MACC operations. Compared to RedEye, offers 5.5 better power efficiency and 167 better area efficiency. The proposed wide, interleaved, and bit-partitioned arithmetic amortizes the cost of the ADC in by reducing its required resolution and sampling rate, significantly curtailing ADC power and area in contrast to RedEye.
Analog memristive designs [32, 71]. Prior work in ISAAC [32] and PipeLayer [71] explores analog memristive technology for DNN acceleration, which integrates compute and storage within the same die and offers higher compute density than traditional analog CMOS technology. However, this increase in compute density comes at the cost of reduced power efficiency. Memristive designs generally perform computations in the current domain, requiring costly ADCs that sample the current-domain signals at the same rate as the memristive compute/storage; PipeLayer significantly reduces this cost. Overall, compared to ISAAC and PipeLayer, improves power efficiency by 3.6 and 9.6, respectively.
8.2.4 Design Space Explorations
Design space exploration for bitpartitioning.
To evaluate the effectiveness of bit-partitioning, we perform a design space exploration over various bit-partitioning options. Figure 13 shows the reduction in energy and area compared to an 8-bit by 8-bit design when two 32-element vectors undergo a dot-product. The other design points also perform 8-bit by 8-bit MACC operations while utilizing our wide and interleaved bit-partitioned arithmetic. As depicted, 2-bit partitioning strikes the best balance of energy and area for the switched-capacitor MACC design at the 45 nm CMOS node. The difference between 2-bit and 1-bit partitioning is that single-bit partitioning quadruples the number of low-bitwidth MACCs, from 16 (2-bit partitioning) to 64 (1-bit partitioning), to support 8-bit operations. This imposes a disproportionate overhead that outweighs the benefit of decreasing each MACC unit's area and energy.
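The quadratic growth in the number of low-bitwidth MACCs can be checked with a one-line model: splitting both 8-bit operands into `part_bits`-wide slices requires every slice of one operand to meet every slice of the other.

```python
def partitioned_maccs(op_bits, part_bits):
    """Number of low-bitwidth MACCs needed for one op_bits x op_bits MACC
    when both operands are split into part_bits-wide slices: every slice
    of one operand multiplies every slice of the other."""
    slices = op_bits // part_bits
    return slices * slices

counts = {b: partitioned_maccs(8, b) for b in (4, 2, 1)}
```

Halving the partition width doubles the slice count and therefore quadruples the MACC count, which is the overhead that makes 1-bit partitioning lose to 2-bit.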
Design space exploration for configuration.
The number of accumulation cycles () before A/D conversion and the number of MACC units () are the two main parameters of ; they define the resolution and sample rate of the ADC, which determine its power. Figure 14 shows the design space exploration over different configurations of the . Under a fixed power budget of W for the compute units, we measure the total runtime and energy of over the evaluated workloads, normalized to . As Figure 14 shows, increasing the number of MACCs limits the number of accumulation cycles, which in turn requires ADCs with high sample rates. High-sample-rate ADCs significantly increase power, making the design less efficient. Conversely, increasing the number of accumulation cycles limits the number of MACCs, which restricts the number of that can be integrated into the design under the given power budget. Overall, the optimal design point, delivering the best performance and energy, uses eight MACC units and 32 accumulation cycles.
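A back-of-envelope model of this trade-off follows; the formulas are our assumptions (resolution growing with the log of the number of accumulated partial products, and one conversion per accumulation window), not the paper's exact sizing:

```python
import math

def adc_requirements(n_macc, n_cycles, op_rate_hz, part_bits=2):
    """Rough ADC requirements for one wide accumulation unit.

    Each product of two part_bits-wide slices spans 2*part_bits bits;
    summing n_macc * n_cycles of them adds log2 of that count to the
    required resolution. Accumulating longer in the charge domain lowers
    the conversion rate but raises the resolution.
    """
    resolution = 2 * part_bits + math.ceil(math.log2(n_macc * n_cycles))
    sample_rate = op_rate_hz / n_cycles  # one conversion per window
    return resolution, sample_rate

# Hypothetical operating point echoing the optimum cited above.
res, rate = adc_requirements(n_macc=8, n_cycles=32, op_rate_hz=500e6)
```

The model shows the tension the paper describes: fewer cycles means a faster (power-hungry) ADC, while more cycles means a higher-resolution one.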
Design space exploration for clustered architecture.
uses a hierarchical architecture with multiple cores in each vault. A larger number of small cores per vault yields higher utilization of compute resources but requires data transfer across cores. We explore the design space with 1, 2, 4, and 8 cores per cluster. As Figure 15 shows, four cores per vault (the default configuration in ) strikes the best balance between speedup and energy reduction. Performance increases as the number of cores per vault grows from 1 to 8; however, the 8-core configuration incurs a higher number of data accesses. The 4-core design point therefore provides the optimal balance.
8.2.5 Evaluation of Circuitry NonIdealities
Table 4 shows the top-1 accuracy considering the non-idealities, the accuracy after fine-tuning, the ideal accuracy, and the final loss in accuracy.
DNN Model  Dataset  Accuracy with Non-Idealities  Accuracy after Fine-Tuning  Ideal Accuracy  Final Accuracy Loss 

AlexNet  Imagenet  53.12%  56.64%  57.11%  0.47% 
CIFAR10  CIFAR10  90.82%  91.01%  91.03%  0.02% 
GoogLeNet  Imagenet  67.15%  68.39%  68.72%  0.33% 
ResNet18  Imagenet  66.91%  68.96%  68.98%  0.02% 
ResNet50  Imagenet  74.5%  75.21%  75.25%  0.04% 
VGG16  Imagenet  70.31%  71.28%  71.46%  0.18% 
VGG19  Imagenet  73.24%  74.20%  74.52%  0.32% 
YOLOv3  Imagenet  75.92%  77.1%  77.22%  0.21% 
PTBRNN  Penn TreeBank  1.6 BPC  1.1 BPC  1.1 BPC  0.0 BPC 
PTBLSTM  Penn TreeBank  170 PPW  97 PPW  97 PPW  0.0 PPW 
As Table 4 shows, some of the networks, namely AlexNet and ResNet18, are more sensitive to the non-idealities, leading to a higher initial accuracy degradation. To recover the accuracy loss due to the circuitry non-idealities, we perform a fine-tuning step for a few epochs. With this fine-tuning step, the accuracy loss of the CIFAR10, ResNet18, and ResNet50 networks is fully recovered (the loss is less than 0.04%); among these, CIFAR10 and ResNet50 are the most robust to non-idealities. The accuracy loss for the other networks is below 0.5%, with AlexNet incurring the maximum loss. The final two networks, PTBRNN and PTBLSTM, perform character-level and word-level language modeling, respectively; their accuracy is measured in Bits-Per-Character (BPC) and Perplexity-per-Word (PPW), respectively. Both PTBRNN and PTBLSTM recover all of the loss after fine-tuning. The final results after the fine-tuning step show the effectiveness of this approach in recovering the accuracy loss due to the non-idealities inherent in analog computation.
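A minimal stand-in for the noise-aware fine-tuning pass can illustrate why training with the hardware noise injected in the forward pass recovers accuracy; a linear model replaces the full Distiller setup, and the noise level and learning rate are illustrative:

```python
import numpy as np

def finetune_with_noise(w, x, y, sigma, lr=0.05, epochs=200, seed=0):
    """Toy noise-aware fine-tuning: a linear model is trained with Gaussian
    noise injected into its forward pass, so the weights adapt to the noisy
    compute. A minimal stand-in for the paper's Distiller-based pass."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        noise = rng.normal(0.0, sigma, size=y.shape)
        pred = x @ w + noise               # noisy forward pass
        grad = x.T @ (pred - y) / len(y)   # mean-squared-error gradient
        w = w - lr * grad
    return w

x = np.random.default_rng(1).normal(size=(64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ w_true
w = finetune_with_noise(np.zeros(4), x, y, sigma=0.1)
```

Because the injected noise is zero-mean, its gradient contribution averages out over epochs, and the model converges close to the noiseless solution.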
9 Related Work
There is a large body of work on digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration has also been explored previously for neural networks [34, 40] and is gaining traction again for deep models [32, 33, 35, 36, 37, 38, 39, 41, 42]. This paper fundamentally differs from these inspiring efforts in that it delves into the mathematics of the basic operations in DNNs and reformulates them into the wide, interleaved, and bit-partitioned approach to overcome the challenges of mixed-signal acceleration. By partitioning and re-aggregating the low-bitwidth MACC operations, this paper addresses the limited range of encoding and reduces the cost of cross-domain conversions. Additionally, it combines the proposed mathematical reformulation with switched-capacitor circuitry to share and delay A/D conversions, which amortizes their cost and reduces their rate, respectively. Below, we discuss the most related works.
Switched-capacitor design. Switched-capacitor circuits [43] have a long history, having mainly been used for designing amplifiers [79], A/D and D/A converters [80], and filters [81]. Like resistive circuits, they were used even for the previous generation of neural networks [34]. More recently, they have also been used for matrix multiplication [82, 42], which can benefit DNNs. This work takes inspiration from these efforts but differs from them in that it defines and leverages a wide, interleaved, and bit-partitioned reformulation of DNN operations. Additionally, it offers a comprehensive architecture that can accelerate a wide variety of DNNs.
Programmable mixed-signal accelerators. PROMISE [33] offers a mixed-signal architecture that integrates analog units within the SRAM memory blocks. RedEye [35] is a low-power near-sensor mixed-signal accelerator that uses charge-domain computation. These works do not offer the wide interleaving of bit-partitioned basic operations described in this paper.
Fixed-function mixed-signal accelerators. These designs target a specific DNN. Some focus on handwritten digit classification [82, 83] or binarized mixed-signal acceleration for CIFAR10 images [38]; another focuses on accelerating spiking neural networks [39]. In contrast, our design is programmable and supports interleaved bit-partitioning.

Resistive memory accelerators. There is a large body of work using resistive memory [32, 71, 78, 84, 85, 86, 87, 88]. We provided a direct comparison to ISAAC [32] and PipeLayer [71]. ISAAC [32] most notably introduces the concept of temporally bit-serial operations, also explored in PRIME [44], which PipeLayer [71] augments with a spike-based data scheme. , in contrast, formulates a partitioning that spatially groups lower-bitwidth MACCs across different vector elements and performs them in parallel. PRIME does not provide absolute measurements, and its simulated baseline is not available for a head-to-head comparison. PRIME also uses multiple truncations that change the mathematics; conversely, our formulation induces neither truncation nor mathematical changes.
10 Conclusion
This work proposes wide, interleaved, and bit-partitioned arithmetic to overcome two key challenges in mixed-signal acceleration of DNNs: the limited range of encoding and costly A/D conversions. This bit-partitioned arithmetic enables rearranging the highly parallel MACC operations of modern DNNs into wide, low-bitwidth computations that map efficiently to mixed-signal units. Further, these units operate in the charge domain using switched-capacitor circuitry and reduce the rate of A/D conversions by accumulating partial results in the charge domain. The resulting microarchitecture, named , offers significant benefits over its state-of-the-art analog and digital counterparts. These encouraging results suggest that combining mathematical insights with architectural innovations can open new avenues in DNN acceleration.
References
 Niehues et al. [2018] J. Niehues, N.Q. Pham, T.L. Ha, M. Sperber, and A. Waibel. LowLatency Neural Speech Translation. ArXiv eprints, August 2018.
 Mo and Sattar [2018] J. Mo and J. Sattar. SafeDrive: Enhancing Lane Appearance for Autonomous and Assisted Driving Under Limited Visibility. ArXiv eprints, July 2018.
 Li et al. [2018] R. Li, Y. Shu, J. Su, H. Feng, and J. Wang. Using deep Residual Network to search for galaxyLyalpha emitter lens candidates based on spectroscopicselection. ArXiv eprints, July 2018.
 Rohde et al. [2018] D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. ArXiv eprints, August 2018.
 Grabec et al. [2018] I. Grabec, E. Švegl, and M. Sok. Development of a sensory-neural network for medical diagnosing. ArXiv eprints, July 2018.
 Esmaeilzadeh et al. [2011] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In ISCA, 2011.
 Hardavellas et al. [2011] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July–Aug. 2011.
 Venkatesh et al. [2010] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose LugoMartinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.
 Zhang et al. [2015a] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In FPGA, 2015a.
 Esmaeilzadeh et al. [2013] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. Commun. ACM, 2013.
 Chen et al. [2014a] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. DaDianNao: A machine-learning supercomputer. In MICRO, 2014a.
 Gao et al. [2017a] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. TETRIS: Scalable and efficient neural network acceleration with 3D memory. In ASPLOS, 2017a.
 Delmas et al. [2017] Alberto Delmas, Sayeh Sharify, Patrick Judd, and Andreas Moshovos. Tartan: Accelerating fullyconnected and convolutional layers in deep learning networks by exploiting numerical precision variability. arXiv, 2017.
 Mahajan et al. [2016] Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, Amir Yazdanbakhsh, Joon Kim, and Hadi Esmaeilzadeh. TABLA: A unified templatebased framework for accelerating statistical machine learning. In HPCA, 2016.
 Zhang et al. [2016] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambriconx: An accelerator for sparse neural networks. In MICRO, 2016.
 Albericio et al. [2016] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Cnvlutin: ineffectualneuronfree deep neural network computing. In ISCA, 2016.
 Judd et al. [2016] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M Aamodt, and Andreas Moshovos. Stripes: Bitserial deep neural network computing. In MICRO, 2016.
 Sharma et al. [2016] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kim, Chenkai Shao, Asit Misra, and Hadi Esmaeilzadeh. From highlevel deep neural models to fpgas. In MICRO, 2016.
 Chung et al. [2017] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Christian Boehn, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Steve Reinhardt, Adam Sapek, Raja Seera, Balaji Sridharan, Lisa Woods, Phillip YiXiao, Ritchie Zhao, and Doug Burger. Accelerating persistent neural networks at datacenter scale. In HotChips, 2017.
 Parashar et al. [2017] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
 Andri et al. [2016] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights. arXiv, 2016.
 Han et al. [2016] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016.
 Chen et al. [2016] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ISCA, 2016.
 Chen et al. [2017a] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC, 2017a.
 Kim et al. [2016] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory. In ISCA, 2016.
 Jouppi et al. [2017] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
 Chen et al. [2014b] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS, 2014b.
 Sharma et al. [2018] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
 Akhlaghi et al. [2018] Vahideh Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi, Hadi Esmaeilzadeh, and Rajesh K. Gupta. SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks. In ISCA, 2018.
 Hegde et al. [2018] Kartik Hegde, Jiyong Yu, Rohit Agrawal, Mengjia Yan, Michael Pellauer, and Christopher W Fletcher. UCNN: Exploiting computational reuse in deep neural networks via weight repetition. arXiv, 2018.
 Lee et al. [2018] Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo. UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision. In ISSCC, 2018.
 Shafiee et al. [2016] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R Stanley Williams, and Vivek Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In ISCA, 2016.
 Srivastava et al. [2018] Prakalp Srivastava, Mingu Kang, Sujan K Gonugondla, Sungmin Lim, Jungwook Choi, Vikram Adve, Nam Sung Kim, and Naresh Shanbhag. PROMISE: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms. In ISCA, 2018.
 Tsividis and Anastassiou [1987] YP Tsividis and D Anastassiou. Switched-capacitor neural networks. Electronics Letters, 23(18):958–959, 1987.
 LiKamWa et al. [2016] Robert LiKamWa, Yunhui Hou, Julian Gao, Mia Polansky, and Lin Zhong. RedEye: Analog ConvNet image sensor architecture for continuous mobile vision. In ISCA, 2016.
 Bankman and Murmann [2015] Daniel Bankman and Boris Murmann. Passive charge redistribution digital-to-analogue multiplier. Electronics Letters, 51(5):386–388, 2015.
 Lee and Wong [2017a] E. H. Lee and S. S. Wong. Analysis and design of a passive switched-capacitor matrix multiplier for approximate computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, 2017a.
 Bankman et al. [2018] Daniel Bankman, Lita Yang, Bert Moons, Marian Verhelst, and Boris Murmann. An always-on 3.8 μJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28-nm CMOS. In ISSCC, 2018.
 Buhler et al. [2017] Fred N Buhler, Peter Brown, Jiabo Li, Thomas Chen, Zhengya Zhang, and Michael P Flynn. A 3.43 TOPS/W 48.9 pJ/pixel 50.1 nJ/classification 512-analog-neuron sparse coding neural network with on-chip learning and classification in 40-nm CMOS. In Symposium on VLSI Circuits, 2017.
 St. Amant et al. [2014] Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. General-purpose code acceleration with limited-precision analog computation. In ISCA, 2014.
 Zhang et al. [2015b] Jintao Zhang, Zhuo Wang, and Naveen Verma. 18.4 A matrix-multiplying ADC implementing a machine-learning classifier directly with data conversion. In ISSCC, 2015b.
 Lee and Wong [2017b] Edward H Lee and S Simon Wong. Analysis and design of a passive switched-capacitor matrix multiplier for approximate computing. IEEE Journal of Solid-State Circuits, 52(1):261–271, 2017b.
 Gray et al. [2001] Paul R Gray, Paul Hurst, Robert G Meyer, and Stephen Lewis. Analysis and design of analog integrated circuits. Wiley, 2001.
 Chi et al. [2016a] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In ISCA, 2016a.
 Sharify et al. [2017] Sayeh Sharify, Alberto Delmas Lascorz, Patrick Judd, and Andreas Moshovos. Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks. arXiv, 2017.
 Gao et al. [2017b] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. TETRIS: Scalable and efficient neural network acceleration with 3D memory. https://github.com/stanford-mast/nn_dataflow, 2017b.
 Li and Pedram [2017] Yuanfang Li and Ardavan Pedram. Caterpillar: Coarse grain reconfigurable architecture for accelerating the training of deep neural networks. In ASAP, 2017.
 Upadhyay and Roy Chowdhury [2015] Himani Upadhyay and Shubhajit Roy Chowdhury. A high speed and low power 8-bit × 8-bit multiplier design using novel two-transistor (2T) XOR gates. Journal of Low Power Electronics, 2015.
 Consortium et al. [2013] Hybrid Memory Cube Consortium et al. Hybrid memory cube specification 1.0. Last Revision Jan, 2013.
 Jeddeloh and Keeth [2012] Joe Jeddeloh and Brent Keeth. Hybrid memory cube: New DRAM architecture increases density and performance. In Symposium on VLSI Technology, 2012.
 Yazdanbakhsh et al. [2018] Amir Yazdanbakhsh, Hajar Falahati, Philip J. Wolfe, Kambiz Samadi, Hadi Esmaeilzadeh, and Nam Sung Kim. GANAX: A unified SIMD-MIMD acceleration for generative adversarial networks. In ISCA, 2018.
 Ismail and Fiez [1994] Mohammed Ismail and Terri Fiez. Analog VLSI: Signal and information processing, volume 166. McGraw-Hill, New York, 1994.
 Tripathi and Murmann [2014] Vaibhav Tripathi and Boris Murmann. Mismatch characterization of small metal fringe capacitors. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(8):2236–2242, 2014.
 Eckert et al. [2014] Yasuko Eckert, Nuwan Jayasena, and Gabriel H Loh. Thermal feasibility of die-stacked processing in memory. 2014.
 Facebook AI Research. Caffe2. URL https://caffe2.ai/.
 Krizhevsky [2014] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv, 2014.
 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. URL http://imagenet.org/.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
 Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv, 2016.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv, 2018.
 Marcus et al. [1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 1993.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
 Zhou et al. [2016] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016.
 Mishra et al. [2017] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. WRPN: Wide reduced-precision networks. arXiv, 2017.
 Li et al. [2016] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv, 2016.
 Zhang et al. [2018] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. arXiv, 2018.
 NVIDIA. TensorRT 5.1. URL https://developer.nvidia.com/tensorrt.
 Song et al. [2017] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In HPCA, 2017.
 NCSU [2018] NCSU. Freepdk45, 2018. URL https://www.eda.ncsu.edu/wiki/FreePDK45.
 Murmann [2016] B. Murmann. ADC Performance Survey 1997–2016. [Online]. URL http://web.stanford.edu/~murmann/adcsurvey.html.
 Harpe [2018] Pieter Harpe. A 0.0013 mm² 10b 10 MS/s SAR ADC with a 0.0048 mm² 42-dB-rejection passive FIR filter. In CICC, 2018.
 Li et al. [2011] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi. CACTIP: Architecturelevel Modeling for SRAMbased Structures with Advanced Leakage Reduction Techniques. In ICCAD, 2011.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 Zmora et al. [2018] Neta Zmora, Guy Jacob, and Gal Novik. Neural network distiller, June 2018. URL https://doi.org/10.5281/zenodo.1297430.
 Long et al. [2018] Yun Long, Taesik Na, and Saibal Mukhopadhyay. ReRAM-based processing-in-memory architecture for recurrent neural network acceleration. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (99):1–14, 2018.
 Crols and Steyaert [1994] Jan Crols and Michel Steyaert. Switched-opamp: An approach to realize full CMOS switched-capacitor circuits at very low power supply voltages. IEEE Journal of Solid-State Circuits, 29(8):936–942, 1994.
 Fiorenza et al. [2006] John K Fiorenza, Todd Sepke, Peter Holloway, Charles G Sodini, and Hae-Seung Lee. Comparator-based switched-capacitor circuits for scaled CMOS technologies. IEEE Journal of Solid-State Circuits, 41(12):2658–2668, 2006.
 Brodersen et al. [1979] Robert W Brodersen, Paul R Gray, and David A Hodges. MOS switched-capacitor filters. Proceedings of the IEEE, 67(1):61–75, 1979.
 Bankman and Murmann [2016] Daniel Bankman and Boris Murmann. An 8-bit, 16-input, 3.2 pJ/op switched-capacitor dot product circuit in 28-nm FDSOI CMOS. In ASSCC, 2016.
 Miyashita et al. [2017] Daisuke Miyashita, Shouhei Kousai, Tomoya Suzuki, and Jun Deguchi. A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing. IEEE Journal of Solid-State Circuits, 52(10):2679–2689, 2017.
 Qiao et al. [2018] Ximing Qiao, Xiong Cao, Huanrui Yang, Linghao Song, and Hai Li. AtomLayer: A universal ReRAM-based CNN accelerator with atomic layer computation. In DAC, 2018.
 Ji et al. [2018] Houxiang Ji, Linghao Song, Li Jiang, Hai Helen Li, and Yiran Chen. ReCom: An efficient resistive accelerator for compressed deep neural networks. In DATE, 2018.
 Li et al. [2018] Bing Li, Linghao Song, Fan Chen, Xuehai Qian, Yiran Chen, and Hai Helen Li. ReRAM-based accelerator for deep learning. In DATE, 2018.
 Chen et al. [2017b] Lerong Chen, Jiawen Li, Yiran Chen, Qiuping Deng, Jiyuan Shen, Xiaoyao Liang, and Li Jiang. Accelerator-friendly neural-network training: Learning variations and defects in RRAM crossbar. In DATE, 2017b.
 Chi et al. [2016b] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In ISCA, 2016b.