I Introduction
Advances in high-performance computer architecture design have been a major driver for the rapid evolution of Deep Neural Networks (DNNs). Due to the insatiable compute demands of DNNs, both the research community [1, 3, 4, 5, 2, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28] and industry [29, 30, 31] have naturally turned to accelerators to accommodate modern DNN computation. However, the algorithmic properties of DNNs have not been fully utilized to push the envelope on their acceleration efficiency and performance.
To that end, we leverage the following three algorithmic properties of DNNs to introduce a novel acceleration architecture, called Bit Fusion. (1) DNNs are mostly a collection of massively parallel multiply-adds. (2) The bitwidth of these operations can be reduced with no loss in accuracy [32, 33, 34, 35, 36]. (3) However, to preserve accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator design would either yield limited benefits by accommodating the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, Bit Fusion introduces the concept of runtime bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing a bit-flexible accelerator, which comprises an array of processing engines that fuse at the bit-level granularity to match the bitwidth of the individual DNN layers.
The bit-level flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. As such, the following three insights both motivate and guide Bit Fusion.
First, the number of bit-level operations required for the multiply operator is proportional to the product of the operands' bitwidths, and scales linearly with bitwidth for the addition operator. Therefore, matching the bitwidth of the multiply-add units to the reduced bitwidth of the DNN layers almost quadratically reduces the bit-level computations. This strategy significantly affects the acceleration, since the large majority of DNN operations (~99%) are multiply-adds, as shown in the table included in Figure 1. For instance, each single image classification with AlexNet [36] requires a total of 2682 million operations, of which 99.86% (2678 million) are multiply-adds. To this end, the compute units of Bit Fusion can dynamically fuse or decompose to match the bitwidth of each individual multiply-add operand, without requiring the operands to be encoded in the same bitwidth.
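The near-quadratic effect can be sketched with a simple, illustrative cost model (the function and its constants are our own simplification, not a measured result):

```python
def bitlevel_cost(w_bits: int, i_bits: int) -> int:
    """Illustrative bit-level work for one multiply-add: the multiply
    costs about w_bits * i_bits single-bit operations, while the add
    scales linearly in the operand widths."""
    return w_bits * i_bits + (w_bits + i_bits)

# Matching the datapath to reduced bitwidths shrinks the dominant
# multiply term almost quadratically: going from 8x8 to 2x2 bits cuts
# the multiply work by 16x and the total by 10x in this toy model.
print(bitlevel_cost(8, 8))  # 80
print(bitlevel_cost(2, 2))  # 8
```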
Second, energy consumption for DNN acceleration is usually dominated by data accesses to on-chip storage and off-chip memory [4, 1, 3]. Therefore, Bit Fusion comes with encoding and memory access logic that stores and retrieves values in the lowest required bitwidth. This logic reduces the overall number of bits read from or written to on-chip and off-chip memory, proportionally reducing the energy dissipation of memory accesses. Furthermore, this strategy increases the effective on-chip storage capacity.
Third, Bit Fusion builds upon the extensive prior work showing that DNNs can operate with reduced bitwidth without degradation in classification accuracy [32, 35, 34, 33, 37, 2]. This opportunity exists across different classes of real-world DNNs, as shown in Figure 1. One category is Convolutional Neural Networks (CNNs), which usually use convolution and pooling layers followed by a stack of fully-connected layers. AlexNet, Cifar10, LeNet5, ResNet18, SVHN, and VGG7 in Figure 1 belong to this category. Recurrent Neural Networks (RNNs) are another subclass of DNNs, which use recurrent layers, including Long Short-Term Memory (LSTM) and vanilla RNN layers, to extract temporal features from time-varying data. The RNN and LSTM benchmark DNNs in Figure 1 represent these categories. Furthermore, as the table in Figure 1 shows, most operations in DNNs (~99%), regardless of their category, are multiply-adds. As Figure 1(a) illustrates, on average, 97.3% of multiply-adds require four or fewer bits, and in some DNNs a large fraction of the operations can even be done with a bitwidth of one. More interestingly, the bitwidths vary within and across DNNs to guarantee no loss of accuracy. Such variation is not limited to the intermediate operands and exists in trained weights as well, as illustrated in Figure 1(b). To exploit this property, a programmable accelerator needs to offer bit-level flexibility at runtime, which leads us to Bit Fusion.
To harvest the aforementioned opportunities, this paper makes the following contributions and realizes a new dimension in the design of DNN accelerators.
Dynamic bit-level fusion and decomposition. The paper introduces and explores the dimension of bit-level flexible DNN accelerator architectures, realized as Bit Fusion, which dynamically matches bit-level composable processing engines to the varying bitwidths required by DNN layers. By offering this flexibility, Bit Fusion aims to minimize the computation and communication required by a DNN at the bit granularity on a per-layer basis.
Microarchitecture design for bit-level composability. To explore this dimension, we design and implement a DNN accelerator using a novel bit-flexible computation unit, called BitBrick. The accelerator supports both feedforward (CNN) and recurrent (LSTM and RNN) layers. A 2D array of BitBricks constructs a fusible processing engine that can perform the DNN computation at various bitwidths. The microarchitecture also comes with storage logic that allows feeding the BitBricks with operands of different bitwidths.
Hardware-software abstractions for bit-flexible acceleration. To enable DNN applications to take advantage of these unique bit-level fusion capabilities, we propose a block-structured instruction set architecture (ISA). To amortize the cost of programmability, the ISA expresses operations of DNN layers as bit-flexible instruction blocks with iterative semantics.
These three contributions define the novel architecture of Bit Fusion, a possible microarchitecture implementation, and the hardware-software abstractions that offer bit-level flexibility. Other complementary and inspiring works have explored bit-serial computation [2, 6] without exploring the fusion dimension. In contrast, Bit Fusion spatially fuses a group of BitBricks together to collectively execute operations at different bitwidths. Using eight real-world feedforward and recurrent DNNs, we evaluate the benefits of Bit Fusion. We implemented the proposed microarchitecture in Verilog and synthesized it in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss [1] and Stripes [2]; the latter is an optimized bit-serial architecture. In the same area, frequency, and technology node, Bit Fusion offers significant speedup and energy savings over Eyeriss. Compared to Stripes [2], Bit Fusion provides further speedup and energy reduction at the 45 nm node when area and frequency are set to those of Stripes. Scaling to the 16 nm GPU technology node, Bit Fusion provides a speedup over the Jetson TX2 mobile GPU. Further, Bit Fusion almost matches the performance of a 250-Watt GPU, which uses reduced-bitwidth vector instructions, while merely consuming 895 milliwatts of power.
II Architecture
To minimize the computation and communication at the finest granularity, Bit Fusion dynamically matches the architecture of the accelerator to the bitwidth required by the DNN, which may vary layer by layer, without any loss in accuracy. As such, Bit Fusion is a collection of bit-level computational elements, called BitBricks, that dynamically compose to logically construct Fused Processing Engines (Fused-PEs) that execute DNN operations with the required bitwidth. Specifically, Fused-PEs provide bit-level flexibility for multiply-adds, which are the dominant operations across all types of DNNs. Below, we discuss how BitBricks can be dynamically fused together to support a range of bitwidths, yet provide a significant increase in parallelism when operating at lower bitwidths.
II-A Bit-Level Flexibility via Dynamic Fusion
As depicted in Figure 2, Bit Fusion arranges the BitBricks in a 2-dimensional physical grouping. Each BitBrick in a group can perform individual binary (0, +1) and ternary (−1, 0, +1) multiply-add operations. As Figure 2 shows, the BitBricks logically fuse together at runtime to form Fused Processing Engines (Fused-PEs) that match the bitwidths required by the multiply-add operations of a DNN layer. The BitBricks in a Fused-PE multiply an incoming variable-bitwidth input (input forward) with a variable-bitwidth weight (from WBUF) to generate the product. The Fused-PE then adds the product to an incoming partial sum to generate an outgoing partial sum (Psum forward in Figure 2(a)).
Figures 2(b), 2(c), and 2(d) show three different ways of logically fusing BitBricks to form (b) 16 Fused-PEs that support ternary (or binary) operations; (c) four Fused-PEs that support mixed bitwidths (2 bits for weights and 8 bits for inputs); and (d) one Fused-PE that supports 8-bit operands, respectively. For binary or ternary operations (Figure 2(b)), each Fused-PE contains a single BitBrick, offering the highest parallelism. The group then adds the results from all Fused-PEs and the incoming partial sum to generate a single outgoing partial sum. Figure 2(c) shows four BitBricks fused together in a column to form a Fused-PE that can multiply 2-bit weights with 8-bit inputs. The bitwidths of operands supported by a Fused-PE depend on the spatial arrangement of the BitBricks fused together. Alternatively, by varying the spatial arrangement of the four fused BitBricks, the Fused-PE can support 8-bit/2-bit, 4-bit/4-bit, and 2-bit/8-bit configurations for inputs/weights. Finally, up to 16 BitBricks can fuse together to construct a single Fused-PE that can operate on 8-bit operands for the multiply-add operations (Figure 2(d)). The BitBricks fuse together in powers of 2. That is, a single group of 16 BitBricks can offer 1, 2, 4, 8, or 16 Fused-PEs with varying operand bitwidths. Dynamic composability of the BitBricks at the bit level enables the architecture to expose the maximum possible level of parallelism with the finest granularity that matches the bitwidth of the DNN operands.
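The relationship between operand bitwidths and the number of Fused-PEs that a fixed pool of BitBricks can offer follows directly from the decomposition rule. The helpers below are a sketch under the assumption that each BitBrick handles one 2-bit × 2-bit multiply:

```python
def bricks_needed(w_bits: int, i_bits: int) -> int:
    # Each BitBrick multiplies 2-bit by 2-bit operands, so a
    # w_bits x i_bits multiply decomposes into (w/2) * (i/2) bricks.
    return (w_bits // 2) * (i_bits // 2)

def num_fused_pes(total_bricks: int, w_bits: int, i_bits: int) -> int:
    # Number of Fused-PEs a group of BitBricks can logically form.
    return total_bricks // bricks_needed(w_bits, i_bits)

# A group of 16 BitBricks: 16 ternary PEs, four 2-bit x 8-bit PEs,
# or a single 8-bit x 8-bit PE, matching Figures 2(b)-(d).
print(num_fused_pes(16, 2, 2))  # 16
print(num_fused_pes(16, 2, 8))  # 4
print(num_fused_pes(16, 8, 8))  # 1
```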
II-B Accelerator Organization
Two insights guide the architecture design of Bit Fusion. First, DNNs offer high degrees of parallelism and benefit significantly from increasing the number of Fused-PEs available within the accelerator's area budget. Therefore, it is essential to minimize the overhead of control in the accelerator by not only maximizing the number of BitBricks but also minimizing the overhead of dynamically constructing Fused-PEs, thereby integrating the maximum number of BitBricks in the area budget. Second, on-chip SRAM and register-file accesses dominate the energy consumption when accelerating DNNs [4, 1, 3]. Therefore, it is essential to reduce the number of bits exchanged with on-chip and off-chip memory while maximizing data reuse.
Systolic array. With these insights, we employ a 2-dimensional systolic array of BitBricks as the architecture for Bit Fusion, as shown in Figure 3. The systolic organization reduces the overhead of control by sharing the control logic across the entire systolic array. More importantly, systolic execution alleviates the need for provisioning control for each Fused-PE, as a dataflow architecture would have required. As such, the systolic architecture fits the largest number of BitBricks in a given area budget. Thus, the entire systolic array of BitBricks acts as a single compute unit that can execute, for example, a single matrix-vector multiplication operation with various bitwidths, which also set the level of parallelism. In addition, the systolic organization enforces sharing of input data across columns of the array and accumulates partial results across rows of the array to minimize access to on-chip memory. As depicted in Figure 3, the input buffers (IBUFs) are located only at the borders and feed the rows simultaneously. Similarly, the output buffers (OBUFs) reside at the bottom and collect the flowing results, which are accumulated by each column's accumulator. As shown in Figure 3, each column harbors a pooling and an activation unit before its output buffer. Finally, the systolic organization also eliminates the need for local buffers for input, output, or partial results within Fused-PEs. As such, each Fused-PE is accompanied by only a weight buffer (WBUF). Using BitBricks as the building blocks, the performance of the systolic array maximally matches the bitwidths, with the highest performance at binary and ternary settings.
Memory organization. Depending on the number of Fused-PEs and their organization, the buffers must supply different numbers of operands with various bitwidths. As such, we augment the input and the weight buffers with a register that holds a row of data that is gradually fed to the Fused-PEs according to their bitwidth. As illustrated in Figure 3, a series of multiplexers after the register makes this data infusion possible. The benefit of this design is avoiding multiple accesses to the data array of the buffer, which conserves energy. With this design, at each cycle, the systolic array consumes a vector of inputs and a matrix of weights to produce a vector of outputs with the fewest accesses to the buffers and the minimal bitwidth possible.
II-C Execution Model
Figure 4 illustrates the systolic execution in the mixed-bitwidth mode when an input vector is multiplied by a weight matrix. The input vector has 8-bit elements that are multiplied by a matrix with 2-bit elements. As such, the 16 BitBricks in each group logically compose to form four 8-bit/2-bit Fused-PEs. Both input and weight buffers provide 32 bits per access. The read values are split into 8-bit input values and 2-bit weight values in the output register of each buffer using its accompanying multiplexers, as mentioned before. The input values are shared across the Fused-PEs of each row, and the weight values are specific to each Fused-PE. As such, all of the Fused-PEs work in parallel while only a single 32-bit value is read from each of the input and weight buffers. Exploiting the lower bitwidth of weights, Bit Fusion increases the level of parallelism by 4× while reducing the number of accesses to the weight buffer data arrays by the same factor of four. As discussed above, each Fused-PE adds the result of its multiplication to its incoming partial sum and forwards the outgoing partial sum to the Fused-PE beneath it. As shown in Figure 4, we support a 32-bit bitwidth for the partial and final results to avoid any inaccuracies.
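The mixed-bitwidth systolic step can be sketched as a functional model in plain Python (the hardware performs the per-row additions concurrently as partial sums flow down the columns; the function name is ours):

```python
def systolic_matvec(x, W):
    """Functional sketch of mixed-bitwidth systolic execution: an
    8-bit input vector x times a 2-bit weight matrix W. Each row's
    Fused-PEs add their products to the incoming partial sums and
    forward them downward; sums are kept wide (32-bit in hardware)
    to avoid any loss of accuracy."""
    cols = len(W[0])
    psums = [0] * cols              # one running partial sum per column
    for r, xr in enumerate(x):      # input value shared across row r
        for c in range(cols):
            psums[c] += xr * W[r][c]
    return psums

print(systolic_matvec([255, 1], [[1, -2], [3, 0]]))  # [258, -510]
```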
III Bit Fusion Micro-Architecture
Given the overall organization of Bit Fusion and its bit-flexible systolic execution model, this section delves into the details of BitBricks and Fused-PEs. The key insight that enables bit-level dynamic composability is the mathematical property that a multiply operation between operands with power-of-2 bitwidths (2-bit, 4-bit, 8-bit, and so on) can be decomposed into 2-bit multiplications. The products from the decomposed multiplications can then be put together by shift-add operations to generate the result of the original multiplication. The bitwidths of the operands dictate the number of decomposed multiplications required and the shift amounts that are applied to the decomposed products before addition. Using this insight, we design BitBrick, the basic compute unit of the architecture, to support multiply operations for the smallest bitwidth of 2 bits. The 2-bit operands of a BitBrick can be either signed or unsigned. Below, we describe the design of a single BitBrick.
III-A BitBrick Microarchitecture
Figure 5 shows the microarchitecture of a single BitBrick. As shown, a BitBrick takes as input two 2-bit operands and two corresponding sign bits. The sign bits define whether each 2-bit operand is signed (between −2 and +1) or unsigned (between 0 and 3). According to the sign bits, the BitBrick first extends the 2-bit operands to create 3-bit sign-extended operands. Finally, the BitBrick employs a 3-bit signed multiplier (shown encircled in Figure 5) to generate a 6-bit product. Thus, a BitBrick supports both signed and unsigned numbers as its inputs. The following subsection discusses how Bit Fusion maps multiply-add operations with varying bitwidths to BitBricks.
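A BitBrick's behavior can be modeled functionally as follows (a sketch; the signal names are ours, not the paper's):

```python
def bitbrick(x: int, sx: int, y: int, sy: int) -> int:
    """Functional model of a BitBrick: 2-bit operands x and y with
    sign bits sx and sy. A signed 2-bit operand is sign-extended to
    3 bits (range -2..+1); an unsigned one covers 0..3. A 3-bit
    signed multiply then yields the product, which fits in 6 bits."""
    assert 0 <= x < 4 and 0 <= y < 4
    xe = x - 4 if (sx and x >= 2) else x  # 3-bit sign extension of x
    ye = y - 4 if (sy and y >= 2) else y  # 3-bit sign extension of y
    return xe * ye

print(bitbrick(3, 0, 3, 0))  # unsigned: 3 * 3 = 9
print(bitbrick(3, 1, 2, 1))  # signed: (-1) * (-2) = 2
```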
III-B Mapping Variable-Bitwidth Operations to BitBricks
To explain how BitBricks compose to multiply operands with variable bitwidths, the discussion below uses a 4-bit multiplication as an example. As mentioned, a multiply operation with power-of-2 bitwidths can be decomposed into 2-bit multiplies that can execute using BitBricks. Figure 7(a) illustrates this mathematical property for a multiplication between two 4-bit operands. The 4-bit multiplication in Figure 7(a) decomposes into four 2-bit multiplications, shown in Figure 7(b). The decomposed multiplications execute using BitBricks to generate decomposed products, as shown in Figure 7(c). The decomposed products require shifting before being put together. For a 4-bit multiplication using BitBricks, the results from the decomposed 2-bit multiplications are left-shifted by 4, 2, 2, and 0 bits, as shown in Figure 7(c).
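The decomposition of a 4-bit multiply into four 2-bit multiplies with shift amounts of 4, 2, 2, and 0 can be verified with a short sketch (unsigned operands for simplicity):

```python
def mul4x4_decomposed(x: int, y: int) -> int:
    """4-bit x 4-bit unsigned multiply built from four 2-bit
    multiplies, with the decomposed products left-shifted by
    4, 2, 2, and 0 bits before being summed."""
    xh, xl = x >> 2, x & 0b11  # high and low 2-bit halves of x
    yh, yl = y >> 2, y & 0b11  # high and low 2-bit halves of y
    return (xh * yh << 4) + (xh * yl << 2) + (xl * yh << 2) + (xl * yl)

# Exhaustive check over all 4-bit operand pairs.
assert all(mul4x4_decomposed(a, b) == a * b
           for a in range(16) for b in range(16))
print(mul4x4_decomposed(13, 11))  # 143
```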
Dynamic bitwidth flexibility. The bitwidths of the operands dictate how the results from the decomposed multiplications are left-shifted (multiplied by a power of 2) before being added together. By adding flexibility in the shifting logic, the BitBricks can support 2-bit and even mixed-bitwidth (2-bit × 4-bit) multiplications. Figure 7 shows the summation of two 2-bit × 4-bit multiplications. The operation in Figure 7 breaks down into four 2-bit decomposed multiplications that map to four BitBricks. Both the single 4-bit × 4-bit operation in Figure 7(a) and the two 2-bit × 4-bit operations in Figure 7 require the same number of BitBricks. Therefore, the performance at 2-bit × 4-bit is twice that of 4-bit × 4-bit. The only difference between the operations in Figure 7(a) and Figure 7 is the shift amounts required by the decomposed products. Similarly, when operating at 2-bit × 2-bit, each BitBrick can perform a single multiplication by setting all the shift amounts to zero.
Supporting arbitrary bitwidths. The discussion so far shows how multiply operations between 2-bit and 4-bit operands map to BitBricks. The same mathematical property can be applied recursively to support operands wider than 4 bits. Bit Fusion supports up to 16-bit operands by first recursively breaking down the 16-bit multiplication into 8-bit, then 4-bit, and finally 2-bit multiplications, which can execute using BitBricks. For a multiplication between 16-bit operands x and y, the recursion can be expressed mathematically as follows.
(1) x × y = (2^8 × x_MSB + x_LSB) × (2^8 × y_MSB + y_LSB)
(2) x × y = 2^16 (x_MSB × y_MSB) + 2^8 (x_MSB × y_LSB + x_LSB × y_MSB) + (x_LSB × y_LSB)
x_MSB and x_LSB refer to the most significant and least significant 8 bits of x, respectively (and similarly for y). By applying the above equation recursively, Bit Fusion supports up to 16-bit operands. When one of the operands' bitwidths is larger than the other's, we use the formulation below.
(3) x × y = 2^8 (x_MSB × y) + (x_LSB × y)
Each level of recursion, from 16 bits to 8 bits, 8 bits to 4 bits, and 4 bits to 2 bits, requires additional shift-add logic. The overhead from the shift-add logic represents the hardware cost of bit-level flexibility. The next subsection details the design of a Fused-PE that uses BitBricks to execute multiply-adds with variable bitwidths, up to 16 bits.
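The recursion of Equations (1) and (2) bottoms out at the 2-bit multiplies a BitBrick performs. A sketch for unsigned operands with power-of-2 bitwidths:

```python
def fused_mul(x: int, y: int, bits: int) -> int:
    """Recursively decompose a bits x bits unsigned multiply down to
    2-bit multiplies (the BitBrick granularity), recombining the
    partial products with shift-adds as in Equations (1) and (2)."""
    if bits == 2:
        return x * y  # a single BitBrick handles a 2-bit multiply
    h = bits // 2
    mask = (1 << h) - 1
    x_msb, x_lsb = x >> h, x & mask
    y_msb, y_lsb = y >> h, y & mask
    return ((fused_mul(x_msb, y_msb, h) << bits)
            + ((fused_mul(x_msb, y_lsb, h)
                + fused_mul(x_lsb, y_msb, h)) << h)
            + fused_mul(x_lsb, y_lsb, h))

assert fused_mul(51234, 60001, 16) == 51234 * 60001
print(fused_mul(0xABCD, 0x1234, 16))  # matches 0xABCD * 0x1234
```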
III-C Fused-PE Micro-Architecture
To enable bit-level composability, Bit Fusion introduces spatial fusion, a paradigm that spatially combines the decomposed products generated by multiple BitBricks in a single cycle. Prior works [2, 38], on the other hand, devise temporal designs that use single-bit multiply-add units independently over the span of multiple cycles. The following elaborates on these two approaches. To offer a fair comparison, we assume that even the temporal design uses 2-bit multipliers, a configuration that provides better area, delay, and power than a fully bit-serial design.
Temporal design. Figure 10 shows a temporal design that can support variable bitwidths. The variable-bitwidth multiply operation for the temporal design consists of three steps: (1) a 2-bit multiplication to generate a partial product, (2) a shift operation to multiply by the appropriate power of 2, and (3) accumulation in a register. The temporal design requires multiple cycles to execute a variable-bitwidth multiplication. The shift operation is simply a multiplexer (mux). Compared to a fixed-bitwidth multiplier, the temporal design uses much smaller multiply units for 2-bit operands, which require significantly less area. However, the number of gates required for the shifter and the accumulator depends on the highest supported bitwidth (16-bit for Bit Fusion). For instance, to support up to 16 bits using a temporal design, the shifter and the accumulator use up a large fraction of the area, which limits the benefits provided by this approach. Nevertheless, the temporal design reduces area consumption over a fixed-bitwidth multiplier sized for the highest required bitwidth.
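The temporal scheme can be sketched as a cycle-counting model (our own simplification; a real design would pipeline these steps):

```python
def temporal_mul(x: int, y: int, bits: int):
    """Temporal variable-bitwidth multiply: each cycle performs one
    2-bit x 2-bit multiply, shifts the result to its bit position,
    and accumulates it in a register. A bits x bits multiply
    therefore takes (bits/2)**2 cycles."""
    acc, cycles = 0, 0
    for i in range(0, bits, 2):          # 2-bit slices of x
        for j in range(0, bits, 2):      # 2-bit slices of y
            acc += (((x >> i) & 0b11) * ((y >> j) & 0b11)) << (i + j)
            cycles += 1
    return acc, cycles

print(temporal_mul(200, 100, 8))  # (20000, 16): 16 cycles for 8x8
```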
Spatial fusion. In contrast, our spatial multiplier spatially combines (or fuses) the results from four BitBricks in a single cycle to execute either one 4-bit × 4-bit multiplication, two 2-bit × 4-bit multiplications, or four 2-bit × 2-bit multiplications. Figure 10 illustrates the design of a spatial multiplier that supports up to 4 bits for either of the two operands using four BitBricks. Similar to the temporal design, the spatial multiplier requires three steps: (1) multiplication using BitBricks, (2) shift-add using the shift-add tree, and (3) accumulation of results in a register. The spatial multiplier improves upon the temporal design by using a shift-add tree and a single shared accumulator to reduce the number of gates required. Each level of the shift-add tree consists of three shift units and a four-input adder that realize the multiplications by powers of 2 in Equations (2) and (3). Compared to a fixed 4-bit multiplier, the spatial multiplier requires more area but delivers higher performance for lower-bitwidth operations. Overall, spatial fusion provides higher efficiency than the temporal design by packing more multipliers in the same area.
Fused-PE using spatio-temporal fusion. As discussed, a Fused-PE can execute variable-bitwidth multiply-add operations and supports 2-bit to 16-bit operands. Using Equations (2) and (3) recursively, we can realize a Fused-PE using either the temporal design, spatial fusion, or a combination of both. For a fixed area budget, using spatial fusion with 64 BitBricks per Fused-PE would pack the highest number of multipliers. At the same time, feeding the 64 BitBricks for spatial fusion would require very wide accesses to the SRAM buffers (IBUF and WBUF in Figure 3) per Fused-PE. Increasing the width of the SRAMs increases the area required by the IBUF and WBUF. Therefore, we make a tradeoff wherein we use spatial fusion to combine 16 BitBricks spatially to support up to 8-bit operands, and then combine it with the temporal design to support up to 16-bit operands over four cycles. This hybrid approach balances both bit-level flexibility and the corresponding area overhead due to increased SRAM sizes. Figure 10 compares the area and power requirements of a Fused-PE with 16 BitBricks that uses the hybrid approach against a temporal design using 16 BitBricks. As shown, for 16 BitBricks, the hybrid approach has 3.5× less area and 3.2× less power compared to the temporal design with the same number of 2-bit multipliers.
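The hybrid scheme can be sketched as follows: a 16-bit × 16-bit multiply executes over four temporal cycles, each performing one spatial 8-bit × 8-bit multiply on 8-bit slices of the operands (an unsigned sketch, our own simplification):

```python
def hybrid_mul16(x: int, y: int) -> int:
    """Spatio-temporal 16-bit x 16-bit unsigned multiply: four
    cycles, each a spatial 8x8 multiply over 8-bit operand slices,
    shifted to its bit position and accumulated in a register."""
    acc = 0
    for i in (0, 8):          # 8-bit slices of x
        for j in (0, 8):      # 8-bit slices of y; one 8x8 per cycle
            acc += (((x >> i) & 0xFF) * ((y >> j) & 0xFF)) << (i + j)
    return acc

assert hybrid_mul16(54321, 65535) == 54321 * 65535
```

Compared to the fully-temporal model, which needs 64 cycles of 2-bit multiplies for a 16 × 16 product, this hybrid needs only four wider steps, at the cost of the spatial shift-add tree.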
Comparison to bit-serial temporal execution. Prior works, Stripes [2], UNPU [39], and [38], devise bit-serial computation as a means to support flexible bitwidths for DNN operations. Of the three, the design in [38] is a fully-temporal architecture, similar to the temporal design discussed above (Figure 10). Stripes [2] and UNPU are hybrid designs that fix the bitwidth of one operand and support variable bitwidths for the other. We provide a head-to-head comparison to Stripes in Section V-B4 and a qualitative comparison to the fully-temporal approach below. As the results from Figure 10 indicate, for the same throughput, a fully-temporal design would consume significantly larger area and power compared to our spatially composable Fused-PEs. Furthermore, a fully-temporal design iterates in the form of a nested loop over the bits of the two operands, hence requiring a larger number of accesses to the SRAM.
The next section discusses the ISA that exposes the bit-level flexibility of Bit Fusion to software.
IV Instruction Set Architecture
To leverage the unique bit-level flexibility of Bit Fusion, we need to design a new hardware-software interface that exposes those capabilities in an abstract manner. Furthermore, the abstraction must be flexible enough to enable a wide range of DNN models to exploit bit-level fusion. The following lists the requirements for an ISA that provides this abstraction and enables efficient use of Bit Fusion for various categories of DNNs.
Amortize the cost of bit-level fusion by grouping operations. The operations in a DNN are organized into groups, called layers, wherein the same mathematical operation repeats a large number of times (often hundreds of thousands). To avoid the overhead of fine-grained control over operations at such a scale, the abstraction needs to amortize the cost of bit-level fusion across blocks of instructions that implement the layers.
Enable a flexible datapath for the Fused-PEs. Both the number of words and the bitwidth of each word that feed the Fused-PEs vary depending on how the BitBricks are composed, as discussed in Section II. Thus, the semantics of instructions for data accesses must vary according to the fusion configuration to enable a flexible datapath.
Provide a concise expression for a wide range of DNN layers. As research in DNNs is still volatile, it is necessary to devise an ISA that is general enough to express a wide range of DNN operations/layers, yet minimizes the von Neumann overhead of instruction handling and requires a small footprint.
IV-A An ISA for Bit-Flexible Acceleration
Table I summarizes the ISA that aims to satisfy these requirements. The rest of this section discusses the instruction formats and provides the insight that drives them.
Block-structured ISA for DNN layers. To leverage the commonalities in the operations of a layer, the ISA is block-structured. As such, the fusion configuration of the BitBricks is fixed across each block of instructions that implements a specific layer. In this work, we did not explore within-layer bitwidth variation. Nevertheless, the ISA and this incarnation of its microarchitecture can readily support it by using multiple instruction blocks for an individual layer. The setup instruction marks the beginning of an instruction block and configures the Fused-PEs and their data-delivery logic to the specified bitwidths of the operands. This instruction effectively defines the logical fusion of the BitBricks into Fused-PEs for all the instructions in the block. The blockend instruction signifies the end of a block and provides the address of the next instruction in its nextinst field.
Concise expression of DNN layers. DNNs consist of a large number of simple operations, like multiply-accumulate and max, repeated over a large number of neurons (over 2600 million multiply-adds in AlexNet; see Table II). Thus, the von Neumann overhead of instruction fetch and decode can limit performance due to the large number of operations required by a DNN. To minimize the number of instruction fetches/decodes required, we leverage the following insight. Each layer in a DNN is a series of simple mathematical operations over hyper-dimensional arrays. How the operations walk through the array elements and the type of mathematical operation (multiply-add/max) uniquely define a layer. As such, the ISA provides loop instructions that enable a concise way of defining the walks and operations in a DNN layer. Each loop instruction has a unique ID in the block. As shown in Table I, the numiterations field in the loop instruction defines the iteration count. The compute instruction specifies the type of operation, while the genaddr instruction dictates how to walk through the elements of the input/output hyper-dimensional arrays. The stride field in the genaddr instruction specifies how to walk through the array elements in the loop, which is identified by the loopid field. The words after the setup instruction define the memory base addresses for the data that fill the three buffers of input, output, and weights. The genaddr instruction generates the addresses that walk through the memory data and fill the buffers.
(4) Address = Base + Σ_l (stride_l × iteration_l)
In Equation (4), l ranges over the loopid fields of the genaddr instructions in the block, and iteration_l is the current iteration of the corresponding loop. The fundamental assumption is that multiple genaddr instructions, repeated by corresponding loop instructions, define the complex multi-dimensional walks that express various kinds of DNN layers, from LSTM to CNN. In the evaluated benchmarks, blocks with 30–86 instructions are enough to cover LSTM, CNN, pooling, and fully-connected layers. These blocks use a combination of loop, compute, and genaddr instructions to define the nested loops of these DNN layers. These statistics show that our ISA can concisely express various DNN layers while providing bit-level fusion capabilities. Note that these instructions are fetched and decoded once at the beginning of an instruction block, amortizing the von Neumann overhead over the entire execution of the block.
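The address-generation rule of Equation (4) can be sketched as follows (the loop nest, strides, and base address below are hypothetical examples, not values from the paper):

```python
def gen_addr(base: int, strides, iterations) -> int:
    """Sketch of genaddr semantics per Equation (4): the address is
    the base plus, for every loop l in the block, stride[l] times
    that loop's current iteration count."""
    return base + sum(s * i for s, i in zip(strides, iterations))

# Hypothetical 2-deep loop nest over a row-major matrix of 2-byte
# words with 64 columns: row stride = 128 bytes, column stride = 2.
print(gen_addr(0x1000, [128, 2], [3, 5]))  # 4096 + 384 + 10 = 4490
```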
Managing memory accesses. The ldmem/stmem instructions exchange data between the on-chip buffers (IBUF, OBUF, and WBUF in Figure 3) and the off-chip memory. Similarly, the rdbuf/wrbuf instructions read/write data from/to the on-chip buffer specified by the scratchpadtype field, as shown in Table I. In these four instructions, the size of the operands, which are variable-bitwidth arrays, depends on the number of array elements and their bitwidths. These parameters, which control the logic that feeds the Fused-PEs, depend on the bit-level fusion configuration (the number of BitBricks in each Fused-PE) and the type of data (inputs/weights). To capture this variation in the size of data, the semantics of the rdbuf/wrbuf and ldmem/stmem instructions for accessing on-chip and off-chip memory vary according to the fusion configuration of their instruction block, set a priori. In particular, the sizes of memory accesses by ldmem/stmem instructions depend on both the numwords field and the fusion configuration defined by the corresponding setup instruction.
Decoupling on-chip and off-chip memory accesses. The data required by DNNs, and subsequently the number of memory accesses, are large. Hence, the latency of off-chip memory accesses can be a performance bottleneck. To hide this latency, the ISA decouples the on-chip memory accesses from the off-chip accesses. Furthermore, decoupling the two types of memory accesses allows the accelerator to reuse on-chip data using simple scratchpad buffers, instead of hardware-managed caches.
IV-B Code Optimizations
As discussed in Section IV, the ISA uses simple instructions combined with explicit loop instructions to express neural networks. The use of simpler instructions makes the ISA flexible enough to express a large range of DNNs. Moreover, the flexibility in the ISA enables incorporating layer-specific optimizations that improve the performance and energy gains. For brevity, we use an example fully-connected layer to discuss the code optimizations. Figure 11 shows the matrix-matrix multiplication associated with this example. We perform the following three optimizations, as depicted in Figure 12.
Loop ordering. Loop ordering optimizes the order of the outer loops and memory instructions to further reduce off-chip accesses. Recall that the ISA uses loop indices to generate memory addresses (Section IV). When the address for a memory instruction does not depend on the index of the previous loop instruction, their order can be exchanged. The optimized code in Figure 12(b) uses an Output-Stationary dataflow for executing the fully-connected layer, to reduce read/write accesses to the output buffer. Changing the order allows Bit Fusion to switch between Input-Stationary, Output-Stationary, and Weight-Stationary dataflows to minimize off-chip and on-chip accesses.
Loop tiling. Loop tiling partitions a loop instruction in the ISA into smaller tiles such that the data required by a loop operation fits inside the on-chip scratchpads. The smaller tiles are accessed using a single LD/ST instruction and are reused in the inner loop to reduce off-chip accesses. Compared to the original code in Figure 12(a), the tiled version in Figure 12(b) reduces off-chip accesses for the output buffer by a factor of IC and also reduces on-chip accesses to the output buffer. Note that IC is a dimension of the matrix multiplication operation, as depicted in Figure 11. Convolution layers typically require six loop instructions, which increases to 12 after tiling optimizations. The overhead of the increased number of instructions on performance is negligible, since the cost of fetch and decode is amortized over the execution of the layer.
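The effect of tiling on output-buffer traffic can be sketched with a functional model of the fully-connected layer (the function name and tile size are ours):

```python
def fc_tiled(W, x, tile_oc):
    """Tiled fully-connected layer: a tile of outputs lives in the
    on-chip OBUF and is reused across the entire IC loop, so each
    output element crosses the off-chip boundary once per layer
    instead of once per input-channel step."""
    OC, IC = len(W), len(W[0])
    out = [0] * OC                       # off-chip result
    for oc0 in range(0, OC, tile_oc):    # one LD/ST pair per tile
        tile = [0] * tile_oc             # resides in on-chip OBUF
        for ic in range(IC):             # inner loop reuses the tile
            for t in range(tile_oc):
                tile[t] += W[oc0 + t][ic] * x[ic]
        out[oc0:oc0 + tile_oc] = tile    # single store per tile
    return out

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(fc_tiled(W, [1, 1], 2))  # [3, 7, 11, 15]
```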
Layer fusion. As discussed, the architecture consists of a 2D systolic array of multipliers, along with a 1D array of pooling/activation units. When two or more consecutive layers use mutually exclusive on-chip resources, the instructions for the layers are combined such that the data produced by the first layer is directly fed into the subsequent layer, avoiding costly off-chip accesses. For example, the fully-connected layer in Figure 11 uses the 2D systolic array. If the next layer is an activation, we can fuse the layers and create one block of instructions that computes both layers.
V Evaluation
V-A Methodology
Benchmarks. Table II shows the list of eight CNN and RNN benchmarks from diverse domains, including image classification, object and optical character recognition, and language modeling. The selected DNN benchmarks use a diverse range of input data sizes, which allows us to evaluate the effect of input data size on the architecture. AlexNet [40, 36], SVHN [41, 35], CIFAR10 [42, 35], LeNet5 [43, 34], VGG7 [44, 34], and ResNet18 [45, 36]
are popular and widely-used CNN models. Among them, AlexNet and ResNet18 are image-classification applications with different network topologies that use the ImageNet dataset. The SVHN and LeNet5 benchmarks are optical character recognition applications that recognize house numbers from street-view photos and handwritten/machine-printed characters, respectively. CIFAR10 and VGG7 are object-recognition applications based on the CIFAR-10 and ImageNet datasets, respectively. The RNN
[35] and LSTM [46, 35] are recurrent networks that perform language modeling on the Penn Treebank dataset [47]. In Table II, the “Multiply-Add Operations” column shows the number of multiply-add operations required by each model, and the “Model Weights” column shows the size of the model parameters. Note that the multiply-add operations and model weights have variable bitwidths, as presented in Figure 1.
Reduced-bitwidth DNN models. We aim to accelerate the inference of a wide range of DNN models with varying bitwidth requirements, with no loss in classification accuracy. The benchmarks, listed in Table II, employ the model topologies proposed in prior work [32, 35, 34, 36]
that train low-bitwidth DNNs and achieve the same accuracy as 32-bit floating-point models. We did not engineer these quantized DNNs and merely took them from the existing deep learning literature
[32, 35, 34, 36]. Benchmarks Cifar10, SVHN, LSTM, and RNN use the quantized models presented in [35]. Benchmarks LeNet5 and VGG7 use ternary (−1, 0, +1) networks [34]. AlexNet and ResNet18 use the wide models presented in [36], which double the number of channels in the convolution and fully-connected layers. We use the regular AlexNet and ResNet18 models for Eyeriss and the GPU baselines, and their wide quantized models for the variable-bitwidth accelerators.
Accelerator development and synthesis. We implement the configuration of the architecture in RTL Verilog and verify the design through extensive RTL simulations. We synthesize the design at the 45 nm technology node using Synopsys Design Compiler (L2016.03-SP5) and a commercial standard-cell library. Design Compiler provides the chip area, achievable frequency, and dynamic/static power, which we use to estimate the performance and energy efficiency of the accelerator.
Simulation infrastructure. We compile each DNN benchmark to the instructions of the ISA (Section IV). We develop a cycle-accurate simulator that takes the instructions for the given DNN and simulates the execution to calculate the cycle counts, as well as the number of accesses to the on-chip buffers (IBUF, OBUF, and WBUF in Figure 3) and off-chip memory. We verify the cycle counts of the simulator against our Verilog implementation of the architecture. Using the frequency defined in Table III and the cycle counts, the simulator measures the execution time of the architecture. To evaluate energy efficiency, we model the energy consumption of the on-chip buffers using the results from CACTI-P [48].
Comparison with Eyeriss. To measure the performance and energy dissipation of our comparison point, Eyeriss, we use its open-source simulation infrastructure [4]. The resulting area and energy metrics are shown in Table III. As mentioned, we use the same area budgets as Eyeriss, 1.1 mm² for the compute units and 5.87 mm² for the chip, to synthesize our design, as shown in Table III. We use a total of 112 KB of SRAM for the on-chip buffers (IBUF, OBUF, and WBUF in Figure 3). Eyeriss operates on 16-bit operands, while our design supports flexible bitwidths of 2, 4, 8, and 16 bits.
Comparison with Stripes. The authors of Stripes graciously shared their simulator [2]. Their power estimation tools target the 65 nm node, which we scale to 45 nm. Stripes operates on 16-bit inputs and variable-bitwidth weights (1 through 16 bits), using Serial Inner-Product units (SIPs). Stripes is organized into 16 tiles, each of which has 4096 SIPs. For a fair comparison, we replace the 4096 SIPs in each tile with our proposed systolic array of 512 compute units, each with 16 elements, to match the same 1.1 mm² budget for compute (the area after scaling to 45 nm), and use the same total on-chip memory.
Comparison with GPUs. We use two GPUs (Titan X and Tegra X2), both based on Nvidia's Pascal architecture, as comparison points. Table III shows the details of the two GPUs. We use Nvidia's TensorRT 4.0 [49] library compiled with the latest CUDA 9.0 and cuDNN 7.1, which support 8-bit quantized calculations, the smallest possible on this architecture. Across GPU platforms, we use 1,000 warmup batches, followed by 10,000 batches to measure performance, and use the average. For a head-to-head comparison, we conservatively scale our design to the 16 nm technology node, assuming voltage and capacitance scaling according to the methodology presented in [50]. However, we assume the same frequency of 500 MHz and do not increase the frequency. The scaled architecture has 4096 compute units with 896 KB of SRAM, has a total chip area of 5.93 mm², and consumes 895 milliwatts of power. As a point of reference, in the same 16 nm node, Titan X has a chip area of 471 mm² and a TDP of 250 Watts, as summarized in Table III.
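The technology-scaling estimate follows the classical dynamic-power relation P = αCV²f. A minimal sketch (the scaling ratios below are made up for illustration and are not the factors used in [50]):

```python
# Hypothetical illustration of dynamic-power technology scaling,
# P = alpha * C * V^2 * f. The actual voltage/capacitance factors
# come from the methodology in [50]; the ratios below are made up.

def scale_dynamic_power(p_old, c_ratio, v_ratio, f_ratio=1.0):
    # Each ratio is new_value / old_value at the target node.
    return p_old * c_ratio * (v_ratio ** 2) * f_ratio

# e.g. capacitance halves and voltage drops to 80% at the new node,
# frequency held constant (as in the evaluation above)
p_scaled = scale_dynamic_power(1.0, c_ratio=0.5, v_ratio=0.8)
```

Holding frequency at 500 MHz (f_ratio = 1.0) makes the estimate conservative, since a smaller node would normally also permit a higher clock.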
V-B Experimental Results
V-B1 Comparison to Eyeriss
Performance and energy improvement. To evaluate the performance and efficiency benefits of the architecture, we compare against Eyeriss, a state-of-the-art accelerator [1] that proposes an optimized dataflow architecture for DNNs. We match the same area budget of 1.1 mm² for computational logic across both architectures (the systolic array in our design and the PEs in Eyeriss) and match the total SRAM capacity. We scale the area and energy consumption of the PEs, register files, on-chip network, and DRAM in Eyeriss to 45 nm technology according to the methodology proposed in [4]. For a fair comparison between the two architectures, we use the same frequency of 500 MHz reported in the paper [4] for both designs. Figure 13 presents the performance and energy benefits of our design in comparison with Eyeriss. On average, our design delivers a speedup, since it can perform more DNN operations at lower bitwidth in a given area. Depending on the types of DNN operations and the required bitwidths, the benchmarks see different performance gains. The CNN benchmarks (AlexNet, SVHN, Cifar10, LeNet5, VGG7, and ResNet18) see higher performance gains than the recurrent networks (RNN and LSTM), since convolution operations are more amenable to data reuse in a systolic architecture. Cifar10 sees the highest benefit of 13× speedup, since most of its operations can be computed with the smallest bitwidth (1-bit input and 1-bit weight) and its operations provide a large degree of parallelism that can exploit the increased number of compute units. In contrast, ResNet18 and AlexNet achieve the lowest speedup of 1.9×, because these two benchmarks use twice the number of channels (2× wide) in the convolution and fully-connected layers [36] for quantized execution. We use the original AlexNet and ResNet18 models on Eyeriss, which effectively require fewer multiply-add operations. Overall, using variable bitwidth improves performance and energy efficiency, since it increases compute capacity and reduces active hardware components.
Figure 13 also shows the energy reduction. The largest improvement is 14×, for Cifar10, and the smallest is 1.5×, for AlexNet. The significant energy reduction is attributed to both the organization of the compute units and the reduction in memory accesses, which we discuss below in more detail.
Energy breakdown.
To understand the sources of the energy reduction, we break down the energy consumption of each hardware component (compute units, on-chip SRAM buffers, register file, and off-chip DRAM memory). Figure 14 shows the per-component energy dissipation for both accelerators. This figure should be considered together with the energy-reduction results in Figure 13. Both accelerators consume more than 80% of their energy in on-chip and off-chip memory accesses. Bit-level flexibility in memory accesses significantly reduces energy consumption for both the on-chip buffers (IBUF, OBUF, and WBUF in Figure 3) and the off-chip DRAM. Furthermore, with bit-level flexibility, our buffers can hold more data at lower bitwidths, effectively providing more on-chip storage capacity, which leads to fewer off-chip memory accesses. Eyeriss employs local register files within each PE, which constitute a significant portion of its energy consumption. Our systolic architecture avoids the need for register files and enforces explicit data sharing for inputs and partial results, as shown in Figure 3. Therefore, our design saves on register-file energy but requires more SRAM accesses. The combined effect of bit-level flexibility and the systolic organization provides the average energy savings. Off-chip DRAM accesses, however, remain a significant portion of the energy consumption, and their share grows due to the significant reduction in compute and on-chip storage energy.
V-B2 Sensitivity Study
Sensitivity to memory bandwidth.
Depending on the DNN topology, the impact of off-chip bandwidth on performance varies. To understand the correlation between bandwidth and performance, we perform a sensitivity study. Figure 15 shows the performance as we change the bandwidth from 0.25× to 4× of the default value. The baseline in this study is the design with the default bandwidth of 128 bits per cycle. On average, when we scale the bandwidth up to 4×, the design provides a 1.6× speedup compared to the default setting, while with 0.25× bandwidth, performance degrades by 60%. Since the CNN benchmarks see more opportunities for data reuse, they are less sensitive to bandwidth than the RNN benchmarks. The two RNN benchmarks, LSTM and RNN, show almost linearly-scaling speedup, as they are bottlenecked by bandwidth.
Sensitivity to batch size. Batching amortizes the cost of weight reads by sharing weights across a batch of inputs. Figure 16 shows how performance changes as we increase the batch size from 1 through 256, with batch size 1 (no batching) as the baseline. Our default batch size is 16. On average, a batch size of 256 yields a 2.7× speedup, with the highest speedup of 21.4× for RNN. Since batching is effective when bandwidth is limited and performance is bandwidth-bound, the trends are similar to the bandwidth-sensitivity results presented in Figure 15. However, the gain is marginal across all benchmarks when the batch size is increased from 64 to 256, since beyond a batch size of 64 the bandwidth is sufficient to keep all the compute units occupied.
V-B3 Comparison to GPUs
Performance comparison to GPUs.
GPUs are the most widely-used general-purpose processors for DNNs. We compare the performance of our accelerator with two GPUs: (1) Tegra X2 (TX2) and (2) Titan X, both based on the Pascal architecture, the details of which are presented in Table III. As mentioned in Section V-A, we scale our design to match the 16 nm technology node of the GPUs and use a total of 4096 compute units. Figure 17 shows the speedups of Titan X and our design using TX2 as the baseline. TX2 does not support the 8-bit mode natively; due to this lack of support, empirical results show a slowdown when 8-bit instructions are used on TX2. As Figure 17 depicts, Titan X in single-precision floating point (FP32) is, on average, faster than TX2, and the speedup grows when the 8-bit mode is used. While GPUs can benefit from using as low as 8 bits, our design can extract performance benefits from operations as low as 2 bits. Using bit-level composability, our design provides a speedup over TX2. The VGG7 benchmark sees the maximum gains of 30× and 48× from Titan X and our design, respectively. The high degree of parallelism in VGG7 enables both platforms to utilize all the available on-chip compute resources. Our design, while consuming 895 milliwatts of power, is only 16% slower than the 250-Watt Titan X using 8-bit computations, almost matching its performance.
V-B4 Comparison to Stripes
Performance compared to Stripes. Figure 18 presents the performance and energy benefits of our design in comparison with Stripes. On average, our design provides a speedup over Stripes. Stripes uses bit-serial computation to support variable bitwidths only for the DNN weights. In contrast, our architecture offers dynamically composable compute units that support flexible bitwidths for both the inputs and the weights. Our design achieves its highest speedup of 5.2× and lowest speedup of 1.8× over Stripes for the LeNet5 and AlexNet benchmarks, respectively. ResNet18, which is the most recent and largest of the benchmarks, sees 2.6× performance benefits, as it can use low bitwidths for both operands. AlexNet uses 8-bit inputs/weights for the first convolution layer and the last fully-connected layer; these two 8-bit layers limit the benefits over Stripes. Benchmark LeNet5, on the other hand, uses low bitwidths for both inputs and weights, resulting in the highest performance benefits.
Energy reduction compared to Stripes. Figure 18 also depicts the improvement in energy compared to Stripes. As mentioned, our design benefits from a reduction in both computation and memory accesses at lower bitwidths for both inputs and weights. On average, our design reduces energy consumption over Stripes. LeNet5 sees the highest energy reduction of 7.8×, while AlexNet sees the least energy reduction of 2.7×. For ResNet18, the energy is reduced by a factor of 4.
Our design offers a fundamentally different approach from Stripes, exploring the dimension of bit-level dynamic composability, which significantly improves performance and energy.
VI Related Work
A growing body of related work develops DNN accelerators. Our work fundamentally differs from prior efforts, as it introduces and explores a new dimension of bit-level composable architectures that can dynamically match the bitwidth required by DNN operations. The design aims to minimize both computation and communication at the finest granularity possible without compromising DNN accuracy. Below, we discuss the most related work.
Precision flexibility in DNNs. Stripes [2] and Tartan [6] use bit-serial compute units to provide precision flexibility for inputs at the cost of additional area overhead. Both works provide performance and efficiency benefits that are proportional to the precision reduction of the inputs. We directly compare our benefits with those of Stripes in Section V. UNPU [39] fabricates a bit-serial DNN accelerator at 65 nm, similar to Stripes [2]. Loom [38] also uses bit-serial computation for precision flexibility. DeepRecon [51] skips stages of a fully-pipelined floating-point multiplier to perform either one 16-bit, two 12-bit, or four 8-bit multiplications. In contrast, our compute units are spatial designs that use combinational logic to dynamically compose and decompose 2-bit multipliers to construct variable-bitwidth multiply-add units. Moons et al. propose aggressive voltage-scaling techniques at low precision for increased energy efficiency at constant throughput by turning off parts of the multiplier [37, 52]; as such, they do not offer fusion capabilities. The TPU [30] proposes a systolic architecture for DNNs and supports 8-bit and 16-bit precision. This work, on the other hand, proposes an architecture that dynamically composes low-bitwidth compute units to match the bitwidth requirements of DNN layers.
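The arithmetic behind composing 2-bit multipliers into wider ones can be sketched as a shift-and-add over 2-bit slices (a hypothetical software illustration of the hardware composition; function and parameter names are ours):

```python
# Hypothetical sketch of bit-level composition: a wide unsigned
# multiply is decomposed into 2-bit x 2-bit partial products that
# are shifted and summed, mirroring how narrow multipliers fuse
# into a wider variable-bitwidth multiplier.

def composed_multiply(a, b, bits_a, bits_b):
    total = 0
    for i in range(0, bits_a, 2):
        for j in range(0, bits_b, 2):
            pa = (a >> i) & 0b11           # 2-bit slice of a
            pb = (b >> j) & 0b11           # 2-bit slice of b
            total += (pa * pb) << (i + j)  # shift-and-add
    return total

# 4-bit x 4-bit uses 4 partial products; 8-bit x 8-bit uses 16,
# which matches the quadratic growth in bit-level work.
assert composed_multiply(13, 11, 4, 4) == 13 * 11
```

This also makes the quadratic scaling concrete: halving both operand bitwidths quarters the number of 2-bit partial products.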
Binary DNN accelerators. Several inspiring works have explored ASIC and FPGA accelerators optimized for binary DNNs. FINN [53] uses FPGAs for accelerating binary DNNs, while YodaNN [54] and BRein [55] propose ASIC accelerators for binary DNNs. Kim et al. [56] decompose the convolution weights of binary CNNs to improve performance and energy efficiency. These works focus solely on binary DNNs, achieving high performance at the cost of classification accuracy. Our design, on the other hand, flexibly matches the bitwidths of DNN operations for performance/energy benefits without losing accuracy.
Sparse accelerators for DNNs. EIE [5], Cambricon-X [15], Cnvlutin [13], and SCNN [57] exploit the sparsity in DNN layers and use zero-skipping to provide performance and energy-efficiency benefits. Orthogonal to these works, we explore the dimension of bit-flexible accelerators for DNNs.
Other ASIC accelerators for DNNs. DaDianNao [7] uses eDRAM to eliminate off-chip accesses and provide high performance and efficiency for DNNs. PuDianNao [9]
is an accelerator designed for machine learning, but does not support CNNs. Minerva
[12] proposes operation pruning and data-quantization techniques to reduce power consumption in ASIC acceleration. Eyeriss [1, 3] presents an optimized row-stationary dataflow for DNNs to improve efficiency. Tetris [4] and Neurocube [11] propose 3D-stacked DNN accelerators to provide high bandwidth for DNN operations. ISAAC [26], PipeLayer [28], and Prime [27] use resistive RAM (ReRAM) for accelerating DNNs. Ganax [58] uses a SIMD-MIMD architecture to support DNNs and generative models. Snapea [59] employs early termination to skip computations.
Instruction sets for DNNs. Cambricon [14] provides an ISA to express the different computations in a DNN using vector and matrix operations without significant loss in efficiency over DaDianNao. DnnWeaver [22] proposes a coarse-grained ISA to express the layers of DNNs, which are first translated to microcodes for FPGA acceleration. Unlike prior work, the ISA proposed in this work is designed to enable bit-level flexibility for accelerating DNNs. Further, the ISA uses loop instructions with iterative semantics to significantly reduce the instruction footprint.
Code optimization techniques. Alwani et al. [60] propose layer fusion, which combines multiple convolutional layers to save off-chip accesses in FPGA acceleration of CNNs. Escher [61] proposes a CNN FPGA accelerator with flexible buffering that balances the off-chip accesses for inputs and weights. These works have inspired the code optimizations explored in this paper; the key contribution of this work, however, is a bit-level flexible DNN accelerator.
Software techniques for Binary/XNOR DNNs. QNN
[35] shows that efficient GPU kernels for XNOR-based binary DNNs can provide up to a 3.4× improvement in performance. XNOR-Net [62] shows that specialized libraries for Binary/XNOR-nets can achieve a 58× speedup on CPUs. In contrast, our design is an ASIC accelerator architecture that supports a wide range of bitwidths (binary to 16 bits) for DNNs with no accuracy loss.
Core Fusion and CLPs. Core Fusion [63] and CLPs [64] are dynamically configurable chip multiprocessors in which a group of independent processors can fuse to form a more capable CPU. In contrast to these inspiring works, our design performs the composition at the bit level rather than at the level of full-fledged cores.
VII Conclusion
Deep neural networks use abundant computation but can withstand very low-bitwidth operations without any loss in accuracy. Leveraging this property of DNNs, we develop a bit-level dynamically composable architecture for their efficient acceleration. The architecture comes with an ISA that enables the software to utilize this bit-level fusion capability to maximize the parallelism of the computation and minimize data transfer at the finest granularity possible. We evaluate the benefits of the design by synthesizing the Verilog implementation of the proposed microarchitecture at the 45 nm technology node and by running cycle-accurate simulations with eight real-world DNNs that require different bitwidths in their layers. The design achieves significant speedup and energy benefits compared to state-of-the-art accelerators.
VIII Acknowledgments
We thank Amir Yazdanbakhsh, Divya Mahajan, Jacob Sacks, and Payal Preet Bagga for insightful discussions and comments. This work was supported in part by NSF awards CNS#1703812 and ECCS#1609823, Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) award #FA9550-17-1-0274, and gifts from Google, Microsoft, Xilinx, and Qualcomm.
References
 [1] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ISCA, 2016.
 [2] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in MICRO, 2016.
 [3] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” JSSC, 2017.
 [4] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable and efficient neural network acceleration with 3d memory,” in ASPLOS, 2017.
 [5] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in ISCA, 2016.
 [6] A. Delmas, S. Sharify, P. Judd, and A. Moshovos, “Tartan: Accelerating fullyconnected and convolutional layers in deep learning networks by exploiting numerical precision variability,” arXiv, 2017.
 [7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., “Dadiannao: A machine-learning supercomputer,” in MICRO, 2014.
 [8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ASPLOS, 2014.
 [9] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, “Pudiannao: A polyvalent machine learning accelerator,” in ASPLOS, 2015.
 [10] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: shifting vision processing closer to the sensor,” in ISCA, 2015.
 [11] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with highdensity 3d memory,” in ISCA, 2016.
 [12] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. HernándezLobato, G.Y. Wei, and D. Brooks, “Minerva: Enabling lowpower, highlyaccurate deep neural network accelerators,” in ISCA, 2016.
 [13] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectualneuronfree deep neural network computing,” in ISCA, 2016.
 [14] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in ISCA, 2016.
 [15] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in MICRO, 2016.
 [16] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 gops/s mobile coprocessor for deep neural networks,” in CVPRW, 2014.
 [17] J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, “14.6 a 1.42tops/w deep convolutional neural network recognition processor for intelligent ioe systems,” in ISSCC, 2016.
 [18] F. Conti and L. Benini, “An ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters,” in DATE, 2015.
 [19] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “Deepburning: Automatic generation of fpga-based learning accelerators for the neural network family,” in DAC, 2016.
 [20] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-brain: A deep learning accelerator that tames the diversity of cnns through adaptive data-level parallelization,” in DAC, 2016.
 [21] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in FPGA, 2015.
 [22] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Misra, and H. Esmaeilzadeh, “From high-level deep neural models to fpgas,” in MICRO, 2016.
 [23] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer cnn accelerators,” in MICRO, 2016.
 [24] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks,” in FPGA, 2016.
 [25] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for convolutional neural network,” in FPGA, 2016.
 [26] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with insitu analog arithmetic in crossbars,” in ISCA, 2016.
 [27] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in ISCA, 2016.
 [28] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in HPCA, 2017.
 [29] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengil, M. Liu, D. Lo, S. Alkalay, M. Haselman, C. Boehn, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, T. Juhasz, R. K. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, S. Reinhardt, A. Sapek, R. Seera, B. Sridharan, L. Woods, P. YiXiao, R. Zhao, and D. Burger, “Accelerating persistent neural networks at datacenter scale,” in HotChips, 2017.

 [30] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in ISCA, 2017.
 [31] “Apple A11 Bionic.” https://en.wikipedia.org/wiki/Apple_A11.
 [32] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, “Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv, 2016.
 [33] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv, 2016.
 [34] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv, 2016.
 [35] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv, 2016.
 [36] A. K. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, “WRPN: wide reducedprecision networks,” arXiv, 2017.
 [37] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Dvafs: Trading computational accuracy for energy through dynamic-voltage-accuracy-frequency-scaling,” in DATE, 2017.
 [38] S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” arXiv, 2017.
 [39] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “Unpu: A 50.6 tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” in ISSCC, 2018.
 [40] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv, 2014.
 [41] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, 2011.
 [42] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Computer Science Department, University of Toronto, Tech. Rep, 2009.
 [43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” arXiv, 2014.
 [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
 [46] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural computation, 1997.
 [47] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a large annotated corpus of english: The penn treebank,” Computational linguistics, 1993.
 [48] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques,” in ICCAD, 2011.
 [49] “Nvidia TensorRT 4.0.” https://developer.nvidia.com/tensorrt.
 [50] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in ISCA, 2011.
 [51] T. Rzayev, S. Moradi, D. H. Albonesi, and R. Manohar, “Deeprecon: Dynamically reconfigurable architecture for accelerating deep neural networks,” IJCNN, 2017.
 [52] B. Moons and M. Verhelst, “A 0.3–2.6 tops/w precision-scalable processor for real-time large-scale convnets,” in VLSI Circuits, 2016.

 [53] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “Finn: A framework for fast, scalable binarized neural network inference,” in FPGA, 2017.
 [54] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Yodann: An ultra-low power convolutional neural network accelerator based on binary weights,” arXiv, 2016.

 [55] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, et al., “Brein memory: A 13-layer 4.2 k neuron/0.8 m synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm cmos,” in VLSI, 2017.
 [56] H. Kim, J. Sim, Y. Choi, and L.-S. Kim, “A kernel decomposition architecture for binary-weight convolutional neural networks,” in DAC, 2017.
 [57] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in ISCA, 2017.
 [58] A. Yazdanbakhsh, H. Falahati, P. J. Wolfe, K. Samadi, H. Esmaeilzadeh, and N. S. Kim, “GANAX: A Unified SIMDMIMD Acceleration for Generative Adversarial Network,” in ISCA, 2018.
 [59] V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. K. Gupta, “Snapea: Predictive early activation for reducing computation in deep convolutional neural networks,” in ISCA, 2018.
 [60] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer cnn accelerators,” in MICRO, 2016.
 [61] Y. Shen, M. Ferdman, and P. Milder, “Escher: A cnn accelerator with flexible buffering to minimize off-chip transfer,” in FCCM, 2017.
 [62] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” arXiv, 2016.
 [63] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, “Core fusion: accommodating software diversity in chip multiprocessors,” in ISCA, 2007.
 [64] C. Kim, S. Sethumadhavan, M. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, “Composable lightweight processors,” in MICRO, 2007.