1 Introduction
Convolutional neural networks (CNNs) excel at various important applications, e.g., image classification, image segmentation, robotics control, and natural language processing. However, their high computational complexity necessitates specially-designed accelerators for efficient processing. Training of CNNs requires an enormous amount of computing power to automatically learn the weights based on a large training dataset. Few ASIC-based CNN training accelerators have been presented [10, 64, 69]. However, graphics processing units (GPUs) typically play a dominant role in the training phase, as CNN computation maps well to their single-instruction multiple-data (SIMD) units, and the large number of SIMD units present in GPUs provides significant computational throughput for training CNNs [18, 11]. In addition, the higher clock speed, bandwidth, and power management capabilities of Graphics Double Data Rate (GDDR) memory relative to regular DDR memory make GPUs the de facto accelerator choice for CNN training. On the other hand, CNN inference is more latency- and power-sensitive, as an increasing number of applications need real-time CNN evaluations on battery-constrained edge devices. Hence, ASIC- and FPGA-based accelerators have been widely explored for this purpose. However, they can only process low-level CNN operations, such as convolution and matrix multiplication, and lack the flexibility of a general-purpose processor. Although CNN models have evolved rapidly in recent years, their fundamental building blocks are common and long-lasting. Therefore, ASIC- and FPGA-based accelerators can efficiently process new CNN models with their domain-specific architectures. FPGA-based accelerators achieve faster time-to-market and enable prototyping of new accelerator designs. Microsoft has used customized FPGA boards, called Catapult [47], in its data centers to accelerate Bing ranking by 2×. An FPGA-based CNN accelerator that uses on-chip memory has been proposed in [45], where a fixed-point representation is used to keep all the weights stored in on-chip memory, thus avoiding the need to access external memory. To improve dynamic resource utilization of FPGA-based CNN accelerators, multiple accelerators, each specialized for a specific CNN layer, have been constructed using the same FPGA hardware resource [55].
A convolver design for both the convolutional (CONV) layer and fully-connected (FC) layer has been proposed in [48] to efficiently process CNNs on embedded FPGA platforms. ASIC-based CNN accelerators have better energy efficiency and can be fully customized for CNN applications. In [17], CNNs are mapped entirely within the on-chip memory and the ASIC accelerator is placed close to the image sensor so that all the DRAM accesses are eliminated, leading to a 60× energy efficiency improvement relative to previous works. A 1D chain ASIC architecture is used in [66] to accelerate the CONV layers since these layers are the most compute-intensive [10]. To speed up the CONV layers, a Fast Fourier Transform-based fast multiplication is used in [15]. This accelerator encodes weights using block-circulant matrices and converts convolutions into matrix multiplications to reduce the computational complexity from $O(n^2)$ to $O(n \log n)$ and the storage complexity from $O(n^2)$ to $O(n)$. A 3D memory system is used in [21] to reduce memory bandwidth pressure. This enables more chip area to be devoted to processing elements (PEs), thus increasing performance and energy efficiency. To take advantage of the underlying parallel computing resources of CNN accelerators, an efficient dataflow is necessary to minimize data movement between the on-chip memory and PEs. Unlike the temporal architectures, such as SIMD or single-instruction multiple-thread, used in central processing units (CPUs) and GPUs, the Google Tensor Processing Unit (TPU) uses a spatial architecture, called the systolic array
[33]. Data flow into arithmetic logic units (ALUs) in a wave and move through adjacent ALUs in order to be reused. In [4], multiple CNN layers are fused and processed so that intermediate data can be kept in on-chip memory, thus obviating the need for external memory access. A fine-grained dataflow accelerator is proposed in [68]. It converts convolution into data preprocessing and matrix multiplication. Data are directly transferred among PEs, without the need for redundant control logic, as opposed to temporal architectures, such as DianNao [8]. In [39], a flexible dataflow architecture is described for efficiently processing different types of parallelism: feature map, neuron, and synapse. A dataflow called row-stationary is used in [9] to reuse data and minimize data movement on a spatial architecture.

Although various dataflow styles and computational parallelism designs have been explored in recent works, the potential speedup from weight/activation sparsity is still underexplored. The computation and memory footprint of CNNs can be significantly reduced if sparsity is exploited during network evaluations. Some recent works utilize sparsity to speed up CNN evaluations [24, 75, 44, 28, 2, 74]. However, they only consider either activation or weight sparsity, and only use sparsity during CNN inference based on various pruning methods. It has been shown that the average network-wide activation sparsity of the well-known AlexNet CNN [35] during its entire training process is 62% (a maximum of 93%) [49]. Therefore, the training process can be significantly accelerated if sparsity is exploited. Another CNN acceleration technique is to use reduced precision to improve performance and energy efficiency. For example, the TPU and DianNao use 8-bit and 16-bit fixed-point quantizations, respectively, in CNN evaluations. However, low-precision accelerators are currently mainly used in the inference phase, since CNN training involves gradient computation and propagation that require high-precision floating-point operations to achieve high accuracy. Apart from improving the efficiency of the computational resources employed in CNN accelerators, CNN training also requires a large memory bandwidth to store activations and weights. In the forward pass, activations must be retained in memory until the backward pass commences in order to compute the error gradients and update the weights. Besides, in order to fill the SIMD units of GPUs, a large amount of data is needed from memory. Hence, 3D memory systems, such as the hybrid memory cube (HMC) [29] and high bandwidth memory (HBM) [52], have been used in high-end GPUs to provide significant memory bandwidth for CNN training.
In this article, we make the following contributions:
1) We propose a novel sparsity-aware CNN accelerator architecture, called SPRING. It encodes activation and weight sparsity with binary masks and uses efficient low-overhead hardware implementations for CNN training and inference.
2) SPRING uses reduced-precision fixed-point operations for both training and inference. A dedicated module is used to implement the stochastic rounding algorithm [23] to prevent accuracy loss during CNN training.
3) SPRING uses an efficient monolithic 3D nonvolatile RAM (NVRAM) interface to provide significant memory bandwidth for CNN processing. This alleviates the performance bottleneck in CNN training since the training process is usually memory-bound [40].
We test the proposed SPRING architecture on seven well-known CNNs in the context of both training and inference. Simulation results show that the average execution time, power dissipation, and energy consumption are reduced by 15.6×, 4.2×, and 66.0×, respectively, for CNN training, and 15.5×, 4.5×, and 69.1×, respectively, for inference, relative to the Nvidia GeForce GTX 1080 Ti.
The rest of the article is organized as follows. Section 2 discusses the background information required to understand our sparsity-aware accelerator. Section 3 presents the sparsity-aware reduced-precision accelerator architecture. Section 4 describes our simulation setup and flow. Section 5 presents experimental results obtained on seven typical CNNs. Section 6 discusses the limitations of our work. Section 7 concludes the article.
2 Background
In this section, we discuss the background material necessary for understanding our proposed sparsity-aware reduced-precision accelerator architecture. We first give a primer on CNNs. Next, we discuss existing sparsity-aware designs. Then, we discuss CNN training algorithms that use low numerical precision. Finally, we describe an efficient on-chip memory interface that is used for CNN acceleration.
2.1 CNN overview
Although different CNNs have different hyperparameters, such as the number of layers and layer shapes, they share a similar architecture, as shown in Fig. 1. CNNs are generally composed of five building blocks: CONV layers, activation (ACT) layers, pooling (POOL) layers, batch normalization layers (not shown in Fig. 1), and FC layers. Among these basic components, the CONV and FC layers are the most compute-intensive [10]. We describe them next.

CONV layers: a batch of 3D input feature maps is convolved with a set of 3D filter weights to generate a batch of 3D output feature maps. The filter weights are usually fetched from external memory once and stored in on-chip memory as they are shared among multiple convolution windows. Therefore, CONV layers have relatively low memory bandwidth pressure and are usually compute-bound as they require a large number of convolution computations. Given the input feature map I and filter weights W, the output feature map O is computed as follows:
$$O[b][f][x][y] = \sum_{c=1}^{C} \sum_{i=1}^{H_f} \sum_{j=1}^{W_f} I[b][c][(x-1)S_v + i][(y-1)S_h + j] \times W[f][c][i][j] \quad (1)$$

where $I \in \mathbb{R}^{B \times C \times H_{in} \times W_{in}}$, $W \in \mathbb{R}^{F \times C \times H_f \times W_f}$, and $O \in \mathbb{R}^{B \times F \times H_{out} \times W_{out}}$. $B$ is the number of images in a batch and $F$ is the total number of filters in the CONV layer. $C$ represents the number of channels in the input feature maps and filter weights. $H_{in}$ and $W_{in}$ denote the height and width of the input feature maps, respectively, whereas $H_f$ and $W_f$ denote the height and width of the filter weights, respectively. The vertical and horizontal strides are given by $S_v$ and $S_h$, respectively. The height and width of the output feature maps are given by $H_{out} = (H_{in} - H_f)/S_v + 1$ and $W_{out} = (W_{in} - W_f)/S_h + 1$, respectively.

FC layers: the neurons in an FC layer are fully connected with the neurons in the previous layer, with a specific weight associated with each connection. It is the most memory-intensive layer in CNNs [48, 73]
since no weight is reused. The computation of an FC layer can be represented by a matrix-vector multiplication as follows:
$$\mathbf{y} = W\mathbf{x} + \mathbf{b} \quad (2)$$

where $W \in \mathbb{R}^{M \times N}$, $\mathbf{y}, \mathbf{b} \in \mathbb{R}^{M}$, and $\mathbf{x} \in \mathbb{R}^{N}$. The output and input neurons of the FC layer are represented in vector form as $\mathbf{y}$ and $\mathbf{x}$, respectively. $W$ represents the weight matrix and $\mathbf{b}$ is the bias vector associated with the output neurons.
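As a concrete reference for Eqs. (1) and (2), the two layer types can be sketched in NumPy. This is a naive functional model with illustrative function names, not the accelerator's dataflow; it assumes no zero-padding:

```python
import numpy as np

def conv_layer(I, W, Sv=1, Sh=1):
    """Naive CONV layer following Eq. (1).
    I: input feature maps, shape (B, C, H_in, W_in)
    W: filter weights,     shape (F, C, H_f, W_f)
    Sv, Sh: vertical and horizontal strides.
    Returns O with shape (B, F, H_out, W_out)."""
    B, C, H_in, W_in = I.shape
    F, _, H_f, W_f = W.shape
    H_out = (H_in - H_f) // Sv + 1
    W_out = (W_in - W_f) // Sh + 1
    O = np.zeros((B, F, H_out, W_out))
    for b in range(B):
        for f in range(F):
            for h in range(H_out):
                for w in range(W_out):
                    # One convolution window: elementwise product with
                    # filter f, summed over channels and filter extent.
                    window = I[b, :, h*Sv:h*Sv+H_f, w*Sh:w*Sh+W_f]
                    O[b, f, h, w] = np.sum(window * W[f])
    return O

def fc_layer(W, x, b):
    """FC layer following Eq. (2): y = Wx + b."""
    return W @ x + b
```

The sevenfold loop nest in `conv_layer` (batch, filter, output position, and the implicit channel/filter-extent sums) is exactly the computation that the accelerator's dataflow reorders and parallelizes.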
2.2 Exploiting sparsity in CNN accelerators
It is known that the sparsity levels of CNN weights typically range from 20% to 80% [25, 26]
, and when the rectified linear unit (ReLU) activation function is employed, 50% to 70% of the activations are clamped to zero [44]. The combination of weight and activation sparsity can significantly reduce computations and memory accesses if the accelerator supports sparsity-aware operations. In order to speed up CNN evaluation by utilizing weight/activation sparsity, the first step is to encode the sparse data in a compressed format that can be efficiently processed by accelerators. EIE is an accelerator that encodes a sparse weight matrix in a compressed sparse column (CSC) format [65] and uses a second vector to encode the number of zeros between adjacent nonzero elements [24]. However, it is only used to speed up the FC layers and has no impact on the CONV layers. Hence, a majority of CNN computations do not benefit from sparsity-aware acceleration. A lightweight run-time output sparsity predictor has been developed in SparseNN, an architecture enhanced from EIE, to accelerate CNN inference [75]. Activations in the CSC format are first fed to the lightweight predictor to predict the nonzero elements in the output neurons. Then, the activations associated with nonzero outputs are sent to feedforward computations to bypass computations that lead to zero outputs. If the number of computations skipped is large enough, the overhead of output prediction can be offset. However, since the output sparsity predicted by the lightweight predictor is an approximation of the real sparsity value, it incurs an accuracy loss that makes it unsuitable for CNN training. SCNN is another accelerator that uses a zero-step format to encode weight/activation sparsity: an index vector is used to indicate the number of nonzero data points and the number of zeros before each nonzero data point. It multiplies activation and weight vectors in a manner similar to a Cartesian product using an input-stationary dataflow [44]. However, the Cartesian product does not automatically align nonzero weights and activations in the FC layers since the FC layer weights are not reused as in the case of CONV layers. This leads to performance degradation for FC layers and makes SCNN unattractive for CNNs dominated by FC layers. Cnvlutin [2] enhances the DaDianNao architecture to support zero-skipping in activations using a zero-step offset vector that is similar to graphics processor proposals [20, 19, 43, 36]. The limitation of this architecture is that the lengths of the offset vectors in different PEs may be different. Hence, the PEs may require different numbers of cycles to process the data. Thus, the PE with the longest offset vector becomes the performance bottleneck while the other PEs idle and wait for it. Cambricon-X is an accelerator that also employs the zero-step sparsity encoding method and uses a dedicated indexing module to select and transfer the needed neurons to PEs, with a reduced memory bandwidth requirement [74]. The PEs run asynchronously to avoid the idling problem of Cnvlutin. UCNN is an accelerator that improves CNN inference performance by exploiting weight repetition in the CONV layers [28]. It uses a factorized dot-product dataflow to reduce the number of multiplications and a memoization method to reduce weight memory accesses via weight repetition.

Both the CSC and zero-step encoding formats compress data by eliminating zero elements, and the accelerators discussed above efficiently process the compressed data. However, weight/activation sparsity can be exploited not only at the PE level but also at the bit level. Stripes, a bit-serial hardware accelerator, avoids the processing of zero prefix and suffix bits through serial-parallel multiplications on CNNs [34]. One bit of a neuron is processed at every cycle and zero bits are skipped on the fly. Multiple neurons are processed in parallel to mitigate the performance loss from bit-serial processing. Pragmatic, a CNN accelerator enhanced from Stripes, supports zero-bit skipping regardless of bit position [3]. However, it needs to convert the input neuron representation into a zero-bit-only format on the fly, which leads to up to a 16-cycle latency.
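As an illustration of the zero-step encoding discussed above, a vector can be stored as its nonzero values plus, for each value, the number of zeros since the previous nonzero element. This is a simplified run-length-style sketch with hypothetical function names; the exact hardware formats of SCNN, Cnvlutin, and Cambricon-X differ in detail:

```python
def zero_step_encode(vec):
    """Encode a dense vector as (values, offsets):
    each nonzero element is stored together with the count of
    zeros since the previous nonzero element."""
    values, offsets = [], []
    zeros = 0
    for v in vec:
        if v == 0:
            zeros += 1          # skip the zero, remember how many
        else:
            values.append(v)
            offsets.append(zeros)
            zeros = 0
    return values, offsets

def zero_step_decode(values, offsets, length):
    """Reconstruct the original dense vector."""
    out = [0] * length
    i = -1
    for v, z in zip(values, offsets):
        i += z + 1              # advance past z zeros to the next slot
        out[i] = v
    return out
```

For example, `[0, 0, 5, 0, 3, 0, 0, 7]` encodes to values `[5, 3, 7]` with offsets `[2, 1, 2]`, and only the three nonzero elements ever reach a PE.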
2.3 Low-precision CNN training algorithms
The rapid evolution of CNNs in recent years has necessitated the deployment of large-scale distributed training using high-performance computing infrastructure [59, 77, 7]. Even with such a powerful computing infrastructure, training a CNN to convergence usually takes several days, sometimes even a few weeks. Hence, to speed up the CNN training process, various training algorithms with low-precision computations have been proposed.
Single-precision floating-point (FP32) operations have mainly been used as the training standard on GPUs. Meanwhile, efforts have been made to train CNNs with half-precision floating-point (FP16) arithmetic since it can improve training throughput by 2×, in theory, on the same computing infrastructure. However, compared to FP32, FP16 involves rounding off gradient values and quantizing to a lower-precision representation. This introduces noise into the training process and delays CNN convergence. To maintain a balance between the convergence rate and training throughput, mixed-precision training algorithms that use a combination of FP32 and FP16 have been proposed [32, 41]. The FP16 representation is used in the most compute-intensive multiplications and the results are accumulated into FP32 outputs. Dynamic loss scaling is required to prevent the vanishing gradient problem [30].

Compared to floating-point arithmetic, fixed-point operations are much faster and more energy-efficient on hardware accelerators, but have a lower dynamic range. To overcome the dynamic range limitation, the dynamic fixed-point format [67] is used in CNN training [12, 14]. Unlike the regular fixed-point format, the dynamic fixed-point format uses multiple scaling factors that are updated during training to adjust the dynamic range of different groups of variables. The CNN training convergence rate is highly sensitive to the rounding scheme used in fixed-point arithmetic [23]. Instead of tuning the dynamic range used in the dynamic fixed-point format, a stochastic rounding method has been proposed to leverage the noise tolerance of CNN algorithms [23]. CNNs are trained in a manner in which the rounding error is exposed to the network and the weights are updated accordingly to mitigate this error, without impacting the convergence rate.

2.4 Efficient on-chip memory interface and emerging NVRAM technologies
CNN training involves feeding vast input feature maps and filter weights to the accelerator computing units to compute the error gradients used to update CNN weights in backpropagation. Besides the large memory size required to store all the CNN weights, a high memory bandwidth becomes indispensable to keep running the computing units at full throughput. Hence, through-silicon via (TSV)-based 3D memory interfaces have been used on high-end GPUs [11] and specialized CNN accelerators [69]. The most widely used TSV-based 3D memory interface is HBM. In each HBM package, multiple DRAM dies and one memory controller die are first fabricated and tested individually. Then, these dies are aligned, thinned, and bonded using TSVs. The HBM package is connected to the processor using an interposer in a 2.5D manner. This shortens the interconnects within the memory system and between the memory and the processor, thus reducing memory access latency and improving memory bandwidth. In addition, since more DRAM dies are integrated within the same footprint area, HBM enables smaller form factors: HBM2 uses 94% less space relative to GDDR5 for a 1GB memory [5].

Apart from improving the DRAM interface, the industry has also been exploring various NVRAM technologies to replace DRAM, such as ferroelectric RAM (FeRAM), spin-transfer torque magnetic RAM (STT-MRAM), phase-change memory (PCM), nanotube RAM (NRAM), and resistive RAM (RRAM). It has been shown in [70, 71] that RRAM can be used in an efficient 3D memory interface to deliver high memory bandwidth and energy efficiency. Information is represented by different resistance levels in an RRAM cell. Compared to a DRAM cell, an RRAM cell needs a higher current to change its resistance level. Therefore, the access transistors of an RRAM are larger than those of a DRAM [56]. However, DRAM is expected to reach its scaling limit at the 16nm node [63], whereas RRAM is believed to be suitable for sub-10nm nodes [13]. Hence, the smaller technology node of RRAM should offset the access transistor overhead. Besides, the nonvolatility of RRAM eliminates the need for the periodic refresh that DRAM requires. This not only saves energy and reduces latency, but also gets rid of the refresh circuitry used in DRAM.
3 Sparsity-aware reduced-precision accelerator architecture
In this section, we present the proposed architecture, SPRING: a sparsity-aware reduced-precision CNN accelerator for both training and inference. We first discuss the accelerator architecture design and then dive into sparsity-aware acceleration, reduced-precision processing, and the monolithic 3D NVRAM interface.
Fig. 2 shows a high-level view of the architecture. SPRING uses monolithic 3D integration to connect the accelerator tier with an RRAM interface. Unlike TSV-based 3D integration, monolithic 3D integration has only one substrate wafer, where devices are fabricated tier over tier. Hence, the alignment, thinning, and bonding steps of TSV-based 3D integration can be eliminated. In addition, tiers are connected through monolithic inter-tier vias (MIVs), whose diameter is the same as that of local vias and one to two orders of magnitude smaller than that of TSVs. This enables a much higher MIV density at the 14nm node [6], thus leaving much more space for logic. The accelerator tier is placed at the bottom, on top of which is the memory controller tier. Above the memory controller tier lie the multiple RRAM tiers.
Fig. 3 shows the organization of the accelerator tier. The control block handles the CNN configuration sent from the CPU. It fetches the instruction stream and controls the rest of the accelerator to perform acceleration. The activations and filter weights are brought on-chip from the RRAM system by a direct memory access (DMA) controller. Activations and weights are stored in the activation buffer and weight buffer, respectively, in a compressed format. Data compression relies on binary masks that are stored in a dedicated mask buffer. The compression scheme is discussed in Section 3.1. The compressed data and the associated masks are used by the PEs for CNN evaluation. The PEs operate in parallel to maximize overall throughput.
Fig. 4 shows the main components of a PE. The compressed data are buffered by the activation FIFO and weight FIFO. Then, they enter the pre-compute sparsity module along with the binary masks. Multiple multiplier-accumulator (MAC) lanes are used to compute convolutions or matrix-vector multiplications using zero-free activations and weights after they are preprocessed by the pre-compute sparsity module. The output results go through a post-compute sparsity module to maintain the zero-free format. Batch normalization operations [31] are used in modern CNNs to reduce the internal covariate shift. They are executed in the batch normalization module, which supports both the forward and backward passes of batch normalization. Three pooling methods are supported by the pooling module: max pooling, min pooling, and mean pooling. The reshape module deals with matrix transpose and data reshaping. Element-wise arithmetic, such as element-wise addition and subtraction, is handled by the scalar module. Lastly, a dedicated loss module is used to process various loss functions, such as the L1 loss, L2 loss, softmax, etc.
3.1 Sparsity-aware acceleration
Traditional accelerator designs can only process dense data and do not support sparse-encoded computation. They treat zero elements in the same manner as regular data and thus perform operations that have no impact on the CNN evaluation results. In this case, weight/activation sparsity cannot be used to speed up computation or reduce the memory footprint. In order to utilize sparsity to skip ineffectual activations and weights, and reduce the memory footprint, SPRING uses a binary-mask scheme to encode the sparse data and performs computations directly in the encoded format.
Compared to the regular dense format, SPRING compresses data vectors by removing all the zero elements. In order to retain the shape of the uncompressed data, an extra binary mask is used. The binary mask has the same shape as the uncompressed data, where each bit in the mask is associated with one element in the original data vector. Fig. 5 shows an example of the binary-mask scheme that SPRING uses to compress activations and weights. The original uncompressed data vector has 16 elements; if each element is represented using 16 bits, the total data length is 256 bits. With the binary-mask scheme, only the six nonzero elements remain. The total length of the compressed data vector and the binary mask is 6 × 16 + 16 = 112 bits, which leads to a compression ratio of approximately 2.3 for this example.
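The binary-mask scheme can be sketched as follows. This is a minimal software model; the data values below are made up for illustration, but the bit counts match the 16-element, 16-bit example above:

```python
def compress(vec, bits_per_elem=16):
    """Compress a vector into (nonzero values, binary mask).
    The mask holds one bit per original element, so the original
    shape can always be reconstructed."""
    mask = [1 if v != 0 else 0 for v in vec]
    values = [v for v in vec if v != 0]
    dense_bits = len(vec) * bits_per_elem
    compressed_bits = len(values) * bits_per_elem + len(mask)
    return values, mask, dense_bits / compressed_bits

# A hypothetical 16-element vector with 6 nonzero elements,
# mirroring the Fig. 5 example: 256 dense bits vs. 6*16 + 16 = 112.
vec = [7, 0, 0, 2, 0, 9, 0, 0, 0, 4, 0, 0, 1, 0, 0, 3]
values, mask, ratio = compress(vec)
```

Here `ratio` evaluates to 256/112, i.e., the compression ratio of about 2.3 quoted above; note that the ratio grows with sparsity, since only the one-bit-per-element mask is a fixed cost.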
We implement the binary-mask scheme using a low-overhead pre-compute sparsity module that preprocesses the sparse-encoded activations and weights and provides zero-free data to the MAC lanes. After the output data traverse the MAC lanes, another post-compute sparsity module is used to remove all the zero elements generated by the activation function before storing them back to on-chip memory. Fig. 6 shows the pre-compute sparsity module that takes the zero-free data vectors and binary mask vectors as inputs, and generates an output mask as well as zero-free activations/weights for the MAC lanes. The output binary mask indicates the common indexes of nonzero elements in both the activation and weight vectors. After being preprocessed by the pre-compute sparsity module, the "dangling" nonzero elements in the activation and weight data vectors are removed. The dangling nonzero activations refer to the nonzero elements in the activation data vector whose corresponding weights at the same index are zeros, and vice versa.
Fig. 7(a) shows the mask generation process used by the pre-compute sparsity module. The output mask is the AND of the activation and weight masks. The output mask, together with the activation and weight masks, is then fed to two XOR gates to generate the filter masks. Fig. 7(b) shows the dangling-data filtering process using the three masks obtained in the previous step. The sequential scanning and filtering mechanism for one type of data used in the filtering step is shown in Algorithm 1. The data vector, as well as the two mask vectors, is scanned in sequence. At each step, a 1 in the output mask implies a common nonzero index. Hence, the corresponding element in the data vector passes through the filter. On the other hand, if a 0 appears in the output mask and the corresponding bit in the filter mask is 1, then a dangling nonzero element is detected in the data vector and is blocked by the filter. If both the output mask bit and the filter mask bit are 0, the data elements at this index in both the activation and weight vectors are zeros and thus already skipped. After filtering out the dangling elements in the activations and weights, a zero-collapsing shifter is used to remove the zeros and keep the data vectors zero-free in a similar sequential scanning manner, as shown in Fig. 7(c). These zero-free activations and weights are then fed to the MAC lanes for computation. Since only zero-free data are used in the MAC lanes, ineffectual computations are completely skipped, thus improving throughput and saving energy.
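The mask generation of Fig. 7(a) and the scanning of Algorithm 1 can be modeled in software roughly as follows. This is a functional sketch with our own function names; the hardware performs these steps with AND/XOR gates and a zero-collapsing shifter:

```python
def precompute_sparsity(act_vals, act_mask, wgt_vals, wgt_mask):
    """Align compressed activations and weights.
    act_vals/wgt_vals hold only the nonzero elements;
    act_mask/wgt_mask mark their positions in the dense vector."""
    # Fig. 7(a): output mask = activation mask AND weight mask.
    # (The filter masks are the XOR of each input mask with the
    # output mask; here they are implicit in the scan below.)
    out_mask = [a & w for a, w in zip(act_mask, wgt_mask)]

    def collapse(vals, mask):
        # Fig. 7(b)/(c): scan the masks in sequence, pass elements
        # at common nonzero indexes, drop dangling ones, and keep
        # the surviving data zero-free.
        out, it = [], iter(vals)
        for present, keep in zip(mask, out_mask):
            if present:
                v = next(it)
                if keep:
                    out.append(v)
        return out

    return out_mask, collapse(act_vals, act_mask), collapse(wgt_vals, wgt_mask)
```

For dense vectors [5, 0, 3, 2] and [4, 6, 0, 8], the module keeps only the pairs (5, 4) and (2, 8), so the MAC lanes compute the same dot product as the dense version while skipping every ineffectual multiplication.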
3.2 Reduced-precision processing using stochastic rounding
SPRING processes CNNs using fixed-point numbers with reduced precision. Every time a new result is generated, it first has to be rounded to the nearest discrete number, either in a floating-point or a fixed-point representation. Since the gap between adjacent numbers in the fixed-point representation is much larger than that in the floating-point representation, the resulting quantization error in the former is much more pronounced. This prevents the fixed-point representation from being used in error-sensitive CNN training. In order to utilize the faster and more energy-efficient fixed-point arithmetic units, we adopt the stochastic rounding method proposed in [23]. The traditional deterministic rounding scheme always rounds a real number to its nearest discrete number, as shown in Eq. (3). We follow the definitions used in [23], where $\epsilon$ denotes the smallest positive discrete number supported in the fixed-point format and $\lfloor x \rfloor$ is defined as the largest integer multiple of $\epsilon$ less than or equal to $x$.
$$Round(x) = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon & \text{if } \lfloor x \rfloor + \frac{\epsilon}{2} < x \le \lfloor x \rfloor + \epsilon \end{cases} \quad (3)$$
In contrast, in the stochastic rounding scheme, a real number $x$ is rounded to $\lfloor x \rfloor$ or $\lfloor x \rfloor + \epsilon$ stochastically, as shown in Eq. (4) [23]. It is shown in [23] that with the stochastic rounding scheme, the CNN weights can be trained to tolerate the quantization noise without increasing the number of cycles required for convergence.
$$Round(x) = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - \frac{x - \lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon & \text{with probability } \frac{x - \lfloor x \rfloor}{\epsilon} \end{cases} \quad (4)$$
The stochastic rounding scheme is embedded in the MAC lane, as shown in Fig. 8. Activations and weights are represented as fixed-point numbers using IL+FL bits, where IL denotes the number of bits in the integer part and FL denotes the number of bits in the fraction part. The zero-free activations and weights from the pre-compute sparsity module are multiplied in the MAC lanes, where the products are represented with 2IL integer bits and 2FL fractional bits to prevent overflow. Accumulations over the products are also performed using 2(IL+FL) bits. Then, a stochastic rounding module is used to reduce the numerical precision before applying the activation function or storing the result back to on-chip memory. We use a linear-feedback shift register (LFSR) to generate the pseudo-random numbers needed for stochastic rounding.
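A software model of Eq. (4) might look as follows. This is a sketch assuming a fixed-point grid with step eps = 2^-FL; SPRING generates its random numbers with an LFSR, for which Python's pseudo-random generator stands in here:

```python
import random

def stochastic_round(x, FL=16):
    """Round x onto the fixed-point grid with step eps = 2**-FL,
    following Eq. (4): round down with probability
    1 - (x - floor(x))/eps and up with probability (x - floor(x))/eps."""
    eps = 2.0 ** -FL
    floor_x = (x // eps) * eps      # largest multiple of eps <= x
    frac = (x - floor_x) / eps      # position within the grid cell, in [0, 1)
    if random.random() < frac:
        return floor_x + eps
    return floor_x
```

The key property is unbiasedness: the expected value of `stochastic_round(x)` is `floor_x + eps*frac = x`, so the quantization noise averages out over many weight updates instead of accumulating, which is why training can converge even at low precision.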
3.3 Monolithic 3D NVRAM interface
SPRING uses the monolithic 3D NVRAM interface previously proposed in [70] and adapts it to its 3D architecture to provide the accelerator tier with significant memory bandwidth. As shown in Fig. 2, SPRING uses two memory channels, where each channel has its own memory controller that controls the associated two RRAM ranks. An ultra-wide memory bus (1 KB wide) is used in each channel, since the interconnects between SPRING and the memory controllers, and between the memory controllers and RRAM ranks, are implemented with vertical MIVs. This on-chip memory bus not only reduces the access latency relative to a conventional off-chip memory bus, but also makes row-wide granular memory accesses possible, enabling energy savings. In addition, the column decoder can be removed to reduce the access latency and power dissipation in this row-wide access granularity scheme. To reduce repeated accesses to the same row, especially the energy-consuming write accesses of RRAM, the row buffer is reused as the write buffer. A dirty bit is used to indicate whether the corresponding row entry in the row buffer needs to be written back to the RRAM array when flushed out. The read and write accesses are decoupled by adding another set of vertical interconnects, as shown in Fig. 9 [71]. Hence, the slower write accesses do not block the faster read accesses, and thus a higher memory bandwidth is achieved. In addition, RRAM nonvolatility not only enables the elimination of bulky periodic refresh circuitry, but also allows the RRAM arrays to be powered down in idle intervals to reduce leakage power. A rank-level adaptive power-down policy is used to maintain a balance between performance and energy savings: the power-down threshold for each RRAM rank is adapted to its idling pattern so that a rank is only powered down if it is expected to be idle for a long time.
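The row-buffer-as-write-buffer policy with a dirty bit can be modeled behaviorally as follows. This is an illustrative sketch, not the memory controller design; the class and counter names are ours:

```python
class RowBuffer:
    """Row buffer reused as a write buffer: writes are absorbed in
    the buffer and a dirty bit defers the energy-consuming RRAM
    write-back until the row is evicted or flushed."""
    def __init__(self, array):
        self.array = array          # dict: row id -> row data (list)
        self.row_id = None          # currently buffered row
        self.data = None
        self.dirty = False
        self.writebacks = 0         # counts costly RRAM array writes

    def _load(self, row_id):
        if self.row_id == row_id:
            return                  # row hit: no array access needed
        self.flush()                # evict the old row first
        self.row_id = row_id
        self.data = list(self.array[row_id])

    def flush(self):
        if self.dirty:              # write back only if modified
            self.array[self.row_id] = list(self.data)
            self.writebacks += 1
            self.dirty = False

    def read(self, row_id, col):
        self._load(row_id)
        return self.data[col]

    def write(self, row_id, col, value):
        self._load(row_id)
        self.data[col] = value
        self.dirty = True           # defer the RRAM write
```

Repeated writes to the same row then cost a single deferred RRAM write-back instead of one array write per access, which is the point of reusing the row buffer as a write buffer.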
4 Simulation methodology
In this section, we present the simulation flow for SPRING and the experimental setup.
Fig. 10 shows the simulation flow used to evaluate the proposed SPRING accelerator architecture. We implement the components of SPRING at the register-transfer level (RTL) in SystemVerilog to estimate delay, power, and area. The RTL design is synthesized by Design Compiler [60] using a 14nm FinFET technology library [22]. Floorplanning is done by Capo [50], an open-source floorplacer. On-chip buffers are modeled using FinCACTI [54], a cache modeling tool enhanced from CACTI [42] to support deeply scaled FinFETs at the 14nm technology node. The monolithic 3D RRAM system is modeled with NVSim [16], a circuit-level memory simulator for emerging NVRAMs, and NVMain [46], an emerging NVRAM architecture simulator. The synthesis results, together with the buffer and RRAM estimates, are then plugged into a customized cycle-accurate Python simulator. This accelerator simulator takes CNNs in the TensorFlow [1] Protocol Buffers format and estimates the computation latency, power dissipation, energy consumption, and area. SPRING treats the TensorFlow operations like complex instruction set computer instructions, where each operation involves many low-level operations.

We compare our design with the Nvidia GeForce GTX 1080 Ti GPU, which uses the Pascal microarchitecture [18] in a 16nm technology node. The die size of the GTX 1080 Ti is 471 mm² and the base operating frequency is 1.48 GHz, which can be boosted to 1.58 GHz. The GTX 1080 Ti uses an 11 GB GDDR5X memory with 484 GB/s memory bandwidth to provide 10.16 TFLOPS of peak single-precision performance.
We evaluate SPRING and the GTX 1080 Ti on seven well-known CNNs: Inception-ResNet V2 [61], Inception V3 [62], MobileNet V2 [53], NASNet-mobile [76], PNASNet-mobile [38], ResNet-152 V2 [27], and VGG-19 [58]. We evaluate both the training and inference phases of these CNNs on the ImageNet dataset [51]. We use the default batch sizes defined in the TensorFlow-Slim library [57]: 32 for training and 100 for inference.

5 Experimental results
In this section, we present experimental results for SPRING and compare them with those for GTX 1080 Ti.
Table I shows the values of the various design parameters used in SPRING. They are obtained through the accelerator design space exploration methodology proposed in [72]. It is shown in [23] that with 16 FL bits, training CNNs using the stochastic rounding scheme converges in a similar amount of time, with negligible accuracy loss, relative to when single-precision floating-point arithmetic is used. Hence, we use 4 IL bits and 16 FL bits in the fixed-point representation. The convolution loop order refers to the execution order of the multiple for-loops in the CONV layer. SPRING executes convolutions by first unrolling the for-loops across the multiple inputs in the batch. Then, it unrolls the for-loops within the filter weights, followed by unrolling in the activation channel dimension. In the next step, it unrolls the for-loops within the activation feature maps. Finally, it unrolls the for-loops across the output channels. At a similar technology node (14 nm vs. 16 nm), SPRING reduces chip area by 68% relative to the GTX 1080 Ti.
Accelerator parameters | Values
Clock rate | 700 MHz
Number of PEs | 64
Number of MAC lanes per PE | 72
Number of multipliers per MAC lane | 16
Weight buffer size | 24 MB
Activation buffer size | 12 MB
Mask buffer size | 4 MB
Convolution loop order | batch → weight → in channel → input → out channel
IL bits | 4
FL bits | 16
Technology | 14-nm FinFET
Area | 151 mm²
Monolithic 3D RRAM | 8 GB, 2 channels, 2 ranks, 16 banks, 1-KB bus, 0.5 ns, 2.0 [56]
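For reference, the stochastic rounding scheme of [23], applied to the 4-IL/16-FL fixed-point format of Table I, can be sketched as follows. This is an illustrative software re-implementation, not SPRING's hardware rounding unit: a value is rounded up with probability equal to its fractional residue, so the rounding is unbiased in expectation.

```python
# Sketch of stochastic rounding [23] for a signed fixed-point format
# with 4 integer-part (IL) and 16 fractional-part (FL) bits, as in
# Table I. Illustrative only; not SPRING's RTL implementation.

import math
import random

IL_BITS, FL_BITS = 4, 16
SCALE = 1 << FL_BITS                          # 2^16 fractional steps
MAX_Q = (1 << (IL_BITS + FL_BITS - 1)) - 1    # signed saturation bound

def stochastic_round_fixed(x):
    """Quantize x to signed fixed point with stochastic rounding."""
    scaled = x * SCALE
    floor_val = math.floor(scaled)
    frac = scaled - floor_val                 # residue in [0, 1)
    q = floor_val + (1 if random.random() < frac else 0)
    q = max(-MAX_Q - 1, min(MAX_Q, q))        # saturate to representable range
    return q / SCALE
```

Because the expected value of the rounded number equals the original, small gradient updates are preserved on average instead of being systematically rounded to zero, which is why this scheme allows low-precision training to converge.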
Fig. 11 and Fig. 12 show the normalized performance of SPRING and the GTX 1080 Ti over the seven CNNs in the training and inference phases, respectively. All results are normalized to those of the GTX 1080 Ti. In the training phase, SPRING achieves speedups ranging from 5.5× to 53.1×, with a geometric mean of 15.6×, on the seven CNNs. In the inference phase, SPRING is faster than the GTX 1080 Ti by 5.1× to 67.9×, with a geometric mean of 15.5×. In both cases, SPRING achieves better speedups on the relatively lightweight CNNs, i.e., MobileNet V2, NASNet-mobile, and PNASNet-mobile. This is because these lightweight CNNs do not require large volumes of activations and weights to be transferred between the external memory and on-chip buffers. Therefore, the memory bandwidth bottleneck is alleviated and the speedup from sparsity-aware computation becomes more pronounced. On the other hand, on large CNNs, such as Inception-ResNet V2 and VGG-19, the sparsity-aware MAC lanes of SPRING idle while waiting for data to be fetched from the RRAM system, lowering the speedup relative to the GTX 1080 Ti.

Fig. 13 and Fig. 14 show the normalized reciprocal of power dissipation of SPRING and the GTX 1080 Ti in training and inference, respectively. All results are normalized to those of the GTX 1080 Ti. On average, SPRING reduces power dissipation by 4.2× and 4.5× for training and inference, respectively.
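The reported averages are geometric means, the standard way to aggregate speedup ratios across benchmarks so that no single network dominates the summary. A minimal helper (the sample values below are placeholders, not the per-network measurements from the figures):

```python
# Geometric mean of speedup ratios: exp of the mean of logs. Shown only
# to illustrate how the reported averages (e.g., the 15.6x training mean)
# are aggregated; per-network inputs are not reproduced here.

import math

def geomean(speedups):
    """Geometric mean of a non-empty list of positive ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```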
Fig. 15 and Fig. 16 show the normalized energy efficiency of SPRING and the GTX 1080 Ti for training and inference, respectively. All results are normalized to those of the GTX 1080 Ti. Compared to the GTX 1080 Ti, SPRING achieves average energy efficiency improvements of 66.0× and 69.1× in training and inference, respectively, which makes the GTX 1080 Ti columns nearly invisible in these figures. We observe that, among the seven CNNs, SPRING achieves the best normalized energy efficiency on MobileNet V2, in both the training and inference phases. Since MobileNet V2 has a much smaller network size (97.6% parameter reduction compared to VGG-19 [53]), most of the network weights can be retained in on-chip buffers without accessing the external memory. Hence, SPRING can reduce energy consumption significantly through our sparsity-aware acceleration scheme. On the other hand, the energy reduction from sparsity-aware computation is offset by energy-consuming memory accesses on large CNNs, such as Inception-ResNet V2 and VGG-19. This is consistent with the results reported in [74], which show that over 80% of the total energy consumption comes from memory accesses.
6 Discussion and limitations
In this section, we discuss the assumptions we made in this work and the limitations of the SPRING architecture.
The performance speedup, power reduction ratio, and energy efficiency improvement reported in Section 5 are obtained at the batch level. We use batch-level training results because they rest on the assumption that, with sufficient precision bits, fixed-point training with the stochastic rounding scheme converges in no more training iterations than training based on single-precision floating-point arithmetic, as suggested in [23], where 16 FL bits are used for fixed-point training with stochastic rounding and the number of epochs to convergence is similar to that of single-precision floating-point training.
A major limitation of the SPRING accelerator architecture is that the sequential scanning and filtering mechanism shown in Algorithm 1 needs multiple cycles to filter out dangling non-zero elements and collapse the resulting zeros. This may incur a long data pre-processing latency, which makes SPRING less suitable for latency-sensitive edge inference applications. However, since this sequential scanning and filtering scheme is pipelined, the overall throughput is unaffected, and the total latency for one batch is therefore independent of the number of sequential scanning steps used by the pre-compute sparsity module.
Our binary mask encoding method is similar to the dual indexing encoding proposed in [37]. Although both schemes use a binary mask to index the non-zero elements in the data vector, our binary mask encoding scheme has several advantages. First, the index masks are kept in binary form throughout the entire sparsity encoding and decoding process. Hence, the storage overhead of the binary mask is at most 5%, assuming 4 IL bits and 16 FL bits (one mask bit per 20-bit element). The real storage overhead is much lower than this bound, since most of the activations and weights are zeros. In contrast, the binary masks in [37] are converted to decimal masks to serve as the select signals of a MUX. This not only increases the storage overhead of the masks, but also increases the computational complexity of mask manipulation. Moreover, their binary-to-decimal mask conversion process is sequential, which incurs a processing latency that grows with the size of the mask vector.
7 Conclusion
In this article, we proposed a sparsity-aware reduced-precision CNN accelerator, named SPRING. A binary mask scheme is used to encode weight/activation sparsity; it is processed efficiently through a sequential scanning and filtering mechanism. SPRING adopts the stochastic rounding algorithm to train CNNs using a reduced-precision fixed-point numerical representation. An efficient monolithic 3D NVRAM interface is used to provide significant memory bandwidth for CNN evaluation. Compared to the Nvidia GeForce GTX 1080 Ti, SPRING achieves 15.6×, 4.2×, and 66.0× improvements in performance, power reduction, and energy efficiency, respectively, in the training phase, and 15.5×, 4.5×, and 69.1× improvements in performance, power reduction, and energy efficiency, respectively, in the inference phase.
References
[1] (2016) TensorFlow: a system for large-scale machine learning. In Proc. USENIX Symp. Operating Syst. Design Implementation, pp. 265–283.
[2] (June 2016) Cnvlutin: ineffectual-neuron-free deep neural network computing. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 1–13.
[3] (Oct. 2017) Bit-pragmatic deep neural network computing. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 382–394.
[4] (Oct. 2016) Fused-layer CNN accelerators. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12.
[5] (2015) Website.
[6] (May 2014) 3D sequential integration opportunities and technology optimization. In Proc. IEEE Int. Interconnect Technology Conf., pp. 373–376.
[7] (May 2017) Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster. Procedia Computer Science 108, pp. 315–324.
[8] (Mar. 2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. Int. Conf. Architectural Support Programming Languages Operating Syst., pp. 269–284.
[9] (June 2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 367–379.
[10] (Dec. 2014) DaDianNao: a machine-learning supercomputer. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 609–622.
[11] (Mar. 2018) Volta: performance and programmability. IEEE Micro 38 (2), pp. 42–52.
[12] (May 2015) Training deep neural networks with low precision multiplications. In Proc. Int. Conf. Learning Representations.
[13] (2016) Website.
[14] (May 2018) Mixed precision training of convolutional neural networks using integer operations. In Proc. Int. Conf. Learning Representations.
[15] (Oct. 2017) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 395–408.
[16] (July 2012) NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 31 (7), pp. 994–1007.
[17] (June 2015) ShiDianNao: shifting vision processing closer to the sensor. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 92–104.
[18] (Mar. 2017) Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro 37 (2), pp. 7–17.
[19] (Feb. 2011) Thread block compaction for efficient SIMT control flow. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 25–36.
[20] (Dec. 2007) Dynamic warp formation and scheduling for efficient GPU control flow. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 407–420.
[21] (2017) TETRIS: scalable and efficient neural network acceleration with 3D memory. In Proc. Int. Conf. Architectural Support Programming Languages Operating Syst., pp. 751–764.
[22] (Oct. 2018) Hybrid monolithic 3D IC floorplanner. IEEE Trans. Very Large Scale Integration Syst. 26 (10), pp. 1868–1880.
[23] (July 2015) Deep learning with limited numerical precision. In Proc. Int. Conf. Machine Learning, pp. 1737–1746.
[24] (June 2016) EIE: efficient inference engine on compressed deep neural network. In Proc. Int. Symp. Computer Architecture, pp. 243–254.
[25] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[26] (2015) Learning both weights and connections for efficient neural networks. In Proc. Int. Conf. Neural Information Processing Syst., pp. 1135–1143.
[27] (Oct. 2016) Identity mappings in deep residual networks. In Proc. European Conf. Computer Vision, pp. 630–645.
[28] (June 2018) UCNN: exploiting computational reuse in deep neural networks via weight repetition. In Proc. Int. Symp. Computer Architecture, pp. 674–687.
[29] (2014) Website.
[30] (Apr. 1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. Journal Uncertainty, Fuzziness Knowledge-Based Syst. 6 (2), pp. 107–116.
[31] (July 2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Machine Learning, pp. 448–456.
[32] (July 2018) Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. arXiv preprint arXiv:1807.11205.

[33] (June 2017) In-datacenter performance analysis of a tensor processing unit. In Proc. Int. Symp. Computer Architecture, pp. 1–12.
[34] (Oct. 2016) Stripes: bit-serial deep neural network computing. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12.
[35] (Dec. 2012) ImageNet classification with deep convolutional neural networks. In Proc. Int. Conf. Neural Information Processing Syst., pp. 1097–1105.
[36] (Feb. 2013) Convergence and scalarization for data-parallel architectures. In Proc. IEEE/ACM Int. Symp. Code Generation Optimization, pp. 1–11.
[37] (Jan. 2018) Supporting compressed-sparse activations and weights on SIMD-like accelerator for sparse convolutional neural networks. In Proc. Asia South Pacific Design Automation Conf., pp. 105–110.
[38] (Sep. 2018) Progressive neural architecture search. In Proc. European Conf. Computer Vision, pp. 19–34.
[39] (Feb. 2017) FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 553–564.
[40] (2017) Modeling the resource requirements of convolutional neural networks on mobile devices. In Proc. ACM Int. Conf. Multimedia, pp. 1663–1671.
[41] (May 2018) Mixed precision training. In Proc. Int. Conf. Learning Representations.
[42] (2009) CACTI 6.0: a tool to model large caches. HP Laboratories, pp. 22–31.
[43] (Dec. 2011) Improving GPU performance via large warps and two-level warp scheduling. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 308–317.
[44] (June 2017) SCNN: an accelerator for compressed-sparse convolutional neural networks. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 27–40.
[45] (Mar. 2016) FPGA based implementation of deep neural networks using on-chip memory only. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 1011–1015.
[46] (July 2015) NVMain 2.0: a user-friendly memory simulator to model (non-)volatile memory systems. IEEE Comput. Archit. Lett. 14 (2), pp. 140–143.
[47] (2014) A reconfigurable fabric for accelerating large-scale datacenter services. In Proc. Int. Symp. Computer Architecture, pp. 13–24.
[48] (2016) Going deeper with embedded FPGA platform for convolutional neural network. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 26–35.
[49] (Feb. 2018) Compressing DMA engine: leveraging activation sparsity for training deep neural networks. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 78–91.
[50] (Apr. 2005) Capo: robust and scalable open-source min-cut floorplacer. In Proc. Int. Symp. Physical Design, pp. 224–226.
[51] (2015) ImageNet Large Scale Visual Recognition Challenge. Int. Journal Computer Vision 115 (3), pp. 211–252.
[52] (2016) Website.

[53] (June 2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 4510–4520.
[54] (July 2014) FinCACTI: architectural analysis and modeling of caches with deeply-scaled FinFET devices. In Proc. IEEE Computer Society Annual Symp. VLSI, pp. 290–295.
[55] (Aug. 2016) Overcoming resource underutilization in spatial CNN accelerators. In Proc. Int. Conf. Field Programmable Logic Applications, pp. 1–4.
[56] (June 2009) A 5-ns fast write multi-level nonvolatile 1 K bits RRAM memory with advance write scheme. In Proc. Symp. VLSI Circuits, pp. 82–83.
[57] (2016) TensorFlow-Slim library. Website.
[58] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[59] (Sep. 2015) Scalable distributed DNN training using commodity GPU cloud computing. In Proc. Conf. Int. Speech Communication Association.
[60] (2018) Design Compiler. Website.

[61] (Feb. 2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proc. AAAI Conf. Artificial Intelligence.
[62] (June 2016) Rethinking the Inception architecture for computer vision. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 2818–2826.
[63] (May 2013) Website.
[64] (June 2017) ScaleDeep: a scalable compute architecture for learning and evaluating deep networks. In Proc. Int. Symp. Computer Architecture, pp. 13–26.
[65] (2003) Automatic performance tuning of sparse matrix kernels. Ph.D. thesis, University of California, Berkeley.
[66] (Mar. 2017) Chain-NN: an energy-efficient 1D chain architecture for accelerating deep convolutional neural networks. In Proc. Design, Automation Test Europe Conf. Exhibition, pp. 1032–1037.
[67] (May 1991) Dynamically scaled fixed point arithmetic. In Proc. IEEE Pacific Rim Conf. Communications, Computers Signal Processing, pp. 315–318, vol. 1.

[68] (June 2018) Accelerating CNN algorithm with fine-grained dataflow architectures. In Proc. IEEE Int. Conf. High Performance Computing Communications; IEEE Int. Conf. Smart City; IEEE Int. Conf. Data Science Syst., pp. 243–251.
[69] (Dec. 2018) Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992.
[70] (July 2018) Energy-efficient monolithic three-dimensional on-chip memory architectures. IEEE Trans. Nanotechnology 17 (4), pp. 620–633.
[71] (Oct. 2018) A monolithic 3D hybrid architecture for energy-efficient computation. IEEE Trans. Multi-Scale Computing Syst. 4 (4), pp. 533–547.
[72] (2019) Software-defined design space exploration for an efficient AI accelerator architecture. arXiv preprint arXiv:1903.07676.
[73] (Nov. 2016) Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 1–8.
[74] (Oct. 2016) Cambricon-X: an accelerator for sparse neural networks. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12.
[75] (Mar. 2018) SparseNN: an energy-efficient neural network accelerator exploiting input and output sparsity. In Proc. Design, Automation Test Europe Conf. Exhibition, pp. 241–244.
[76] (June 2018) Learning transferable architectures for scalable image recognition. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 8697–8710.
[77] (Oct. 2017) Distributed training large-scale deep architectures. In Proc. Int. Conf. Advanced Data Mining Applications, pp. 18–32.