Convolutional neural networks (CNNs) excel at various important applications, e.g., image classification, image segmentation, robotics control, and natural language processing. However, their high computational complexity necessitates specially-designed accelerators for efficient processing. Training of CNNs requires an enormous amount of computing power to automatically learn the weights based on a large training dataset. Few ASIC-based CNN training accelerators have been presented [10, 64, 69]. However, graphics processing units (GPUs) typically play a dominant role in the training phase, as CNN computation maps well to their single-instruction multiple-data (SIMD) units, and the large number of SIMD units present in GPUs provides significant computational throughput for training CNNs [18, 11]. In addition, the higher clock speed, bandwidth, and power management capabilities of Graphics Double Data Rate (GDDR) memory relative to regular DDR memory make GPUs the de facto accelerator choice for CNN training. On the other hand, CNN inference is more latency- and power-sensitive, as an increasing number of applications need real-time CNN evaluations on battery-constrained edge devices. Hence, ASIC- and FPGA-based accelerators have been widely explored for this purpose. However, they can only process low-level CNN operations, such as convolution and matrix multiplication, and lack the flexibility of a general-purpose processor. Although CNN models have evolved rapidly in recent years, their fundamental building blocks are common and long-lasting. Therefore, ASIC- and FPGA-based accelerators can efficiently process new CNN models with their domain-specific architectures. FPGA-based accelerators achieve faster time-to-market and enable prototyping of new accelerator designs. Microsoft has used customized FPGA boards, called Catapult, in its data centers to accelerate Bing ranking by 2×.
An FPGA-based CNN accelerator that uses on-chip memory has been proposed in , where a fixed-point representation is used to keep all the weights stored in on-chip memory, thus avoiding the need to access external memory. To improve the dynamic resource utilization of FPGA-based CNN accelerators, multiple accelerators, each specialized for a specific CNN layer, have been constructed using the same FPGA hardware resources . A convolver design for both the convolutional (CONV) layer and fully-connected (FC) layer has been proposed in  to efficiently process CNNs on embedded FPGA platforms. ASIC-based CNN accelerators have better energy efficiency and can be fully customized for CNN applications. In , CNNs are mapped entirely within the on-chip memory and the ASIC accelerator is placed close to the image sensor so that all DRAM accesses are eliminated, leading to a 60× energy efficiency improvement relative to previous works. A 1D chain ASIC architecture is used in  to accelerate the CONV layers since these layers are the most compute-intensive. To speed up the CONV layers, Fast Fourier Transform-based fast multiplication is used in . This accelerator encodes weights using block-circulant matrices and converts convolutions into matrix multiplications, reducing the computational complexity from $O(n^2)$ to $O(n \log n)$ and the storage complexity from $O(n^2)$ to $O(n)$. A 3D memory system is used in  to reduce memory bandwidth pressure. This enables more chip area to be devoted to processing elements (PEs), thus increasing performance and energy efficiency.
To take advantage of the underlying parallel computing resources of CNN accelerators, an efficient dataflow is necessary to minimize data movement between the on-chip memory and PEs. Unlike the temporal architectures, such as SIMD or single-instruction multiple-thread, used in central processing units (CPUs) and GPUs, the Google Tensor Processing Unit (TPU) uses a spatial architecture called the systolic array. Data flow into the arithmetic logic units (ALUs) in a wave and move through adjacent ALUs in order to be reused. In , multiple CNN layers are fused and processed so that intermediate data can be kept in on-chip memory, thus obviating the need for external memory access. A fine-grained dataflow accelerator is proposed in . It converts convolution into data preprocessing and matrix multiplication. Data are directly transferred among PEs, without the need for redundant control logic, as opposed to temporal architectures, such as DianNao . A dataflow is proposed in [9] to reuse data and minimize data movement on a spatial architecture.
Although various dataflow styles and computational parallelism designs have been explored in recent works, potential speedup from weight/activation sparsity is still underexplored. The computation and memory footprint of CNNs can be significantly reduced if sparsity is exploited during network evaluations. Some recent works utilize sparsity to speed up CNN evaluations [24, 75, 44, 28, 2, 74]. However, they only consider either activation or weight sparsity, and only use sparsity during CNN inference based on various pruning methods. It has been shown that the average network-wide activation sparsity of the well-known AlexNet CNN  during its entire training process is 62% (a maximum of 93%) . Therefore, the training process can be significantly accelerated if sparsity is exploited. Another CNN acceleration technique is to use reduced precision to improve performance and energy efficiency. For example, TPU and DianNao use 8-bit and 16-bit fixed-point quantizations, respectively, in CNN evaluations. However, low-precision accelerators are currently mainly used in the inference phase, since CNN training involves gradient computation and propagation that require high-precision floating-point operations to achieve high accuracy. Apart from improving the efficiency of the computational resources employed in CNN accelerators, CNN training also requires a large memory bandwidth to store activations and weights. In the forward pass, activations must be retained in the memory until the backward pass commences in order to compute the error gradients and update weights. Besides, in order to fill the SIMD units of GPUs, a large amount of data is needed from the memory. Hence, 3D memory systems, such as hybrid memory cube (HMC)  and high bandwidth memory (HBM) , have been used in high-end GPUs to provide significant memory bandwidth for CNN training.
In this article, we make the following contributions:
1) We propose a novel sparsity-aware CNN accelerator architecture, called SPRING. It encodes activation and weight sparsities with binary masks and uses efficient low-overhead hardware implementations for CNN training and inference.
2) SPRING uses reduced-precision fixed-point operations for both training and inference. A dedicated module is used to implement the stochastic rounding algorithm  to prevent accuracy loss during CNN training.
3) SPRING uses an efficient monolithic 3D nonvolatile RAM (NVRAM) interface to provide significant memory bandwidth for CNN processing. This alleviates the performance bottleneck in CNN training since the training process is usually memory-bound .
We test the proposed SPRING architecture on seven well-known CNNs in the context of both training and inference. Simulation results show that the average execution time, power dissipation, and energy consumption are reduced by 15.6×, 4.2×, and 66.0×, respectively, for CNN training, and 15.5×, 4.5×, and 69.1×, respectively, for inference, relative to the Nvidia GeForce GTX 1080 Ti.
The rest of the article is organized as follows. Section 2 discusses the background information required to understand our sparsity-aware accelerator. Section 3 presents the sparsity-aware reduced-precision accelerator architecture. Section 4 describes our simulation setup and flow. Section 5 presents experimental results obtained on seven typical CNNs. Section 6 discusses the limitations of our work. Section 7 concludes the article.
In this section, we discuss the background material necessary for understanding our proposed sparsity-aware reduced-precision accelerator architecture. We first give a primer on CNNs. We then discuss existing sparsity-aware designs. Then, we discuss various CNN training algorithms that use low numerical precision. Finally, we describe an efficient on-chip memory interface that is used for CNN acceleration.
2.1 CNN overview
Although different CNNs have different hyperparameters, such as the number of layers and layer shapes, they share a similar architecture, as shown in Fig. 1. CNNs are generally composed of five building blocks: CONV layers, activation (ACT) layers, pooling (POOL) layers, batch normalization layers (not shown in Fig. 1), and FC layers. Among these basic components, the CONV and FC layers are the most compute-intensive . We describe them next.
CONV layers: a batch of 3D input feature maps are convolved with a set of 3D filter weights to generate a batch of 3D output feature maps. The filter weights are usually fetched from external memory once and stored in on-chip memory as they are shared among multiple convolution windows. Therefore, CONV layers have relatively low memory bandwidth pressure and are usually compute-bound as they require a large number of convolution computations. Given the input feature map I and filter weights W, the output feature map O is computed as follows:
$$O[n][m][x][y] = \sum_{c=1}^{C}\sum_{i=1}^{H_f}\sum_{j=1}^{W_f} I[n][c][(x-1)S_v+i][(y-1)S_h+j] \times W[m][c][i][j]$$

where $I \in \mathbb{R}^{N \times C \times H_{in} \times W_{in}}$, $W \in \mathbb{R}^{M \times C \times H_f \times W_f}$, and $O \in \mathbb{R}^{N \times M \times H_{out} \times W_{out}}$. $N$ is the number of images in a batch and $M$ is the total number of filters in the CONV layer. $C$ represents the number of channels in the input feature maps and filter weights. $H_{in}$ and $W_{in}$ denote the height and width of the input feature maps, respectively, whereas $H_f$ and $W_f$ denote the height and width of the filter weights, respectively. The vertical and horizontal strides are given by $S_v$ and $S_h$, respectively. The height and width of the output feature maps are given by $H_{out} = (H_{in} - H_f)/S_v + 1$ and $W_{out} = (W_{in} - W_f)/S_h + 1$, respectively.
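As a concrete reference, the CONV-layer computation can be sketched in NumPy (a functional sketch for clarity, not the accelerator's actual dataflow; tensor layouts follow the definitions above):

```python
import numpy as np

def conv2d(I, W, Sv=1, Sh=1):
    """Direct convolution of a batch of input feature maps I with filters W.
    I: (N, C, H_in, W_in), W: (M, C, H_f, W_f) -> O: (N, M, H_out, W_out)."""
    N, C, H_in, W_in = I.shape
    M, _, H_f, W_f = W.shape
    H_out = (H_in - H_f) // Sv + 1
    W_out = (W_in - W_f) // Sh + 1
    O = np.zeros((N, M, H_out, W_out), dtype=I.dtype)
    for n in range(N):              # images in the batch
        for m in range(M):          # filters
            for x in range(H_out):
                for y in range(W_out):
                    # one convolution window across all C channels
                    window = I[n, :, x*Sv:x*Sv+H_f, y*Sh:y*Sh+W_f]
                    O[n, m, x, y] = np.sum(window * W[m])
    return O
```

Note that each filter `W[m]` is reused across every window position, which is why CONV layers have low memory bandwidth pressure but are compute-bound.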
FC layers: the neurons in an FC layer are fully connected with neurons in the previous layer with a specific weight associated with each connection. It is the most memory-intensive layer in CNNs [48, 73]
since no weight is reused. The computation of the FC layer can be represented by a matrix-vector multiplication as follows:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

where $W \in \mathbb{R}^{M \times N}$, $\mathbf{y}, \mathbf{b} \in \mathbb{R}^{M}$, and $\mathbf{x} \in \mathbb{R}^{N}$. The output and input neurons of the FC layer are represented in vector form as $\mathbf{y}$ and $\mathbf{x}$, respectively. $W$ represents the weight matrix and $\mathbf{b}$ is the bias vector associated with the output neurons.
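A minimal NumPy sketch of the FC-layer computation (illustrative only; dimension names follow the text):

```python
import numpy as np

def fc_layer(W, x, b):
    """Fully-connected layer as a matrix-vector multiplication: y = Wx + b.
    W: (M, N) weight matrix, x: (N,) input neurons, b: (M,) bias vector."""
    return W @ x + b
```

Since no weight is reused, each of the M×N weights must be fetched once per input vector, which is what makes FC layers memory-intensive.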
2.2 Exploiting sparsity in CNN accelerators
Both the CSC and zero-step encoding formats compress data by eliminating zero elements, and the accelerators discussed above efficiently process the compressed data. However, weight/activation sparsity can be exploited not only at the PE level but also at the bit level. Stripes, a bit-serial hardware accelerator, avoids the processing of zero prefix and suffix bits through serial-parallel multiplications on CNNs . Each bit of a neuron is processed every cycle and zero bits are skipped on the fly. Multiple neurons are processed in parallel to mitigate the performance loss from bit-serial processing. Pragmatic, a CNN accelerator that enhances Stripes, supports skipping a zero bit regardless of its position . However, it needs to convert the input neuron representation into a zero-bit-only format on the fly, which leads to up to a 16-cycle latency.
2.3 Low-precision CNN training algorithms
The rapid evolution of CNNs in recent years has necessitated the deployment of large-scale distributed training using high-performance computing infrastructure [59, 77, 7]. Even with such a powerful computing infrastructure, training a CNN to convergence usually takes several days, sometimes even a few weeks. Hence, to speed up the CNN training process, various training algorithms with low-precision computations have been proposed.
Single-precision floating-point (FP32) arithmetic has mainly been used as the training standard on GPUs. Meanwhile, efforts have been made to train CNNs with half-precision floating-point (FP16) arithmetic, since it can, in theory, improve training throughput by 2× on the same computing infrastructure. However, compared to FP32, FP16 involves rounding off gradient values and quantizing to a lower-precision representation. This introduces noise into the training process and delays CNN convergence. To maintain a balance between the convergence rate and training throughput, mixed-precision training algorithms that use a combination of FP32 and FP16 have been proposed [32, 41]
. The FP16 representation is used in the most compute-intensive multiplications and the results are accumulated into FP32 outputs. Dynamic loss scaling is required to prevent small gradient values from underflowing in the FP16 representation.
Compared to floating-point arithmetic, fixed-point operations are much faster and more energy-efficient on hardware accelerators, but have a lower dynamic range. To overcome the dynamic range limitation, the dynamic fixed-point format is used in CNN training [12, 14]. Unlike the regular fixed-point format, the dynamic fixed-point format uses multiple scaling factors that are updated during training to adjust the dynamic range of different groups of variables. The CNN training convergence rate is highly sensitive to the rounding scheme used in fixed-point arithmetic . Instead of tuning the dynamic range used in the dynamic fixed-point format, a stochastic rounding method has been proposed to leverage the noise tolerance of CNN algorithms . CNNs are trained in such a manner that the rounding error is exposed to the network and the weights are updated accordingly to mitigate this error, without impacting the convergence rate.
2.4 Efficient on-chip memory interface and emerging NVRAM technologies
CNN training involves feeding vast input feature maps and filter weights to the accelerator computing units to compute the error gradients used to update CNN weights in backpropagation. Besides the large memory size required to store all the CNN weights, a high memory bandwidth becomes indispensable to keep running the computing units at full throughput. Hence, through-silicon via (TSV)-based 3D memory interfaces have been used on high-end GPUs and specialized CNN accelerators . The most widely-used TSV-based 3D memory interface is HBM. In each HBM package, multiple DRAM dies and one memory controller die are first fabricated and tested individually. Then, these dies are aligned, thinned, and bonded using TSVs. The HBM package is connected to the processor using an interposer in a 2.5D manner. This shortens the interconnects within the memory system and between the memory and processor, thus reducing memory access latency and improving memory bandwidth. In addition, since more DRAM dies are integrated within the same footprint area, HBM enables smaller form factors: HBM-2 uses 94% less space relative to GDDR5 for a 1GB memory .
Apart from improving the DRAM interface, the industry has also been exploring various NVRAM technologies to replace DRAM, such as ferroelectric RAM (FeRAM), spin-transfer torque magnetic RAM (STT-MRAM), phase-change memory (PCM), nanotube RAM (NRAM), and resistive RAM (RRAM). It has been shown in [70, 71] that RRAM can be used in an efficient 3D memory interface to deliver high memory bandwidth and energy efficiency. Information is represented by different resistance levels in an RRAM cell. Compared to a DRAM cell, an RRAM cell needs a higher current to change its resistance level. Therefore, the access transistors of an RRAM are larger than those of a DRAM . However, DRAM is expected to reach its scaling limit at the 16nm node , whereas RRAM is believed to be suitable for sub-10nm nodes . Hence, the smaller technology node of an RRAM should offset the access-transistor overhead. Besides, the nonvolatility of RRAM eliminates the need for the periodic refresh that a DRAM requires. This not only saves energy and reduces latency, but also eliminates the refresh circuitry used in DRAM.
3 Sparsity-aware reduced-precision accelerator architecture
In this section, we present the proposed architecture, SPRING: a sparsity-aware reduced-precision CNN accelerator for both training and inference. We first discuss accelerator architecture design and then dive into sparsity-aware acceleration, reduced-precision processing, and the monolithic 3D NVRAM interface.
Fig. 2 shows a high-level view of the architecture. SPRING uses monolithic 3D integration to connect the accelerator tier with an RRAM interface. Unlike TSV-based 3D integration, monolithic 3D integration has only one substrate wafer, where devices are fabricated tier over tier. Hence, the alignment, thinning, and bonding steps of TSV-based 3D integration can be eliminated. In addition, tiers are connected through monolithic inter-tier vias (MIVs), whose diameter is the same as that of local vias and one to two orders of magnitude smaller than that of TSVs. This enables a much higher MIV density, thus leaving much more space for logic. The accelerator tier is placed at the bottom, with the memory controller tier on top of it. Above the memory controller tier lie the multiple RRAM tiers.
Fig. 3 shows the organization of the accelerator tier. The control block handles the CNN configuration sent from the CPU. It fetches the instruction stream and controls the rest of the accelerator to perform acceleration. The activations and filter weights are brought on-chip from the RRAM system by a direct memory access (DMA) controller. Activations and weights are stored in the activation buffer and weight buffer, respectively, in a compressed format. Data compression relies on binary masks that are stored in a dedicated mask buffer. The compression scheme is discussed in Section 3.1. The compressed data and the associated masks are used in the PEs for CNN evaluation. The PEs are designed to operate in parallel to maximize overall throughput.
Fig. 4 shows the main components of a PE. The compressed data are buffered by the activation FIFO and weight FIFO. Then, they enter the pre-compute sparsity module along with the binary masks. Multiple multiplier-accumulator (MAC) lanes are used to compute convolutions or matrix-vector multiplications using zero-free activations and weights after they are preprocessed by the pre-compute sparsity module. The output results go through a post-compute sparsity module to maintain the zero-free format. Batch normalization operations 
are used in modern CNNs to reduce the internal covariate shift. They are executed in the batch normalization module, which supports both the forward pass and backward pass of batch normalization. Three pooling methods are supported by the pooling module: max pooling, min pooling, and mean pooling. The reshape module deals with matrix transposition and data reshaping. Element-wise arithmetic, such as element-wise addition and subtraction, is handled by the scalar module. Lastly, a dedicated loss module is used to process various loss functions, such as L1 loss, L2 loss, and softmax.
3.1 Sparsity-aware acceleration
Traditional accelerator designs can only process dense data and do not support sparse-encoded computation. They treat zero elements in the same manner as regular data and thus perform operations that have no impact on the CNN evaluation results. In this context, weight/activation sparsity cannot be used to speed up computation and reduce the memory footprint. In order to utilize sparsity to skip ineffectual activations and weights, and reduce the memory footprint, SPRING uses a binary-mask scheme to encode the sparse data and performs computations directly in the encoded format.
Compared to the regular dense format, SPRING compresses data vectors by removing all the zero elements. In order to retain the shape of the uncompressed data, an extra binary mask is used. The binary mask has the same shape as the uncompressed data, and each bit in the mask is associated with one element in the original data vector. Fig. 5 shows an example of the binary-mask scheme that SPRING uses to compress activations and weights. The original uncompressed data vector has 16 elements; if each element is represented using 16 bits, the total data length is 256 bits. With the binary-mask scheme, only the six non-zero elements remain. The total length of the compressed data vector and the binary mask is $6 \times 16 + 16 = 112$ bits, which leads to a compression ratio of 2.3× for this example.
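The binary-mask encoding can be sketched as follows (a software illustration of the scheme, assuming 16-bit elements; the hardware operates on the compressed stream directly):

```python
import numpy as np

def compress(vec, bits_per_element=16):
    """Binary-mask compression: keep only the non-zero elements plus a
    1-bit-per-element mask that records where they came from."""
    mask = (vec != 0)
    data = vec[mask]
    compressed_bits = data.size * bits_per_element + mask.size  # data + mask
    return data, mask, compressed_bits

def decompress(data, mask):
    """Rebuild the dense vector from the compressed data and its mask."""
    out = np.zeros(mask.shape, dtype=data.dtype)
    out[mask] = data
    return out
```

For the 16-element example in the text with six non-zero entries, `compressed_bits` is 6×16 + 16 = 112, versus 256 bits uncompressed, i.e., a 2.3× compression ratio.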
We implement the binary-mask scheme using a low-overhead pre-compute sparsity module that preprocesses the sparse-encoded activations and weights and provides zero-free data to the MAC lanes. After output data traverse the MAC lanes, another post-compute sparsity module is used to remove all the zero elements generated by the activation function before storing them back to on-chip memory. Fig. 6 shows the pre-compute sparsity module that takes the zero-free data vectors and binary mask vectors as inputs, and generates an output mask as well as zero-free activations/weights for the MAC lanes. The output binary mask indicates the common indexes of non-zero elements in both the activation and weight vectors. After being preprocessed by the pre-compute sparsity module, the "dangling" non-zero elements in the activation and weight data vectors are removed. The dangling non-zero activations refer to the non-zero elements in the activation data vector whose corresponding weights at the same index are zeros, and vice versa.
Fig. 7(a) shows the mask generation process used by the pre-compute sparsity module. The output mask is the AND of the activation and weight masks. The output mask, together with the activation and weight masks, is then used by two XOR gates for filter mask generation. Fig. 7(b) shows the dangling-data filtering process using the three masks obtained in the previous step. The sequential scanning and filtering mechanism for one type of data used in the filtering step is shown in Algorithm 1. The data vector, as well as the two mask vectors, is scanned in sequence. At each step, a 1 in the output mask implies a common non-zero index. Hence, the corresponding element in the data vector passes through the filter. On the other hand, if a 0 appears in the output mask and the corresponding bit in the filter mask is 1, then a dangling non-zero element is detected in the data vector and is blocked by the filter. If both the output mask bit and filter mask bit are zeros, the data elements at this index in both the activation and weight vectors are zeros and thus already skipped. After filtering out the dangling elements in activations and weights, a zero-collapsing shifter is used to remove the zeros and keep the data vectors zero-free in a similar sequential scanning manner, as shown in Fig. 7(c). These zero-free activations and weights are then fed to the MAC lanes for computation. Since only zero-free data are used in the MAC lanes, ineffectual computations are completely skipped, thus improving throughput and saving energy.
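The mask generation and dangling-element filtering steps above can be sketched in software as follows (a behavioral model only; the helper name `filter_stream` is ours, and the hardware performs these steps with AND/XOR gates and a sequential scanner rather than Python loops):

```python
def pre_compute_sparsity(act_data, act_mask, w_data, w_mask):
    """Behavioral sketch of the pre-compute sparsity module: keep only
    elements whose index is non-zero in BOTH the activation and weight
    vectors. act_data/w_data are the compressed (zero-free) streams;
    act_mask/w_mask are the per-element binary masks."""
    out_mask = [a & w for a, w in zip(act_mask, w_mask)]          # AND
    act_filter = [a ^ o for a, o in zip(act_mask, out_mask)]      # XOR: dangling activations
    w_filter = [w ^ o for w, o in zip(w_mask, out_mask)]          # XOR: dangling weights

    def filter_stream(data, out_mask, filt_mask):
        out, i = [], 0                  # i walks the compressed stream
        for o, f in zip(out_mask, filt_mask):
            if o:                       # common non-zero index: element passes
                out.append(data[i]); i += 1
            elif f:                     # dangling non-zero element: dropped
                i += 1
            # o == f == 0: zero in both vectors, already skipped
        return out

    return (filter_stream(act_data, out_mask, act_filter),
            filter_stream(w_data, out_mask, w_filter), out_mask)
```

The returned activation and weight streams are aligned index-for-index, so the MAC lanes can multiply them directly with no further decoding.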
3.2 Reduced-precision processing using stochastic rounding
SPRING processes CNNs using fixed-point numbers with reduced precision. Every time a new result is generated during CNN evaluation, it has to be rounded to the nearest discrete number, in either a floating-point or a fixed-point representation. Since the gap between adjacent numbers in the fixed-point representation is much larger than in the floating-point representation, the resulting quantization error in the former is much more pronounced. This prevents the fixed-point representation from being used in error-sensitive CNN training. In order to utilize the faster and more energy-efficient fixed-point arithmetic units, we adopt the stochastic rounding method proposed in . The traditional deterministic rounding scheme always rounds a real number to its nearest discrete number, as shown in Eq. 3. We follow the definitions used in , where $\epsilon$ denotes the smallest positive discrete number supported in the fixed-point format and $\lfloor x \rfloor$ is defined as the largest integer multiple of $\epsilon$ less than or equal to $x$:

$$\mathrm{Round}(x) = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \frac{\epsilon}{2} \\ \lfloor x \rfloor + \epsilon & \text{otherwise} \end{cases} \qquad \text{(Eq. 3)}$$
In contrast, in the stochastic rounding scheme, a real number is rounded to $\lfloor x \rfloor$ or $\lfloor x \rfloor + \epsilon$ stochastically, as shown in Eq. 4:

$$\mathrm{Round}(x) = \begin{cases} \lfloor x \rfloor & \text{with probability } 1 - \frac{x - \lfloor x \rfloor}{\epsilon} \\ \lfloor x \rfloor + \epsilon & \text{with probability } \frac{x - \lfloor x \rfloor}{\epsilon} \end{cases} \qquad \text{(Eq. 4)}$$

It is shown in  that with the stochastic rounding scheme, the CNN weights can be trained to tolerate the quantization noise without increasing the number of cycles required for convergence.
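A software sketch of the stochastic rounding rule of Eq. 4 (assuming FL fractional bits, so $\epsilon = 2^{-FL}$; SPRING generates the random bits with an LFSR in hardware rather than a software RNG):

```python
import numpy as np

def stochastic_round(x, fl=16, rng=None):
    """Stochastically round x to a fixed-point grid with fl fractional bits.
    Rounds down with probability 1 - (x - floor(x))/eps and up otherwise,
    so the result is unbiased: E[Round(x)] = x."""
    rng = rng or np.random.default_rng()
    eps = 2.0 ** -fl
    floor = np.floor(np.asarray(x) / eps) * eps   # largest multiple of eps <= x
    p_up = (x - floor) / eps                      # in [0, 1)
    return floor + eps * (rng.random(np.shape(x)) < p_up)
```

Because the expected value of the rounded result equals the unrounded value, the rounding error averages out over many weight updates instead of accumulating as a bias.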
The stochastic rounding scheme is embedded in the MAC lane, as shown in Fig. 8. Activations and weights are represented as fixed-point numbers with IL+FL bits, where IL denotes the number of bits in the integer portion and FL the number of bits in the fractional portion. The zero-free activations and weights from the pre-compute sparsity module are multiplied in the MAC lanes, where the products are represented with 2IL integer bits and 2FL fractional bits to prevent overflow. Accumulations over products are also performed using 2(IL+FL) bits. Then, a stochastic rounding module is used to reduce the numerical precision before applying the activation function or storing the result back to on-chip memory. We use a linear-feedback shift register to generate the pseudo-random numbers needed for stochastic rounding.
3.3 Monolithic 3D NVRAM interface
SPRING uses a monolithic 3D NVRAM interface previously proposed in  and adapts it to its 3D architecture to provide the accelerator tier with significant memory bandwidth. As shown in Fig. 2, SPRING uses two memory channels where each channel has its own memory controller to control the associated two RRAM ranks. An ultra-wide memory bus (1KB wide) is used in each channel, since the interconnects between SPRING and memory controllers, and between memory controllers and RRAM ranks, are implemented using vertical MIVs. This on-chip memory bus not only reduces the access latency relative to the conventional off-chip memory bus, but also makes row-wide granular memory accesses possible to enable energy savings. In addition, the column decoder can be removed to reduce the access latency and power dissipation in this row-wide access granularity scheme. To reduce repeated accesses to the same row, especially the energy-consuming write accesses of RRAM, the row buffer is reused as the write buffer. A dirty bit is used to indicate if the corresponding row entry in the row buffer needs to be written back to the RRAM array when flushed out. The read and write accesses are decoupled by adding another set of vertical interconnects, as shown in Fig. 9 . Hence, the slower write access does not block the faster read access and thus a higher memory bandwidth is achieved. In addition, RRAM nonvolatility not only enables the elimination of bulky periodic refresh circuitry, but also allows the RRAM arrays to be powered down in the idle intervals to reduce leakage power. A rank-level adaptive power-down policy is used to maintain a balance between performance and energy saving: the power-down threshold for each RRAM rank is adapted to its idling pattern so that a rank is only powered down if it is expected to be idle for a long time.
4 Simulation methodology
In this section, we present the simulation flow for SPRING and the experimental setup.
Fig. 10 shows the simulation flow used to evaluate the proposed SPRING accelerator architecture. We implement components of SPRING at the register-transfer level (RTL) with SystemVerilog to estimate delay, power, and area. The RTL design is synthesized by Design Compiler using a 14nm FinFET technology library . Floorplanning is done by Capo , an open-source floorplacer. On-chip buffers are modeled using FinCACTI , a cache modeling tool enhanced from CACTI , to support deeply-scaled FinFETs at the 14nm technology node. The monolithic 3D RRAM system is modeled by NVSim , a circuit-level memory simulator for emerging NVRAMs, and NVMain , an emerging NVRAM architecture simulator. The synthesized results, together with the buffer and RRAM estimations, are then plugged into a customized cycle-accurate Python simulator. This accelerator simulator takes CNNs in the TensorFlow  Protocol Buffers format and estimates the computation latency, power dissipation, energy consumption, and area. SPRING treats the TensorFlow operations like complex instruction set computer (CISC) instructions, where each operation involves many low-level operations.
We compare our design with the Nvidia GeForce GTX 1080 Ti GPU, which uses the Pascal microarchitecture  in a 16nm technology node. The die size of the GTX 1080 Ti is 471 mm² and the base operating frequency is 1.48 GHz, which can be boosted to 1.58 GHz. The GTX 1080 Ti uses an 11 GB GDDR5X memory with 484 GB/s memory bandwidth to provide 10.16 TFLOPS of peak single-precision performance.
We evaluate SPRING and GTX 1080 Ti on seven well-known CNNs: Inception-Resnet V2 , Inception V3 , MobileNet V2 , NASNet-mobile , PNASNet-mobile , Resnet-152 V2 , and VGG-19 
. We evaluate both the training and inference phases of these CNNs on the ImageNet dataset. We use the default batch sizes defined in the TensorFlow-Slim library : 32 for training and 100 for inference.
5 Experimental results
In this section, we present experimental results for SPRING and compare them with those for GTX 1080 Ti.
Table I shows the values of various design parameters used in SPRING. They are obtained through the accelerator design space exploration methodology proposed in . It is shown in  that with 16 FL bits, training CNNs using the stochastic rounding scheme converges in a similar amount of time, with negligible accuracy loss, relative to when single-precision floating-point arithmetic is used. Hence, we use 4 IL bits and 16 FL bits in the fixed-point representation. The convolution loop order refers to the execution order of the multiple for-loops in the CONV layer. SPRING executes convolutions by first unrolling the for-loops across the multiple inputs in the batch. Then, it unrolls the for-loops within the filter weights, followed by unrolling along the activation channel dimension. In the next step, it unrolls the for-loops within the activation feature maps. Finally, it unrolls the for-loops across the output channels. At a similar technology node (14nm vs. 16nm), SPRING reduces chip area by 68% relative to the GTX 1080 Ti.
|Number of PEs||64|
|Number of MAC lanes per PE||72|
|Number of multipliers per MAC lane||16|
|Weight buffer size||24 MB|
|Activation buffer size||12 MB|
|Mask buffer size||4 MB|
|Convolution loop order||batch-weight-in channel-input-out channel|
|Monolithic 3D RRAM||8 GB, 2 channels, 2 ranks, 16 banks, 1 KB bus, 0.5 ns clock period (2.0 GHz)|
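The convolution loop order described above can be illustrated with an explicit loop nest (a functional sketch only; in hardware, the innermost loops shown here are unrolled spatially across the parallel MAC lanes rather than executed sequentially):

```python
import numpy as np

def conv2d_spring_order(I, W, Sv=1, Sh=1):
    """Convolution with SPRING's loop order made explicit: batch innermost
    (unrolled first), then filter positions, then input channels, then
    activation feature-map positions, then output channels outermost."""
    N, C, H_in, W_in = I.shape
    M, _, H_f, W_f = W.shape
    H_out = (H_in - H_f) // Sv + 1
    W_out = (W_in - W_f) // Sh + 1
    O = np.zeros((N, M, H_out, W_out))
    for m in range(M):                          # output channels
        for x in range(H_out):                  # activation feature maps
            for y in range(W_out):
                for c in range(C):              # input (activation) channels
                    for i in range(H_f):        # filter weights
                        for j in range(W_f):
                            for n in range(N):  # batch: unrolled first
                                O[n, m, x, y] += I[n, c, x*Sv+i, y*Sh+j] * W[m, c, i, j]
    return O
```

Placing the batch loop innermost lets one fetched weight `W[m, c, i, j]` be reused across all N images before the next weight is read, which matches the weight-reuse argument made for CONV layers in Section 2.1.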
Fig. 11 and Fig. 12 show the normalized performance of SPRING and the GTX 1080 Ti over the seven CNNs in the training and inference phases, respectively. All results are normalized to those of the GTX 1080 Ti. In the training phase, SPRING achieves speedups ranging from 5.5× to 53.1×, with a geometric mean of 15.6×, on the seven CNNs. In the inference phase, SPRING is faster than the GTX 1080 Ti by 5.1× to 67.9×, with a geometric mean of 15.5×. In both cases, SPRING achieves better speedups on relatively light-weight CNNs, i.e., MobileNet V2, NASNet-mobile, and PNASNet-mobile. This is because these light-weight CNNs do not require large volumes of activations and weights to be transferred between the external memory and on-chip buffers. Therefore, the memory bandwidth bottleneck is alleviated and the speedup from sparsity-aware computation becomes more noteworthy. On the other hand, on large CNNs, such as Inception-Resnet V2 and VGG-19, the sparsity-aware MAC lanes of SPRING sit idle waiting for data fetches from the RRAM system, lowering the speedup relative to the GTX 1080 Ti.
Fig. 13 and Fig. 14 show the normalized reciprocal of power of SPRING and the GTX 1080 Ti in training and inference, respectively. All results are normalized to those of the GTX 1080 Ti. On average, SPRING reduces power dissipation by 4.2× and 4.5× for training and inference, respectively.
Fig. 15 and Fig. 16 show the normalized energy efficiency of SPRING and the GTX 1080 Ti for training and inference, respectively. All results are normalized to those of the GTX 1080 Ti. Compared to the GTX 1080 Ti, SPRING achieves average energy efficiency improvements of 66.0× and 69.1× in training and inference, respectively; the improvement is so large that the GTX 1080 Ti bars are barely visible in these figures. We observe that, among the seven CNNs, SPRING achieves the best normalized energy efficiency on MobileNet V2, in both the training and inference phases. Since MobileNet V2 has a much smaller network size (97.6% parameter reduction compared to VGG-19 ), most of the network weights can be retained in on-chip buffers without accessing the external memory. Hence, SPRING can reduce energy consumption significantly through our sparsity-aware acceleration scheme. On the other hand, the energy reduction from sparsity-aware computation is offset by energy-consuming memory accesses on large CNNs, such as Inception-Resnet V2 and VGG-19. This is consistent with the results reported in , which show that over 80% of the total energy consumption comes from memory accesses.
6 Discussion and limitations
In this section, we discuss the assumptions we made in this work and the limitations of the SPRING architecture.
The performance speedup, power reduction ratio, and energy efficiency improvement reported in Section 5 are obtained at the batch level. We use batch-level training results based on the assumption that, given sufficient precision bits, fixed-point training with the stochastic rounding scheme converges in no more training steps than training based on single-precision floating-point arithmetic. This assumption is supported by , where 16 FL bits are used for fixed-point training with stochastic rounding and the number of epochs to convergence is similar to that of single-precision floating-point training.
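The stochastic rounding scheme referred to above can be sketched as follows. This is a minimal illustration, not SPRING's hardware implementation: a value is snapped to a fixed-point grid with `fl_bits` fractional bits, rounding up with probability proportional to the distance from the lower quantization level, so the rounding is unbiased in expectation.

```python
import random

def stochastic_round(x, fl_bits=16):
    """Round x to a fixed-point grid with fl_bits fractional bits.

    The value is rounded up with probability equal to its distance
    from the lower quantization level, making E[round(x)] = x.
    """
    scale = 1 << fl_bits            # 2^FL quantization levels per unit
    scaled = x * scale
    lower = int(scaled // 1)        # lower quantization level
    frac = scaled - lower           # distance to the lower level, in [0, 1)
    if random.random() < frac:      # round up with probability frac
        lower += 1
    return lower / scale
```

Values that already lie on the grid are returned unchanged, and any other value lands on one of its two nearest grid points, at most `1 / 2**fl_bits` away.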
A major limitation of the SPRING accelerator architecture is that the sequential scanning and filtering mechanism shown in Algorithm 1 needs multiple cycles to filter out dangling non-zero elements and collapse the resulting zeros. This may incur a long data preprocessing latency, which makes SPRING less suitable for latency-sensitive edge inference applications. However, since this sequential scanning and filtering scheme is pipelined, the overall throughput is unaffected, and the total latency for one batch is therefore independent of the number of sequential scanning steps used by the pre-compute sparsity module.
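The intent of this filtering step can be illustrated with a hypothetical software sketch (not the exact Algorithm 1): given two mask-encoded sparse operand vectors, a non-zero element whose partner operand is zero ("dangling") contributes nothing to the dot product, so only positions where both masks are set need to be kept and compacted. The function name and encoding below are illustrative assumptions.

```python
def scan_and_filter(act_vals, act_mask, wgt_vals, wgt_mask):
    """Scan two mask-encoded sparse vectors sequentially; keep only
    element pairs where both operands are non-zero.

    act_vals/wgt_vals hold the compacted non-zero values; act_mask/
    wgt_mask are per-position binary masks (1 = non-zero).
    """
    act_out, wgt_out = [], []
    ai = wi = 0                        # read pointers into the value arrays
    for a_bit, w_bit in zip(act_mask, wgt_mask):
        if a_bit and w_bit:            # both non-zero: keep the pair
            act_out.append(act_vals[ai])
            wgt_out.append(wgt_vals[wi])
        if a_bit:
            ai += 1                    # skip past a dangling activation
        if w_bit:
            wi += 1                    # skip past a dangling weight
    return act_out, wgt_out
```

Each loop iteration corresponds to one scanning step; a hardware pipeline can overlap these steps across successive vectors, which is why the throughput remains unaffected even though a single vector takes multiple cycles.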
Our binary mask encoding method is similar to the dual indexing encoding proposed in . Although both schemes use a binary mask to indicate the positions of non-zero elements in the data vector, our binary mask encoding scheme has several advantages. First, the index masks are kept in binary form throughout the entire sparsity encoding and decoding process. Hence, the storage overhead of the binary mask is at most 5%, assuming 4 IL bits and 16 FL bits (one mask bit per 20-bit value). The actual storage overhead is much lower than this bound since most of the activations and weights are zeros. In contrast, the binary masks in  are converted to decimal masks to serve as the select signals of a MUX. This not only increases the storage overhead of the masks, but also increases the computational complexity of mask manipulation. Moreover, their binary-to-decimal mask conversion process is sequential, incurring a processing latency that grows with the size of the mask vector.
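A minimal sketch of the binary mask encoding described above, assuming the simple (values, mask) representation; the function names are illustrative, not SPRING's RTL interface. With 20-bit fixed-point values (4 IL + 16 FL bits), one mask bit per value gives the stated worst-case overhead of 1/20 = 5%.

```python
def mask_encode(vec):
    """Encode a dense vector as (compacted non-zero values, binary mask)."""
    mask = [1 if v != 0 else 0 for v in vec]   # 1 bit per element
    vals = [v for v in vec if v != 0]          # zeros are collapsed away
    return vals, mask

def mask_decode(vals, mask):
    """Reconstruct the dense vector from the values and the binary mask."""
    out, i = [], 0
    for bit in mask:
        if bit:
            out.append(vals[i])                # next stored non-zero value
            i += 1
        else:
            out.append(0)                      # zero implied by the mask
    return out
```

Because the mask stays binary end to end, decoding is a simple scatter of stored values into mask positions; no binary-to-decimal conversion or MUX select logic is needed.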
7 Conclusion
In this article, we proposed SPRING, a sparsity-aware reduced-precision CNN accelerator. A binary mask scheme is used to encode weight/activation sparsity; it is processed efficiently through a sequential scanning and filtering mechanism. SPRING adopts the stochastic rounding algorithm to train CNNs using a reduced-precision fixed-point numerical representation. An efficient monolithic 3D NVRAM interface provides significant memory bandwidth for CNN evaluation. Compared to the Nvidia GeForce GTX 1080 Ti, SPRING achieves 15.6×, 4.2×, and 66.0× improvements in performance, power reduction, and energy efficiency, respectively, in the training phase, and 15.5×, 4.5×, and 69.1× improvements in performance, power reduction, and energy efficiency, respectively, in the inference phase.
-  (2016) Tensorflow: a system for large-scale machine learning. In Proc. USENIX Symp Operating Syst. Design Implementation, pp. 265–283. Cited by: §4.
-  (2016-06) Cnvlutin: ineffectual-neuron-free deep neural network computing. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 1–13. Cited by: §1, §2.2.
-  (2017-Oct.) Bit-pragmatic deep neural network computing. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 382–394. Cited by: §2.2.
-  (2016-Oct.) Fused-layer CNN accelerators. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12. Cited by: §1.
-  (2015)(Website) External Links: Cited by: §2.4.
-  (2014-05) 3D sequential integration opportunities and technology optimization. In Proc. IEEE Int. Interconnect Technology Conf., pp. 373–376. Cited by: §3.
-  (2017) Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster. Procedia Computer Science 108, pp. 315–324. Cited by: §2.3.
-  (2014-Mar.) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. Int. Conf. Architectural Support Programming Languages Operating Syst., pp. 269–284. Cited by: §1.
-  (2016-06) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 367–379. Cited by: §1.
-  (2014-Dec.) DaDianNao: a machine-learning supercomputer. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 609–622. Cited by: §1, §2.1.
-  (2018-Mar.) Volta: performance and programmability. IEEE Micro 38 (2), pp. 42–52. Cited by: §1, §2.4.
-  (2015-05) Training deep neural networks with low precision multiplications. In Proc. Int. Conf. Learning Representations, Cited by: §2.3.
-  (2016)(Website) External Links: Cited by: §2.4.
-  (2018-05) Mixed precision training of convolutional neural networks using integer operations. In Proc. Int. Conf. Learning Representations, Cited by: §2.3.
-  (2017-Oct.) CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 395–408. Cited by: §1.
-  (2012-07) NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 31 (7), pp. 994–1007. Cited by: §4.
-  (2015-06) ShiDianNao: shifting vision processing closer to the sensor. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 92–104. Cited by: §1.
-  (2017-Mar.) Ultra-performance Pascal GPU and NVLink interconnect. IEEE Micro 37 (2), pp. 7–17. Cited by: §1, §4.
-  (2011-Feb.) Thread block compaction for efficient SIMT control flow. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 25–36. Cited by: §2.2.
-  (2007-Dec.) Dynamic warp formation and scheduling for efficient GPU control flow. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 407–420. Cited by: §2.2.
-  (2017) TETRIS: scalable and efficient neural network acceleration with 3D memory. In Proc. Int. Conf. Architectural Support Programming Languages Operating Syst., pp. 751–764. Cited by: §1.
-  (2018-Oct.) Hybrid monolithic 3-D IC floorplanner. IEEE Trans. Very Large Scale Integration Syst. 26 (10), pp. 1868–1880. Cited by: §4.
-  (2015-07) Deep learning with limited numerical precision. In Proc. Int. Conf. Machine Learning, pp. 1737–1746. Cited by: §1, §2.3, §3.2, §3.2, §5, §6.
-  (2016-06) EIE: efficient inference engine on compressed deep neural network. In Proc. Int. Symp. Computer Architecture, pp. 243–254. Cited by: §1, §2.2.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.2.
-  (2015) Learning both weights and connections for efficient neural networks. In Proc. Int. Conf. Neural Information Processing Syst., pp. 1135–1143. Cited by: §2.2.
-  (2016-Oct.) Identity mappings in deep residual networks. In Proc. European Conf. Computer Vision, pp. 630–645. Cited by: §4.
-  (2018-06) UCNN: exploiting computational reuse in deep neural networks via weight repetition. In Proc. Int. Symp. Computer Architecture, pp. 674–687. Cited by: §1, §2.2.
-  (2014)(Website) External Links: Cited by: §1.
-  (1998-Apr.) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. Journal Uncertainty, Fuzziness Knowledge-Based Syst. 6 (2), pp. 107–116. Cited by: §2.3.
-  (2015-07) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. Int. Conf. Machine Learning, pp. 448–456. Cited by: §3.
-  (2018-07) Highly scalable deep learning training system with mixed-precision: training Imagenet in four minutes. arXiv preprint arXiv:1807.11205. Cited by: §2.3.
-  (2017-06) In-datacenter performance analysis of a tensor processing unit. In Proc. Int. Symp. Computer Architecture, pp. 1–12. Cited by: §1.
-  (2016-Oct.) Stripes: bit-serial deep neural network computing. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12. Cited by: §2.2.
-  (2012-Dec.) Imagenet classification with deep convolutional neural networks. In Proc. Int. Conf. Neural Information Processing Syst., pp. 1097–1105. Cited by: §1.
-  (2013-Feb.) Convergence and scalarization for data-parallel architectures. In Proc. IEEE/ACM Int. Symp. Code Generation Optimization, pp. 1–11. Cited by: §2.2.
-  (2018-Jan.) Supporting compressed-sparse activations and weights on SIMD-like accelerator for sparse convolutional neural networks. In Proc. Asia South Pacific Design Automation Conf., pp. 105–110. Cited by: §6.
-  (2018-Sep.) Progressive neural architecture search. In Proc. European Conf. Computer Vision, pp. 19–34. Cited by: §4.
-  (2017-Feb.) FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 553–564. Cited by: §1.
-  (2017) Modeling the resource requirements of convolutional neural networks on mobile devices. In Proc. ACM Int. Conf. Multimedia, pp. 1663–1671. Cited by: §1.
-  (2018-05) Mixed precision training. In Proc. Int. Conf. Learning Representations, Cited by: §2.3.
-  (2009) CACTI 6.0: a tool to model large caches. HP Laboratories, pp. 22–31. Cited by: §4.
-  (2011-Dec.) Improving GPU performance via large warps and two-level warp scheduling. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 308–317. Cited by: §2.2.
-  (2017-06) SCNN: an accelerator for compressed-sparse convolutional neural networks. In Proc. ACM/IEEE Int. Symp. Computer Architecture, pp. 27–40. Cited by: §1, §2.2.
-  (2016-Mar.) FPGA based implementation of deep neural networks using on-chip memory only. In Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 1011–1015. Cited by: §1.
-  (2015-07) NVMain 2.0: a user-friendly memory simulator to model (non-)volatile memory systems. IEEE Comput. Archit. Lett. 14 (2), pp. 140–143. Cited by: §4.
-  (2014) A reconfigurable fabric for accelerating large-scale datacenter services. In Proc. Int. Symp. Computer Architecture, pp. 13–24. Cited by: §1.
-  (2016) Going deeper with embedded FPGA platform for convolutional neural network. In Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, pp. 26–35. Cited by: §1, §2.1.
-  (2018-Feb.) Compressing DMA engine: leveraging activation sparsity for training deep neural networks. In Proc. IEEE Int. Symp. High Performance Computer Architecture, pp. 78–91. Cited by: §1.
-  (2005-Apr.) Capo: robust and scalable open-source min-cut floorplacer. In Proc. Int. Symp. Physical Design, pp. 224–226. Cited by: §4.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. Int. Journal Computer Vision 115 (3), pp. 211–252. Cited by: §4.
-  (2016)(Website) External Links: Cited by: §1.
-  (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 4510–4520. Cited by: §4, §5.
-  (2014-07) FinCACTI: architectural analysis and modeling of caches with deeply-scaled FinFET devices. In Proc. IEEE Computer Society Annual Symp. VLSI, pp. 290–295. Cited by: §4.
-  (2016-Aug.) Overcoming resource underutilization in spatial CNN accelerators. In Proc. Int. Conf. Field Programmable Logic Applications, pp. 1–4. Cited by: §1.
-  (2009-06) A 5ns fast write multi-level non-volatile 1 K bits RRAM memory with advance write scheme. In Proc. Symp VLSI Circuits, pp. 82–83. Cited by: §2.4, TABLE I.
-  (2016)(Website) External Links: Cited by: §4.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.
-  (2015-Sep.) Scalable distributed DNN training using commodity GPU cloud computing. In Proc. Conf. Int. Speech Communication Association, Cited by: §2.3.
-  (2018)(Website) External Links: Cited by: §4.
-  (2017) Inception-v4, Inception-Resnet and the impact of residual connections on learning. In Proc. AAAI Conf. Artificial Intelligence, Cited by: §4.
-  (2016-06) Rethinking the Inception architecture for computer vision. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 2818–2826. Cited by: §4.
-  (2013-05)(Website) External Links: Cited by: §2.4.
-  (2017-06) ScaleDeep: a scalable compute architecture for learning and evaluating deep networks. In Proc. Int. Symp. Computer Architecture, pp. 13–26. Cited by: §1.
-  (2003) Automatic performance tuning of sparse matrix kernels. Ph.D. Thesis, University of California, Berkeley. Cited by: §2.2.
-  (2017-Mar.) Chain-NN: an energy-efficient 1D chain architecture for accelerating deep convolutional neural networks. In Proc. Design, Automation Test Europe Conf. Exhibition, pp. 1032–1037. Cited by: §1.
-  (1991-05) Dynamically scaled fixed point arithmetic. In Proc. IEEE Pacific Rim Conf. Communications, Computers Signal Processing, pp. 315–318 vol.1. Cited by: §2.3.
-  Accelerating CNN algorithm with fine-grained dataflow architectures. In Proc. IEEE Int. Conf. High Performance Computing Communications; IEEE Int. Conf. Smart City; IEEE Int. Conf. Data Science Syst., pp. 243–251. Cited by: §1.
-  (2018-Dec.) Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992. Cited by: §1, §2.4.
-  (2018-07) Energy-efficient monolithic three-dimensional on-chip memory architectures. IEEE Trans. Nanotechnology 17 (4), pp. 620–633. Cited by: §2.4, §3.3.
-  (2018-Oct.) A monolithic 3D hybrid architecture for energy-efficient computation. IEEE Trans. Multi-Scale Computing Syst. 4 (4), pp. 533–547. Cited by: §2.4, Fig. 9, §3.3.
-  (2019) Software-defined design space exploration for an efficient AI accelerator architecture. arXiv preprint arXiv:1903.07676. Cited by: §5.
-  (2016-Nov.) Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design, pp. 1–8. Cited by: §2.1.
-  (2016-Oct.) Cambricon-X: an accelerator for sparse neural networks. In Proc. IEEE/ACM Int. Symp. Microarchitecture, pp. 1–12. Cited by: §1, §2.2, §5.
-  (2018-Mar.) SparseNN: an energy-efficient neural network accelerator exploiting input and output sparsity. In Proc. Design, Automation Test Europe Conf. Exhibition, pp. 241–244. Cited by: §1, §2.2.
-  (2018-06) Learning transferable architectures for scalable image recognition. In Proc. IEEE Conf. Computer Vision Pattern Recognition, pp. 8697–8710. Cited by: §4.
-  (2017-Oct.) Distributed training large-scale deep architectures. In Proc. Int. Conf. Advanced Data Mining Applications, pp. 18–32. Cited by: §2.3.