Neural networks (NNs) have shown remarkable performance in a broad range of applications, from computer vision [krizhevsky2012imagenet, voulodimos2018deep] to health-care data analysis [stephansen2018neural, miotto2017deep]. These algorithms have gradually evolved into larger models with more layers and an increasing number of parameters [simonyan2014very]. While large deep NNs (DNNs) are powerful and provide high accuracy on complex tasks, they cannot be easily deployed in embedded mobile applications because of two major problems [zhang2016cambricon]. First, because the models are large and often over-parameterized, they must be stored in external DRAM. Second, accessing model parameters from external DRAM consumes a large amount of energy [han2015deep]. For example, in 45nm CMOS technology, a 32-bit DRAM access requires 640 pJ, which is nearly three orders of magnitude more than a 32-bit floating-point add operation (0.9 pJ) [han2015deep]. Therefore, it is hard to deploy large DNNs on battery-constrained mobile platforms.
Network compression via pruning is one possible way to fit large DNNs such as VGG-16 (138M parameters, 520MB) in on-chip SRAM [li2019squeezeflow, parashar2017scnn, lee2008sparse]. However, it is challenging because sparse matrices add extra levels of irregularity to the weights' addresses [li2019squeezeflow, chen2019eyeriss]. In [han2015learning], a pruning method was proposed that removes connections whose weights are smaller than a threshold through an iterative process of pruning and retraining. Although this method prunes the network with no loss of accuracy, it is heuristic and the threshold values need to be carefully selected. Further, the resultant matrix lacks structure. In [han2015deep], the authors prune the network by learning the important connections using a thresholding mechanism and then apply weight sharing and Huffman coding to compress the network even further. The problem with this method is that three vectors need to be stored: (1) the values of the non-zero weights of the sparse matrix, (2) the locations (addresses) of the non-zero weights, and (3) a pointer vector that keeps track of the weights in each column. In addition, weight sharing adds another level of indirection and complexity [han2016eie]. In summary, the baseline pruning method is powerful from an algorithmic perspective, but its mapping to hardware is inefficient and can require a memory footprint as high as that of the original model.
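To make the storage overhead concrete, the three vectors listed above resemble a compressed sparse column (CSC) layout. The following is a minimal sketch under that assumption, not the implementation of [han2015deep]; the function name `to_csc` and the toy matrix are purely illustrative.

```python
def to_csc(dense):
    """Compress a dense matrix (list of rows) into three CSC-style vectors.

    Returns (values, row_idx, col_ptr):
      values  : non-zero weights, stored column by column
      row_idx : row address of each stored weight
      col_ptr : col_ptr[c] .. col_ptr[c + 1] delimits column c inside `values`
    """
    n_rows, n_cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for c in range(n_cols):
        for r in range(n_rows):
            if dense[r][c] != 0:
                values.append(dense[r][c])
                row_idx.append(r)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


# Hypothetical 3x3 sparse weight matrix, for illustration only.
W = [[0, 2, 0],
     [1, 0, 0],
     [0, 3, 4]]
values, row_idx, col_ptr = to_csc(W)
print(values, row_idx, col_ptr)
```

In this layout only `values` carries model information; `row_idx` and `col_ptr` are addressing overhead, which is the part that on-the-fly index generation aims to remove.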
In this paper, we present a hardware-aware method for pruning dense DNNs that reduces the memory footprint while preserving the original accuracy. We use an on-die linear feedback shift register (LFSR) with a known seed to generate a pseudo-random sequence (PRS). Next, we use this PRS to regularize and prune the network. In the last step, we retrain the sparse network to recover the accuracy of the pruned model. In addition, during inference we use the LFSR with the same seed to regenerate the indices in real time and perform the multiplication between the sparse weight matrix and the input/activation vector. Consequently, we no longer need to store the addresses of the sparse weights, which reduces the memory footprint significantly.
2 Proposed Hardware-Aware Pruning
The proposed pruning method and the baseline method are illustrated in Figure 1. The proposed method consists of four steps. It begins by generating a pseudo-random sequence (PRS) using two LFSRs, one for the row indices and one for the column indices. We then use the generated PRS as the indices of the synapses (i.e., connections) to sparsify, applying standard regularization methods in an iterative approach. The selected synapses are regularized to force them toward zero in the training step and are pruned away in the pruning step. The last step is retraining the sparse model so that it provides higher accuracy. Using a PRS to generate the locations of the zero weights in the connectivity matrix provides good accuracy while making it easy to generate the indices on the fly instead of storing them in a separate memory sub-bank.
We use Han et al. (2015) [han2015learning] as the state-of-the-art baseline pruning technique against which the proposed algorithm is compared. In [han2015learning], illustrated in Figure 1, a method was proposed that prunes redundant connections in an iterative process based on a threshold. First, the neural network is trained starting from random initial conditions. Next, the connections with weights below a threshold are pruned iteratively. Finally, the pruned network is fine-tuned using several epochs of retraining. Interested readers are referred to [han2015learning] for further details.
2.1 Linear Feedback Shift Register (LFSR)
The LFSR [mita2002novel] is one of the most commonly used topologies for generating pseudo-random bit sequences [qasem2018double]. The advantages of using an LFSR to generate the indices are: (1) it can easily be implemented in hardware, and (2) the PRS has key statistical properties that preserve the rank of the generated connectivity matrix [cusick2017cryptographic]. An LFSR consists of an array of flip-flops with an initial state called the input seed, followed by linear feedback performed by several exclusive-or (XOR) gates. LFSRs can be described mathematically through $n$-th order characteristic polynomials:
$$P(x) = x^n + c_{n-1}x^{n-1} + \dots + c_1 x + 1, \quad c_k \in \{0, 1\} \tag{1}$$
To obtain the maximum PRS length, the characteristic polynomial has to be primitive [cusick2017cryptographic], in which case the maximum period without repetition is $2^n - 1$. In the proposed method we utilize two LFSRs to generate random sequences for the row and column indices separately. The row index indicates the address of the element in the input vector, while the column index encodes the address of the element in the output vector.
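The behaviour described above can be illustrated with a small software model. This is a sketch of a generic Fibonacci LFSR, not the authors' circuit: the register widths and tap positions (4-bit taps (4, 3) and 8-bit taps (8, 6, 5, 4), both commonly tabulated maximal-length choices) are assumptions, since the text does not state which polynomials were used.

```python
def lfsr_states(seed, taps, nbits):
    """Generate successive states of a Fibonacci LFSR.

    seed  : non-zero initial register state (the input seed)
    taps  : 1-based bit positions XORed together to form the feedback bit
    nbits : number of flip-flops n in the register
    """
    if nbits not in taps:
        raise ValueError("taps must include the most significant bit")
    mask = (1 << nbits) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the LFSR locks up")
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        # Shift left by one and insert the feedback bit at the LSB.
        state = ((state << 1) | feedback) & mask


def lfsr_period(seed, taps, nbits):
    """Count the number of steps until the register returns to its seed."""
    gen = lfsr_states(seed, taps, nbits)
    first = next(gen)
    period = 1
    for state in gen:
        if state == first:
            return period
        period += 1


# With a primitive polynomial the period should equal 2**n - 1.
for taps, nbits in [((4, 3), 4), ((8, 6, 5, 4), 8)]:
    print(nbits, lfsr_period(1, taps, nbits), 2 ** nbits - 1)
```

Because the sequence is fully determined by the seed and the taps, the same indices can be regenerated at inference time without storing them.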
2.2 Training with PRS based Regularization
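A fully connected (FC) layer of a DNN performs the following function:
$$y = f(Wx + b) \tag{2}$$
where $W$ is the weight matrix, $x$ is the input vector, $y$ is the output vector, $b$ is a vector of bias values, and $f$ is the non-linear activation function [muralimanohar2009cacti]. For simplification, $b$ can be merged with $W$ by appending it as an additional column at the end of matrix $W$ (with a constant 1 appended to $x$). Then we can write the above equation as:
$$y_c = f\left(\sum_{d} W_{cd}\, x_d\right) \tag{3}$$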
where $c$ and $d$ are the row and column indices of the original weight matrix, respectively. After selecting the synapses using the PRS, we regularize them during training. We apply strong regularization to the selected synapses, based on the LFSR indices, in order to force the network to zero out these synapses. We have studied both L1 and L2 regularization [han2015learning] to penalize non-zero weight values. For L2 regularization, a regularizer component is added to the cost $J$, as shown in Eq. 4, and the weights are updated according to Eq. 5:
$$J = J_0 + \frac{\lambda}{2} \sum_{(i,j)} \left(W^{(L)}_{ij}\right)^2 \tag{4}$$
$$W^{(L)}_{ij} \leftarrow W^{(L)}_{ij} - \alpha \left( \frac{\partial J_0}{\partial W^{(L)}_{ij}} + \lambda W^{(L)}_{ij} \right) \tag{5}$$
where $J_0$ is the unregularized cost, $W^{(L)}_{ij}$ is a selected weight of layer $L$, $i$ and $j$ are the row and column indices generated by LFSR 1 and LFSR 2, respectively, $\alpha$ is the learning rate, $L$ is the index of the layer in the network, and $\lambda$ is the regularization parameter, which can be tuned. A larger $\lambda$ penalizes the weight values more strongly and pushes them closer to zero.
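The effect of the selective L2 penalty on the weight update can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' training code: `selected` is a hypothetical set of (row, column) pairs standing in for the LFSR-generated indices, non-selected weights are assumed to receive no penalty, and the task gradient is set to zero so that only the shrinkage is visible.

```python
def regularized_step(W, grad, selected, lr, lam):
    """One gradient step with an L2 penalty applied only to selected synapses.

    W, grad  : weight matrix and task-loss gradient (lists of rows)
    selected : set of (i, j) index pairs to be driven toward zero
    lr, lam  : learning rate and regularization parameter
    """
    new_W = [row[:] for row in W]
    for i, row in enumerate(W):
        for j, w in enumerate(row):
            # Gradient of (lam / 2) * w**2 is lam * w; zero for unselected weights.
            penalty = lam * w if (i, j) in selected else 0.0
            new_W[i][j] = w - lr * (grad[i][j] + penalty)
    return new_W


W = [[1.0, 1.0], [1.0, 1.0]]
zero_grad = [[0.0, 0.0], [0.0, 0.0]]
selected = {(0, 0), (1, 1)}  # assumed PRS-selected positions

for _ in range(20):
    W = regularized_step(W, zero_grad, selected, lr=0.1, lam=2.0)
print(W)
```

With a zero task gradient, each step multiplies a selected weight by (1 - lr * lam), so the selected entries decay geometrically toward zero while the others are untouched. They never reach exactly zero, which is why a separate pruning step is still needed.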
2.3 Pruning and Retraining
After heavily regularizing the selected weights, a pruning step is employed to make sure that all the selected weights are exactly zero, since regularization makes the connections very close to zero but not exactly zero. With the LFSR-based pruning method, the activation computation becomes:
$$y_i = f\left(\sum_{j} S_{ij}\, x_j\right) \tag{6}$$
where $x$ and $y$ are the input and output of the layer, $f$ is the activation function, $S$ is the sparse weight matrix, and the indices $i$ and $j$ are generated by the first and second LFSR, respectively. The last step of the process is to retrain the pruned network to fine-tune the remaining synapses so that they better compensate for the removed connections.
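The pruning and retraining steps can be sketched as follows. This is a minimal illustration, not the authors' code: `selected` is a hypothetical set of (row, column) pairs standing in for the LFSR-generated indices, and holding pruned weights at zero during retraining is an assumption borrowed from common pruning practice.

```python
def prune(W, selected):
    """Set every selected weight to exactly zero."""
    return [[0.0 if (i, j) in selected else w for j, w in enumerate(row)]
            for i, row in enumerate(W)]


def retrain_step(W, grad, selected, lr):
    """Gradient step that updates only surviving weights; pruned ones stay zero."""
    return [[0.0 if (i, j) in selected else w - lr * grad[i][j]
             for j, w in enumerate(row)]
            for i, row in enumerate(W)]


# After regularization the selected weights are small but not exactly zero.
W = [[1e-4, 0.7], [-0.3, 2e-5]]
selected = {(0, 0), (1, 1)}

S = prune(W, selected)
S = retrain_step(S, [[1.0, 1.0], [1.0, 1.0]], selected, lr=0.1)
print(S)
```

Masking the update in `retrain_step` keeps the sparsity pattern fixed, so the index sequence generated at training time remains valid at inference time.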
2.4 DNN compression
To fully understand the advantages and limitations of the baseline and proposed algorithms, we implement both methods in digital hardware (Figure 2). The two circuit diagrams show how the hardware resources and operations differ when computing a sparsely connected FC layer with N input neurons, M output neurons and a given sparsity. In the hardware implementation of our proposed method, an LFSR generates the index of the input to be multiplied by the corresponding weight in the sparse weight matrix. The LFSR generates a pseudo-random sequence with values between 1 and $2^n - 1$. To keep the values within the number of input neurons, we multiply the generated value by the number of input neurons and select the most significant bits (MSBs). The goal is to avoid redundant clock cycles when the generated number is greater than the number of neurons. After the multiply-and-accumulate operations, the result is stored in the output buffer. The output index comes from a second LFSR with a different input seed, which is responsible for generating the sequence of output addresses. The exact number of memory reads from the input and output buffers depends on the model size as well as the number of multiply-and-accumulate units available for parallel computation. For the baseline algorithm, the sparse weight matrix is compressed into three vectors that must be saved in memory: the non-zero weight values (S), the locations of the non-zero weights (I), and a pointer vector that points to the start of each column of the weight matrix (P). Moreover, each entry of S and I is four or eight bits wide, and this limited index representation incurs additional memory usage. For instance, with four-bit indices, if more than 15 zeros appear before a non-zero entry, a padding zero is added to vectors S and I.
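The index scaling and the LFSR-driven multiply-accumulate loop of the proposed datapath can be modelled in software. The sketch below makes several assumptions that the text does not confirm: a 4-bit register with taps (4, 3), seeds 1 and 9, non-zero weights stored in the order in which their indices are generated, and no activation function. All function names are hypothetical.

```python
def lfsr(seed, taps, nbits):
    """Fibonacci LFSR; assumes a non-zero seed and taps that include nbits."""
    mask = (1 << nbits) - 1
    state = seed & mask
    while True:
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask


def scale_index(value, nbits, length):
    """Multiply by the vector length and keep the MSBs: maps into [0, length)."""
    return (value * length) >> nbits


def sparse_fc(weights, x, n_out, in_lfsr, out_lfsr, nbits):
    """Accumulate w * x[i] into y[j], with i and j regenerated by two LFSRs."""
    y = [0.0] * n_out
    for w, r, c in zip(weights, in_lfsr, out_lfsr):
        i = scale_index(r, nbits, len(x))   # input address
        j = scale_index(c, nbits, n_out)    # output address
        y[j] += w * x[i]
    return y


weights = [0.5, -1.0, 2.0]        # stored non-zero weights only, no index vectors
x = [1.0, 2.0, 3.0, 4.0]          # N = 4 inputs
y = sparse_fc(weights, x, 3, lfsr(1, (4, 3), 4), lfsr(9, (4, 3), 4), 4)
print(y)
```

Only the weight values are read from memory here. Every generated value maps to a valid address, which is the property the multiply-and-take-MSBs step is meant to provide. In this toy model two generated pairs may map to the same position, in which case their contributions simply accumulate.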
3.1 Simulation results for the proposed pruning algorithm
The proposed methodology is demonstrated on three pruned networks, LeNet-300-100, LeNet-5 and VGG-16, using the MNIST, CIFAR-10 and down-sampled ImageNet [ILSVRC15] datasets. Training is carried out on Nvidia GTX 1080 Ti GPUs. The key parameters for the hardware implementation are shown in Table 1. The compression rate along with the top-1 error before and after pruning for the three models is shown in Table 2. We observe that the proposed method largely preserves the rank of the weight matrices (Table 3). As a matter of fact, the rank in the proposed approach is close to the full rank of the dense matrix (as in the unpruned models). Since the PRS maintains the matrix rank, we infer that the expressibility of the weight matrices and the accuracy of the network can remain unchanged.
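A rank comparison of this kind can be tried on a toy example. The sketch below is only a sanity check under loose assumptions: Python's `random` module stands in for the PRS (the paper uses LFSR indices), the 8x8 integer matrix is hypothetical, and the rank is computed by exact Gaussian elimination so that no numerical tolerance is involved.

```python
from fractions import Fraction
import random


def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            if factor:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


weight_rng = random.Random(0)
dense = [[weight_rng.randint(1, 9) for _ in range(8)] for _ in range(8)]

mask_rng = random.Random(1)  # stand-in for the PRS; an assumption
pruned = [[w if mask_rng.random() > 0.5 else 0 for w in row] for row in dense]

print("dense rank:", matrix_rank(dense), "pruned rank:", matrix_rank(pruned))
```

The intuition being probed is that an unstructured, pseudo-random zero pattern rarely zeroes out or aligns entire rows or columns, so the rank tends to stay near full. A toy run like this does not substitute for the measurements in Table 3.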
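Table 1: Key parameters for the hardware implementation.
|Parameter|Value|
|Technology Node|TSMC 65nm|
|Index Bitwidth|4b, 8b|
|Memory Bank Size|256B, 512B, 1KB, 4KB|

Table 2: Top-1 error and number of parameters before and after pruning.
|Model|Top-1 Error|Parameters|
|Modified VGG-16 Unpruned|48.5%|23M|
|Modified VGG-16 Pruned|52.1%|3.3M|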
3.1.1 Pruning on Fully Connected Layers
Large DNNs are over-parameterized, and most of this over-parameterization lies in their large fully connected layers. For instance, 124M of the 138M parameters of VGG-16 belong to its three fully connected layers. Because of this, we focus on pruning the connections of the fully connected layers, as they account for most of the energy and memory consumption from a hardware perspective.
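The 124M figure can be checked with a short calculation. The layer shapes below are those of the standard VGG-16 classifier for 224x224 inputs (a 7x7x512 feature map followed by 4096, 4096 and 1000 units), not the modified VGG-16 used in the experiments; the total of 138M is the rounded figure quoted in the text.

```python
# Standard VGG-16 classifier: 7x7x512 feature map -> 4096 -> 4096 -> 1000.
fc_shapes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Each FC layer has n_in * n_out weights plus n_out biases.
fc_params = sum(n_in * n_out + n_out for n_in, n_out in fc_shapes)

total_params = 138_000_000  # rounded total quoted in the text
print(f"FC parameters: {fc_params:,}")
print(f"Share of the network: {fc_params / total_params:.1%}")
```

Roughly nine out of ten parameters sit in the three FC layers, which is why pruning them dominates the achievable memory savings.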
3.1.2 LeNet on MNIST
LeNet-300-100 is a fully connected network with two hidden layers of 300 and 100 neurons, respectively, which achieves a 4.9% error rate on the MNIST dataset. Accuracy loss versus sparsity rate on MNIST is illustrated in Figure 3. Three different values of the regularization parameter were tested with the proposed method, and the results before and after retraining are shown. The results show that moderate and strong regularization, with the regularization parameter equal to 2 and 10, respectively, perform better both before and after retraining. We picked a regularization parameter of 2 as a trade-off between pruning and convergence of the loss function while preventing over-fitting. We used both L1 and L2 regularization, and the results are illustrated in Figure 3. L1 performs better before retraining, while L2 achieves better performance after retraining.
The second model tested on MNIST is a convolutional network, LeNet-5, which has two convolutional layers followed by two fully connected layers. LeNet-5 achieves a 1.6% error rate on the MNIST dataset. The accuracy versus sparsity of LeNet-5 on MNIST is illustrated in Figure 4, which shows that our method achieves the same accuracy as the baseline at different sparsity rates.
3.1.3 LeNet-5 on CIFAR-10
We use LeNet-5 on CIFAR-10; the comparison between the accuracy of the proposed method and the baseline method over 5 trials is shown in Figure 4. The results show that the proposed method is more reliable and has a lower standard deviation while preserving the original accuracy, as it is not based on thresholding.
3.1.4 VGG-16 on down-sampled ImageNet
We used VGG-16 on ImageNet data [ILSVRC15] with 1000 different classes, but first down-sampled the images as in [oord2016pixel]. Apart from a single crop with no rotation, we did not use any other pre-processing or augmentation. For this dataset we used the largest batch size allowed by the GPU memory, 32 images per batch. We then classified down-sampled ImageNet using VGG-16, modified to fit the spatial size of the down-sampled images, since the feature maps must retain enough spatial size before each pooling layer. The size of the fully connected layers was changed to 2048 and the last pooling layer was eliminated. The results are shown in Figure 4, which indicates that the proposed pruning method can preserve the accuracy even at high sparsity rates.
3.2 65nm CMOS Hardware Implementation
We have synthesized the baseline and proposed methods in 65nm CMOS technology to measure hardware metrics. The implementation parameters are shown in Table 1. The pre-layout analysis demonstrates a 1.51× to 2.94× reduction in required memory footprint for the proposed method compared with the baseline pruning technique with 4-bit and 8-bit indices (Figure 5). Besides memory, the parameters of the overall system (memory, multiplier, accumulator and input/output buffers) are also measured. The power and area measurements are reported in Tables 4 and 5. We observe a maximum of 63.96% power savings and 68.18% area savings across varying sparsity, indexing bit-widths and baseline designs. Although significant savings are observed for the proposed method, it should be noted that the LFSR-based column indexing introduces additional output buffer accesses (one read cycle and one write cycle). This is included in our design and results. We note that the additional power is negligible compared to the total power savings.
In this paper we propose a new indexing method for sparse-network inference that improves the memory usage and energy efficiency of DNNs. To achieve this, we use LFSR-based indexing, generating two pseudo-random sequences as indices instead of storing the indices in a separate memory. The generated indices are used to decide which weights are pruned and which are retained. We show that our method can prune large networks without loss of accuracy. In addition, we demonstrate that a maximum of 63.96% power savings and 68.18% area savings can be achieved across varying sparsity and indexing bit-widths.
This project was supported by the Semiconductor Research Corporation under grant JUMP CBRIC task ID 2777.004, 2777.005 and 2777.006.