Tiny but Accurate: A Pruned, Quantized and Optimized Memristor Crossbar Framework for Ultra Efficient DNN Implementation

08/27/2019 ∙ by Xiaolong Ma, et al. ∙ Northeastern University ∙ George Mason University ∙ Florida International University ∙ University of Connecticut

State-of-the-art DNN structures involve intensive computation and high memory storage. To mitigate these challenges, the memristor crossbar array has emerged as an intrinsically suitable matrix computation and low-power acceleration framework for DNN applications. However, a high-accuracy solution for extreme model compression on the memristor crossbar array architecture is still lacking. In this paper, we propose a memristor-based DNN framework which combines both structured weight pruning and quantization by incorporating the alternating direction method of multipliers (ADMM) algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data paths in a structured-pruned model. Motivated by these discoveries, we design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms, targeting post-processing of a structured-pruned model after the ADMM steps. By taking memristor hardware constraints into our whole framework, we achieve extremely high compression ratios on state-of-the-art neural network structures with minimal accuracy loss. When quantizing a structured-pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at the anonymous link https://bit.ly/2VnMUy0.


1 Introduction

Structured weight pruning [1, 2, 3] and weight quantization [4, 5, 6] techniques have been developed to facilitate weight compression and computation acceleration in response to the high demand for parallel computation and storage resources [7, 8, 9]. Even though models are compressed, computational complexity still burdens the overall performance of state-of-the-art CMOS hardware implementations.

To mitigate the bottleneck caused by CMOS-based DNN architectures, next-generation device/circuit technologies [10, 11] surpass CMOS in their non-volatility, high energy efficiency, in-memory computing capability and high scalability. The memristor crossbar device has shown its potential to deliver all of these characteristics, which makes it intrinsically suitable for large DNN hardware architecture designs. A memristor crossbar device can perform matrix-vector multiplication in the analog domain, and the computation completes in O(1) time complexity [12, 13]. Motivated by the fact that no precedent model is both structured-pruned and quantized while satisfying memristor hardware constraints, in this work a memristor-based ADMM regularized optimization method is applied to both structured pruning and weight quantization in order to mitigate accuracy degradation during extreme model compression. A structured-pruned model can potentially benefit high-parallelism implementation in a crossbar architecture. Furthermore, quantized weights reduce hardware imprecision during read/write procedures and save hardware footprint, since fewer peripheral circuits are needed to support fewer bits.

However, to achieve an ultra-high compression ratio, the ADMM pruning method [3, 14] cannot fully exploit all redundancy in a neural network model. As a result, we design a hardware-software co-optimization framework in which we investigate Network Purification and Unused Path Removal after the structured weight pruning procedure with ADMM. Moreover, we utilize distilled knowledge from a software model to guide our quantization under memristor hardware constraints. To the best of our knowledge, we are the first to combine extreme structured weight pruning and weight quantization in a unified and systematic memristor-based framework. We are also the first to discover the redundant weights and unused paths in a structured-pruned DNN model, and we design a sophisticated co-optimization framework to achieve a higher model compression rate while maintaining high network accuracy. By incorporating memristor hardware constraints in our model, our framework is guaranteed feasible for a real memristor crossbar device. The contributions of this paper include:

  • We adopt ADMM to efficiently optimize the non-convex problem and utilize this method for structured weight pruning.

  • We systematically investigate weight quantization of a pruned model under memristor hardware constraints.

  • We design a software-hardware co-optimization framework in which Network Purification and Unused Path Removal are proposed for the first time.

We evaluate our proposed memristor framework on different networks. We conclude that the structured pruning method with memristor-based ADMM regularized optimization achieves a high compression ratio together with desirably high accuracy. Software and hardware experimental results show that our memristor framework is very energy efficient and saves a great amount of hardware footprint.

2 Related Works

Heuristic weight pruning methods [15] are widely used in neuromorphic computing designs to reduce weight storage and computing delay [16]. [16] implemented weight pruning techniques on a neuromorphic computing system using irregular pruning, which caused unbalanced workloads, greater circuit overheads and extra memory requirements for indices. To overcome these limitations, [17] proposed group connection deletion, which structurally prunes connections to reduce routing congestion between memristor crossbar arrays.

Weight quantization can mitigate hardware imperfections of memristors, including state drift and process variations, caused by the imperfect fabrication process or by the device features themselves [4, 5]. [18] presented a technique to reduce the overhead of digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) in resistive random-access memory (ReRAM) neuromorphic computing systems. They first normalized the data and then quantized intermediate data to 1-bit values, which can be used directly as the analog input for the ReRAM crossbar and hence avoids the need for DACs.

3 Background on Memristors

3.1 Memristor Crossbar Model

A memristor [10] crossbar is an array structure consisting of memristors, horizontal word-lines and vertical bit-lines, as shown in Figure 1. Due to its outstanding performance on computing matrix-vector multiplications (MVM), memristor crossbars are widely used as dot-product accelerators in recent neuromorphic computing designs [19]. By programming the conductance state (also known as “memristance”) of each memristor, the weight matrix can be mapped onto the memristor crossbar. Given the input voltage vector $\mathbf{V}$, the MVM output current vector $\mathbf{I} = \mathbf{G}\mathbf{V}$, where $\mathbf{G}$ is the programmed conductance matrix, can be obtained in $O(1)$ time complexity.

3.2 Challenges in Memristor Crossbar Implementation and Mitigation Techniques

Different from software-based designs, hardware imperfection is one of the key issues that cause non-ideal hardware behaviors and must be considered in memristor-based designs. The hardware imperfections of memristor devices mainly come from the imperfect fabrication process and from the device features themselves.

Process Variation. Process variation is a major hardware imperfection caused by fluctuations in the fabrication process. It mainly comes from line-edge roughness, oxide thickness fluctuations, and random dopant variations [20]. Inevitably, process variation plays an increasingly significant role as the process technology scales down to the nanometer level. In a DNN hardware design, the non-ideal behaviors caused by process variations may lead to accuracy degradation.

State Drift. State drift is the phenomenon that the memristance changes after several reading operations [21]. A memristor is a thin-film device constructed from a region highly doped with oxygen vacancies and an undoped region. When an electric field is applied across the memristor over a period of time, the oxygen vacancies migrate along the direction of the electric field, which leads to the (memristance) state drift. Consequently, an error occurs when the state of a memristor drifts to another state level.

It has been shown that applying quantization in memristor-based designs can mitigate the undesired impact of hardware imperfections [22].

Figure 1: Memristor and memristor crossbar

4 A Memristor-Based Highly Compressed DNN Framework

The memristor crossbar structure has shown its potential for neuromorphic computing systems compared to CMOS technologies [16]. Due to the great number of weights and computations involved in networks, an efficient, high-performance framework is needed to overcome the memory storage and energy consumption problems. We propose a unified memristor-based framework including memristor-based ADMM regularized optimization and masked mapping.

4.1 Problem Formulation

ADMM [23] is an advanced optimization technique which decomposes an original problem into subproblems that can be solved separately and iteratively. By adopting memristor-based ADMM regularized optimization, the framework can guarantee solution feasibility (satisfying memristor hardware constraints) while providing high solution quality (no obvious accuracy degradation after pruning).

First, the memristor-based ADMM regularized optimization starts from a pre-trained, full-size DNN model without compression. Consider an $N$-layer DNN, where the weights of the $i$-th (CONV or FC) layer are denoted by $\mathbf{W}_i$ and the biases by $\mathbf{b}_i$, and let $f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\})$ denote the loss function associated with the DNN. The overall problem is defined by

$$\min_{\{\mathbf{W}_i\}, \{\mathbf{b}_i\}} \; f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}), \quad \text{subject to } \mathbf{W}_i \in S_i, \; \mathbf{W}_i \in S'_i, \; i = 1, \dots, N. \tag{1}$$

Here $S_i = \{\mathbf{W}_i \mid \text{the number of nonzero structures in } \mathbf{W}_i \text{ is at most } \alpha_i\}$ is the memristor-based sparsity constraint set and $S'_i = \{$the weights in the $i$-th layer are mapped to the quantization values$\}$, where $\alpha_i$ is a predefined hyperparameter. The general constraint can be extended to structured pruning such as filter pruning, channel pruning and column pruning, which facilitates high-parallelism implementation in hardware.

Similarly, for weight quantization, elements in $S'_i$ are restricted to the available quantization values. Assume $Q_i = \{q_{i,1}, q_{i,2}, \dots, q_{i,M_i}\}$ is the set of available memristor state values, where $M_i$ denotes the number of available quantization levels in layer $i$. Suppose $q_{i,j}$ indicates the $j$-th quantization level in layer $i$, which gives

$$q_{i,j} = m_{min} + (j-1)\,\frac{m_{max} - m_{min}}{M_i - 1}, \quad j = 1, \dots, M_i, \tag{2}$$

where $m_{min}$ and $m_{max}$ are the minimum and maximum memristance values of a specified memristor device.

Figure 2: Illustration of filter-wise, channel-wise and shape-wise structured sparsities.
Figure 3: Structured weight pruning and reduction of hardware resources

4.2 Memristor-Based ADMM Regularized Optimization Step

Corresponding to each memristor-based constraint set $S_i$ and $S'_i$, indicator functions are utilized to incorporate $S_i$ and $S'_i$ into the objective function:

$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in S_i, \\ +\infty & \text{otherwise,} \end{cases} \qquad h_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in S'_i, \\ +\infty & \text{otherwise,} \end{cases}$$

for $i = 1, \dots, N$. Substituting into (1), we get

$$\min_{\{\mathbf{W}_i\}, \{\mathbf{b}_i\}} \; f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} h_i(\mathbf{Y}_i), \quad \text{subject to } \mathbf{W}_i = \mathbf{Z}_i, \; \mathbf{W}_i = \mathbf{Y}_i. \tag{3}$$

We incorporate auxiliary variables $\mathbf{Z}_i$ and $\mathbf{Y}_i$ and dual variables $\mathbf{U}_i$ and $\mathbf{V}_i$, and the augmented Lagrangian formation of problem (3) is

$$L_\rho = f(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}) + \sum_{i=1}^{N} \frac{\rho_1}{2}\,\big\|\mathbf{W}_i - \mathbf{Z}_i + \mathbf{U}_i\big\|_F^2 + \sum_{i=1}^{N} \frac{\rho_2}{2}\,\big\|\mathbf{W}_i - \mathbf{Y}_i + \mathbf{V}_i\big\|_F^2. \tag{4}$$

We can see that the first term in problem (4) is the original DNN loss function, and the second and third terms are differentiable and convex. As a result, subproblem (4) can be solved by stochastic gradient descent [24], as in original DNN training.

The standard ADMM algorithm [23] proceeds by repeating, for iteration $k = 0, 1, \dots$, the following subproblems:

$$\{\mathbf{W}_i^{k+1}, \mathbf{b}_i^{k+1}\} := \arg\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} L_\rho\big(\{\mathbf{W}_i\}, \{\mathbf{b}_i\}, \{\mathbf{Z}_i^k\}, \{\mathbf{Y}_i^k\}, \{\mathbf{U}_i^k\}, \{\mathbf{V}_i^k\}\big) \tag{5}$$
$$\mathbf{Z}_i^{k+1} := \Pi_{S_i}\big(\mathbf{W}_i^{k+1} + \mathbf{U}_i^k\big), \qquad \mathbf{Y}_i^{k+1} := \Pi_{S'_i}\big(\mathbf{W}_i^{k+1} + \mathbf{V}_i^k\big) \tag{6}$$
$$\mathbf{U}_i^{k+1} := \mathbf{U}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Z}_i^{k+1}, \qquad \mathbf{V}_i^{k+1} := \mathbf{V}_i^k + \mathbf{W}_i^{k+1} - \mathbf{Y}_i^{k+1} \tag{7}$$

where (5) is the proximal step, (6) is the projection step (with $\Pi$ denoting Euclidean projection), and (7) is the dual variables update.

The optimal solution of (6) is the Euclidean projection (masked mapping) of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ and $\mathbf{W}_i^{k+1} + \mathbf{V}_i^k$ onto $S_i$ and $S'_i$. Namely, structures in the solution whose magnitudes fall below the pruning threshold are set to zero, while the kept elements are quantized to the closest valid memristor state value.

4.3 Memristor-Based Structured Weight Pruning

In order to accommodate high-parallelism implementation in hardware, we use the structured pruning method [1] instead of the irregular pruning method [15] to reduce the size of the weight matrix while avoiding the extra memory storage required for indices. Figure 2 shows the different types of structured sparsity: filter-wise sparsity, channel-wise sparsity and shape-wise sparsity.

Figure 3 (a) shows the general matrix multiplication (GEMM) view of the DNN weight matrix and the different structured weight pruning methods. Structured pruning corresponds to removing rows (filter-wise), columns (shape-wise), or a combination of the two. After structured weight pruning, the remaining weight matrix is still regular and requires no extra indices.

Figure 3 (b) illustrates the memristor crossbar schematic size reduction resulting from the corresponding structured weight pruning, and Figure 3 (c) shows the physical view of the memristor crossbar blocks. A CONV layer with $F$ filters and $C$ channels, which together comprise $n$ columns per filter in the GEMM view, is denoted as $\mathbf{W}$. Due to the increasing reading/writing errors caused by expanding the memristor crossbar size, we limit our design to multiple 128×64 [25] crossbars for all DNN layers. In Figure 3 (c), $x$ and $y$ denote the columns and rows of each crossbar, $a_1, \dots, a_n$ represent the inputs, and $n$ is the column number also shown in Figure 3 (a). One can derive that $\lceil n/y \rceil$ different crossbars are needed to store one filter's weights as a block unit, so in total $\lceil F/x \rceil$ blocks store $\mathbf{W}$. Within each block, the outputs of each crossbar are propagated through an ADC, and we then sum the intermediate results of all crossbars column-wise.

5 Software-hardware Co-optimization

Due to the non-optimality of the ADMM process and the accuracy degradation caused by quantizing a sparse DNN, a software-hardware co-optimization framework is desired. In this section we propose: (i) Network Purification and Unused Path Removal to efficiently remove redundant channels or filters; (ii) memristor model quantization using distilled knowledge from a software helper model.

5.1 Network Purification and Unused Path Removal

Weight pruning with memristor-based ADMM regularized optimization can significantly reduce the number of weights while maintaining high accuracy. However, does the pruning process really remove all unnecessary weights?

From our analysis of the DNN data flow, we find that if a whole filter is pruned, then after general matrix multiplication (GEMM) the feature maps generated by this filter will be “blank”. If those “blank” feature maps are input to the next layer, the corresponding unused input channel weights become removable. By the same token, a pruned channel also makes the corresponding filter in the previous layer removable. Figure 4 gives a clear illustration of the correspondence between pruned filters/channels and the resulting unused channels/filters.

To better exploit the unused path removal effect discussed above, we derive an emptiness ratio parameter to define what can be treated as an empty channel. Suppose $n_i$ is the number of columns per channel in layer $i$, and $j$ is the channel index. We have

$$R_{i,j} = \frac{\#\,\text{of all-zero columns in channel } j}{n_i}. \tag{8}$$

If $R_{i,j}$ exceeds a pre-defined threshold $\theta$, we can assume that this channel is empty and thus actually prune every column in it. However, if we remove all columns in every channel satisfying $R_{i,j} > \theta$, a dramatic accuracy drop will occur that is hard to recover by retraining, because some relatively “important” weights may be removed. To mitigate this problem, we design a Network Purification algorithm that deals with the non-optimality of the ADMM process. We set up a criterion constant $c_{i,j}$ to represent channel $j$'s importance score, which is derived from an accumulation procedure over the channel's remaining columns:

$$c_{i,j} = \sum_{k=1}^{n_i} \big\|(\mathbf{W}_i)_{j,k}\big\|_1, \tag{9}$$

where $(\mathbf{W}_i)_{j,k}$ denotes the $k$-th column of channel $j$ in layer $i$.

One can think of this process as collecting evidence for whether each channel that still contains one or several columns needs to be removed. A channel is treated as empty only when both conditions (8) and (9) are satisfied, i.e., $R_{i,j}$ is above its threshold $\theta$ and $c_{i,j}$ is below its threshold $\gamma$. Network Purification also works on purifying the remaining filters and thus removes more unused paths in the network. Algorithm 1 shows our generalized P-RM method, where $\theta$ and $\gamma$ are hyperparameter threshold values.

Figure 4: Unused data path caused by structured pruning
| Method | Original Accuracy | Compression Rate w/o P-RM | Accuracy w/o P-RM | Prune Ratio w/ P-RM | Accuracy w/ P-RM | 8-bit Quantization Accuracy |
|---|---|---|---|---|---|---|
| MNIST | | | | | | |
| Group Scissor [17] | 99.15% | 4.16× | 99.14% | N/A | N/A | N/A |
| our LeNet-5 | 99.17% | 23.18× | 99.20% | 39.23× | 99.20% | 99.16% |
| | | 34.46× | 99.06% | 87.93×* | 99.06% | 99.04% |
| | | 45.54× | 98.48% | 231.82× | 98.48% | 98.05% |
| CIFAR-10 | | | | | | |
| Group Scissor [17] | 82.01% | 2.35× | 82.09% | N/A | N/A | N/A |
| our ConvNet | 84.41% | 2.35× | 84.55% | N/A | N/A | 84.33% |
| | | 2.93×* | 84.53% | N/A | N/A | 83.93% |
| | | 5.88× | 83.58% | N/A | N/A | 83.01% |
| our VGG-16 | 93.70% | 20.16× | 93.36% | 44.67× | 93.36% | 93.04% |
| | | | | 50.02×* | 92.73% | 92.46% |
| our ResNet-18 | 94.14% | 5.83× | 93.79% | 52.07× | 93.79% | 93.71% |
| | | 15.14× | 93.20% | 59.84×* | 93.22% | 93.27% |
| ImageNet ILSVRC-2012 | | | | | | |
| SSL [1] (AlexNet) | 80.40% | 1.40× | 80.40% | N/A | N/A | N/A |
| our AlexNet | 82.40% | 4.69× | 81.76% | 5.13× | 81.76% | 80.45% |
| our ResNet-18 | 89.07% | 3.02× | 88.41% | 3.33× | 88.36% | 88.47% |
| our ResNet-50 | 92.86% | 2.00× | 92.26% | 2.70× | 92.27% | 92.20% |

*Number of parameters reduced — LeNet-5: 25.2K; ConvNet: 102.30K; VGG-16: 14.42M; ResNet-18 (CIFAR-10): 10.97M. For ImageNet, the numbers of parameters reduced are — AlexNet: 1.66M; ResNet-18: 7.81M; ResNet-50: 14.77M.

Table 1: Structured weight pruning results for multi-layer networks on the MNIST, CIFAR-10 and ImageNet datasets (P-RM: Network Purification and Unused Path Removal). ImageNet accuracies are reported as Top-5 accuracy.
Result: redundant weights and unused paths removed
Load ADMM-pruned model;
n_i ← number of columns per channel in layer i;
for layer i ← 1 until last layer do
        for channel j ← 1 until last channel in layer i do
                for each column k in channel j do
                        calculate (8) and (9);
                end for
                if R_{i,j} > θ and c_{i,j} < γ then
                        prune(channel_{i,j});
                        prune(filter_{i-1,j}) when i > 1;
                end if
        end for
        for filter t ← 1 until last filter in layer i do
                if filter_{i,t} is empty or its magnitude is below threshold then
                        prune(filter_{i,t});
                        prune(channel_{i+1,t}) when i < last layer index;
                end if
        end for
end for
Algorithm 1: Network Purification & Unused Path Removal

5.2 Memristor Weight Quantization

Traditionally, a DNN in software is composed of 32-bit floating-point weights, but on a memristor device the weights of a neural network are represented by the memristance of the memristors (i.e., the memristance range constraint in the ADMM process). Due to the limited memristance range of memristor devices, weight values exceeding the memristance range cannot be represented precisely. Meanwhile, the mismatch between the written value and the exact value when mapping weights onto the memristor crossbar will also cause read errors if the value shift exceeds the state-level range.

In order to mitigate the memristance range limitation and the mapping mismatch, a larger gap between state levels is needed, which means fewer bits are used to represent the weights. To better maintain accuracy, we use a pretrained high-accuracy teacher model to provide a distillation loss that is added to the loss of our memristor model (referred to as the student model) for better training performance.

$$L = \alpha \, L_S(\mathbf{z}_s, y) + (1-\alpha)\, T^2 \, L_D\big(\sigma(\mathbf{z}_s / T), \sigma(\mathbf{z}_t / T)\big) \tag{10}$$

$L_S$ in the first term of (10) is the memristor model (student) loss, and $L_D$ in the second term is the distillation loss between student and teacher. $\mathbf{z}_s$ and $\mathbf{z}_t$ are the outputs of the student and the teacher, $y$ is the ground-truth label, and $\sigma(\cdot)$ denotes the softmax function. $\alpha$ is a balancing parameter, and $T$ is the temperature parameter.

Result: distillation quantization with memristor hardware constraints
S ← student model, pruned and ready to apply quantization;
T ← teacher model with a deeper structure and higher accuracy;
for epoch ← 1 until convergence do
        quantize weights of S to the available memristor state levels;
        calculate loss L of (10) using the outputs of S and T;
        back propagate and update S;
end for
Algorithm 2: Distillation Quantization

6 Experimental Results

In this section, we show the experimental results of our proposed memristor-based DNN framework, in which structured weight pruning and quantization with memristor-based ADMM regularized optimization are included. Our software-hardware co-optimization framework (i.e., Network Purification and Unused Path Removal (P-RM)) is also thoroughly compared. We test the MNIST dataset on LeNet-5 and the CIFAR-10 dataset on ConvNet (4 CONV layers and 1 FC layer), VGG-16 and ResNet-18, and we also show ImageNet results on AlexNet, ResNet-18 and ResNet-50. The accuracies of the pruned and quantized models are tested on our software models incorporating the memristor hardware constraints. Models are trained on a server with eight NVIDIA GTX 2080Ti GPUs using the PyTorch API. Our memristor model is implemented in MATLAB, and NVSim [26] is used to calculate the power consumption and area cost of the memristors and memristor crossbars. The 1R crossbar structure is used in our design, and we choose a memristor device with specified minimum and maximum memristance values $m_{min}$ and $m_{max}$. The memristor precision is 4-bit, which indicates that 16 state levels can be represented by a single memristor device, and two memristors are combined to represent an 8-bit weight in our framework. For the peripheral circuits, power and area are calculated based on 45nm technology, and H-tree distribution networks are used to access all the memristor crossbars.

Figure 5: Effect of removing redundant weights and unused paths. (dataset: CIFAR-10; Accuracy: VGG-16-93.36%, ResNet-18-93.79%)

Table 1 shows groups of different prune ratios and 8-bit quantization accuracies for each network structure. Figure 5 supports our earlier argument that ADMM's non-optimality exists in a structured-pruned model and that P-RM can further optimize the result. Please note that all results are obtained without retraining. Below are some result highlights on the different datasets and network structures.

MNIST. With the LeNet-5 network, compared to the original accuracy (99.17%), our proposed P-RM framework achieves 231.82× compression with only minor accuracy loss, while our lower compression ratios are lossless. No accuracy loss is observed after quantization of the 39.23× and 87.93× models, and there is only a 0.4% accuracy drop on the 231.82× model. By contrast, Group Scissor [17] only reaches a 4.16× compression rate.

CIFAR-10. The ConvNet structure is relatively shallow, so ADMM already reaches a near-optimal local minimum and post-processing is not necessary. Even so, we outperform Group Scissor [17] in accuracy (84.55% vs. 82.09%) at the same compression rate (2.35×). For the larger networks, when a minor accuracy loss is allowed, our proposed P-RM method improves the prune ratio to 50.02× and 59.84× on VGG-16 and ResNet-18 respectively, with no obvious accuracy loss after quantization of the pruned models.

ImageNet. Our AlexNet model outperforms SSL [1] both in compression rate (4.69× vs. 1.40×) and network accuracy (81.76% vs. 80.40%), with or without P-RM. Our ResNet-18 and ResNet-50 models also achieve 3.33× compression with 88.36% accuracy and 2.70× with 92.27%, respectively. No accuracy loss is observed after quantization of the pruned ResNet-18/50 models, and around a 1% accuracy loss on the 5.13× compressed AlexNet model.

Table 2 shows our highlighted memristor crossbar power and area comparisons for the ResNet-18 and VGG-16 models. By using our proposed P-RM method, the area and power of the ResNet-18 model are reduced from 0.235 (0.117) and 3.359 (1.622) to 0.042 (0.041) and 0.585 (0.556), respectively, without any accuracy loss. For the VGG-16 model, our P-RM method reduces the area and power from 0.113 and 1.611 to 0.056 (0.053) and 0.824 (0.754), where a compression ratio of 44.67× (50.02×) is achieved with 0% (0.63%) accuracy degradation.

Table 2: Area/power comparison between models with and without P-RM on ResNet-18 and VGG-16 on CIFAR-10

7 Conclusion

In this paper, we designed a unified memristor-based DNN framework which is tiny in overall hardware footprint and accurate in test performance. We incorporate ADMM into structured weight pruning and quantization to reduce model size in order to fit our designed tiny framework. We identify the non-optimality of the ADMM solution and design Network Purification and Unused Path Removal in our software-hardware co-optimization framework, which achieves better results compared to Group Scissor [17] and SSL [1]. On AlexNet, VGG-16 and ResNet-18/50, after structured weight pruning and 8-bit quantization, model size, power and area are significantly reduced with negligible accuracy loss.

References