A SOT-MRAM-based Processing-In-Memory Engine for Highly Compressed DNN Implementation

11/24/2019, by Geng Yuan et al., Northeastern University and University of Connecticut

The computing wall and data movement challenges of deep neural networks (DNNs) have exposed the limitations of conventional CMOS-based DNN accelerators. Furthermore, the deep structure and large model size make DNNs prohibitive for embedded systems and IoT devices, where low power consumption is required. To address these challenges, spin orbit torque magnetic random-access memory (SOT-MRAM) and SOT-MRAM based Processing-In-Memory (PIM) engines have been used to reduce the power consumption of DNNs, since SOT-MRAM offers near-zero standby power, high density, and non-volatility. However, the drawbacks of SOT-MRAM based PIM engines, such as high write latency and the need for low bit-width data, decrease their popularity as favorable energy efficient DNN accelerators. To mitigate these drawbacks, we propose an ultra energy efficient framework that applies model compression techniques, including weight pruning and quantization, at the software level while accounting for the architecture of the SOT-MRAM PIM. We further incorporate the alternating direction method of multipliers (ADMM) into the training phase to guarantee solution feasibility and satisfy SOT-MRAM hardware constraints. Thus, the footprint and power consumption of the SOT-MRAM PIM can be reduced while the overall system throughput is increased, making our proposed ADMM-based SOT-MRAM PIM more energy efficient and suitable for embedded systems or IoT devices. Our experimental results show that the accuracy and compression rate of our proposed framework consistently outperform the reference works, while the efficiency (area and power) and throughput of the SOT-MRAM PIM engine are significantly improved.


I Introduction

Large-scale DNNs achieve significant improvements on many challenging problems, such as image classification [1], speech recognition [2], and natural language processing [3]. However, as the number of layers and the layer sizes both expand, the resulting intensive computation and storage have brought challenges to the traditional Von-Neumann architecture [4], such as the computing wall, massive data movement, and high power consumption [5, 6]. Furthermore, the deep structure and large model size make DNNs prohibitive for embedded systems and IoT devices, where low power consumption is required.

To address these challenges, spin orbit torque magnetic random-access memory (SOT-MRAM) has been used to reduce the power consumption of DNNs, since it offers near-zero standby power, high density, and non-volatility [7]. Combined with the in-memory computing technique [8, 9], an SOT-MRAM based Processing-In-Memory (PIM) engine can perform arithmetic and logic computations in parallel. Therefore, the most intensive operation, the multiplication and accumulation (MAC) in both convolutional (CONV) and fully-connected (FC) layers, can be implemented using bit-wise parallel AND, bit-count, bit-shift, etc.

Compared to SRAM and DRAM, SOT-MRAM has higher write latency and energy [10, 11, 7], which decreases its popularity as a favorable energy efficient DNN accelerator. To mitigate these drawbacks at the software level, weight quantization has been introduced to SOT-MRAM PIM. By using a binarized weight representation, [10] efficiently processes data within SOT-MRAM, greatly reducing power consumption and avoiding long-distance data communication. However, weight binarization causes accuracy degradation, especially for large-scale DNNs. In reality, many scenarios, e.g., self-driving cars, require accuracy as high as possible. Thus, a binarized weight representation is not favorable.

In this work, we propose an ultra energy efficient framework that applies model compression techniques [12, 13, 14, 15, 16], including weight pruning and quantization, at the software level while accounting for the architecture of the SOT-MRAM PIM. To guarantee solution feasibility and satisfy SOT-MRAM hardware constraints while providing high solution quality (no obvious accuracy degradation after model compression and hardware mapping), we incorporate the alternating direction method of multipliers (ADMM) into the training phase. As a result, we reduce the footprint and power consumption of the SOT-MRAM PIM and improve the overall system throughput, making our proposed ADMM-based SOT-MRAM PIM more energy efficient and suitable for embedded systems or IoT devices.

In the following, we first illustrate how to connect ADMM to our model compression techniques in order to achieve deeply compressed models that are tailored to SOT-MRAM PIM designs. Then we introduce the architecture and mechanism of the SOT-MRAM PIM and how to map our compressed model onto it. Finally, we evaluate our proposed framework on different networks. The experimental results show that the accuracy and compression rate of our framework consistently outperform the baseline works, and that the efficiency (area and power) and throughput of the SOT-MRAM PIM engine are significantly improved.

II A Unified and Systematic Model Compression Framework for Efficient SOT-MRAM Based PIM Engine Design

In this section, we propose a unified and systematic framework for DNN model compression, which simultaneously and efficiently achieves DNN weight pruning and quantization. By reformulating the pruning and quantization problems as optimization problems, the proposed framework solves the structured pruning and low-bit quantization problems iteratively and analytically by extending the powerful ADMM algorithm [17].

In the meantime, our structured pruning model has a unique (i.e., tiny and regular) spatial property that naturally fits the SOT-MRAM based processing-in-memory engine, thereby successfully bridging the gap between large-scale DNN inference and platforms with limited computing ability. After mapping the compressed DNN model onto the SOT-MRAM based processing-in-memory engine, bit-wise convolution can be executed with consistently high efficiency.

II-A DNN Weight Pruning Using ADMM

Consider an $N$-layer DNN of interest, where the first layers are CONV layers and the rest are FC layers. The weights and biases of the $i$-th layer are respectively denoted by $\mathbf{W}_i$ and $\mathbf{b}_i$, and the loss function associated with the DNN is denoted by $f(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N)$; see [18]. In this paper, $\{\mathbf{W}_i\}_{i=1}^N$ and $\{\mathbf{b}_i\}_{i=1}^N$ respectively characterize the collection of weights and biases from layer $1$ to layer $N$.

In this paper, our objective is to implement structured pruning on the DNN. In the following discussion we focus on the CONV layers because they have the highest computation requirements. More specifically, we minimize the loss function subject to specific structured sparsity constraints on the weights in the CONV layers, i.e.,

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big), \qquad (1)$$
$$\text{subject to} \;\; \mathbf{W}_i \in \mathbf{S}_i, \;\; i = 1, \dots, N,$$

where $\mathbf{S}_i$ is the set of $\mathbf{W}_i$ with the desired “structure”. Next we introduce constraint sets corresponding to different types of structured sparsity to facilitate the SOT-MRAM PIM engine implementation.

The collection of weights in the $i$-th CONV layer is a four-dimensional tensor, i.e., $\mathbf{W}_i \in \mathbb{R}^{A_i \times B_i \times H_i \times K_i}$, where $A_i$, $B_i$, $H_i$, and $K_i$ are respectively the number of filters, the number of channels in a filter, the height of the filter, and the width of the filter in layer $i$.

Filter-wise structured sparsity: When we train a DNN with sparsity at the filter level, the constraint on the weights in the $i$-th CONV layer is given by $\mathbf{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero filters in } \mathbf{W}_i \text{ is at most } \alpha_i\}$. Here, a nonzero filter means that the filter contains some nonzero weight.

Channel-wise structured sparsity: When we train a DNN with sparsity at the channel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathbf{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero channels in } \mathbf{W}_i \text{ is at most } \beta_i\}$. Here, we call the $b$-th channel nonzero if $(\mathbf{W}_i)_{:,b,:,:}$ contains some nonzero element.

Kernel-wise structured sparsity: When we train a DNN with sparsity at the kernel level, the constraint on the weights in the $i$-th CONV layer is given by $\mathbf{S}_i = \{\mathbf{W}_i \mid \text{the number of nonzero kernels, i.e., nonzero } H_i \times K_i \text{ slices } (\mathbf{W}_i)_{a,b,:,:}, \text{ is at most } \gamma_i\}$.
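For reference, the following is a minimal PyTorch sketch (not from the paper; the tensor layout, helper name, and the per-layer budgets alpha_i, beta_i, gamma_i are illustrative assumptions) showing how the nonzero filters, channels, and kernels of a 4-D CONV weight tensor can be counted when checking these constraints.

import torch

def structured_nonzero_counts(W: torch.Tensor):
    """Count the nonzero filters, channels, and kernels of a CONV weight
    tensor W with shape (filters, channels, height, width).
    Illustrative sketch only."""
    filters = int((W.abs().sum(dim=(1, 2, 3)) > 0).sum())   # nonzero slices W[a, :, :, :]
    channels = int((W.abs().sum(dim=(0, 2, 3)) > 0).sum())  # nonzero slices W[:, b, :, :]
    kernels = int((W.abs().sum(dim=(2, 3)) > 0).sum())      # nonzero slices W[a, b, :, :]
    return filters, channels, kernels

# A weight tensor satisfies, e.g., the filter-wise constraint when the
# first count does not exceed that layer's filter budget (alpha_i above).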

To solve this problem, we propose a systematic framework consisting of a dynamic ADMM regularization step and a masked mapping and retraining step. Through this integration, we can guarantee solution feasibility (satisfying all constraints) and provide high solution quality.

II-B Solution to the DNN Pruning Problem

Corresponding to every set $\mathbf{S}_i$, we define the indicator function
$$g_i(\mathbf{W}_i) = \begin{cases} 0 & \text{if } \mathbf{W}_i \in \mathbf{S}_i, \\ +\infty & \text{otherwise.} \end{cases}$$
Furthermore, we incorporate auxiliary variables $\mathbf{Z}_i$, $i = 1, \dots, N$. The original problem (1) is then equivalent to

$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big) + \sum_{i=1}^{N} g_i(\mathbf{Z}_i), \qquad (2)$$
$$\text{subject to} \;\; \mathbf{W}_i = \mathbf{Z}_i, \;\; i = 1, \dots, N.$$
By adopting the augmented Lagrangian [19] of problem (2), ADMM regularization decomposes (2) into two subproblems and solves them iteratively until convergence.
The first subproblem is
$$\min_{\{\mathbf{W}_i\},\{\mathbf{b}_i\}} \; f\big(\{\mathbf{W}_i\}_{i=1}^N, \{\mathbf{b}_i\}_{i=1}^N\big) + \sum_{i=1}^{N} \frac{\rho_i}{2} \big\| \mathbf{W}_i - \mathbf{Z}_i^k + \mathbf{U}_i^k \big\|_F^2, \qquad (3)$$
where $\mathbf{U}_i^k$ is the dual variable updated in the $k$-th ADMM iteration and $\rho_i$ is a penalty parameter. The first term in the objective function of (3) is the differentiable loss function of the DNN, and the second term is a quadratic regularization of the $\mathbf{W}_i$'s, which is differentiable and convex. As a result, (3) can be solved by standard SGD. Although we cannot guarantee global optimality, this is due to the non-convexity of the DNN loss function rather than the quadratic term introduced by our method. Please note that this subproblem and its solution are the same for all types of structured sparsity.
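As a rough illustration, a minimal PyTorch sketch of this ADMM-regularized objective could look as follows; `model`, `rho`, and the per-layer `Z`/`U` dictionaries are assumed bookkeeping, not the authors' actual code.

import torch
import torch.nn.functional as F

def admm_loss(model, output, target, Z, U, rho=1e-3):
    """First ADMM subproblem (3): the DNN loss plus the quadratic term
    (rho/2) * ||W_i - Z_i^k + U_i^k||_F^2 over the constrained CONV layers.
    Z and U map layer names to the auxiliary and dual variables kept from
    the previous ADMM iteration (assumed bookkeeping)."""
    loss = F.cross_entropy(output, target)
    for name, W in model.named_parameters():
        if name in Z:  # only CONV weights under a sparsity constraint
            loss = loss + (rho / 2.0) * torch.norm(W - Z[name] + U[name]) ** 2
    return loss

# The returned loss is then minimized with standard SGD, exactly like
# ordinary DNN training.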
The second subproblem is
$$\min_{\{\mathbf{Z}_i\}} \; \sum_{i=1}^{N} g_i(\mathbf{Z}_i) + \sum_{i=1}^{N} \frac{\rho_i}{2} \big\| \mathbf{W}_i^{k+1} - \mathbf{Z}_i + \mathbf{U}_i^k \big\|_F^2. \qquad (4)$$
Note that $g_i$ is the indicator function of $\mathbf{S}_i$, so this subproblem can be solved analytically and optimally [19]: the optimal $\mathbf{Z}_i^{k+1}$ is the Euclidean projection of $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ onto $\mathbf{S}_i$. The set $\mathbf{S}_i$ is different when we apply different types of structured sparsity, and the corresponding Euclidean projections are described next.

Solving the second subproblem for different structured sparsities: For filter-wise structured sparsity constraints, we first calculate
$$O_a = \big\| (\mathbf{W}_i^{k+1} + \mathbf{U}_i^k)_{a,:,:,:} \big\|_F^2$$
for $a = 1, \dots, A_i$, where $\|\cdot\|_F$ denotes the Frobenius norm. We then keep the elements in $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ corresponding to the $\alpha_i$ largest values in $\{O_a\}$ and set the rest to zero.

For channel-wise structured sparsity, we first calculate
$$O_b = \big\| (\mathbf{W}_i^{k+1} + \mathbf{U}_i^k)_{:,b,:,:} \big\|_F^2$$
for $b = 1, \dots, B_i$. We then keep the elements in $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ corresponding to the $\beta_i$ largest values in $\{O_b\}$ and set the rest to zero.

For kernel-wise structured sparsity, we first calculate
$$O_{a,b} = \big\| (\mathbf{W}_i^{k+1} + \mathbf{U}_i^k)_{a,b,:,:} \big\|_F^2$$
for $a = 1, \dots, A_i$ and $b = 1, \dots, B_i$. We then keep the elements in $\mathbf{W}_i^{k+1} + \mathbf{U}_i^k$ corresponding to the $\gamma_i$ largest values in $\{O_{a,b}\}$ and set the rest to zero.
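These projections reduce to a top-k selection over squared Frobenius norms of slices. The following PyTorch sketch is illustrative only (the `keep` budget corresponds to alpha_i, beta_i, or gamma_i, and the function name is an assumption), not the authors' implementation.

import torch

def project_structured(V: torch.Tensor, keep: int, granularity: str):
    """Euclidean projection of V = W^{k+1} + U^k onto the structured
    sparsity set: keep the `keep` slices with the largest squared
    Frobenius norms and zero out the rest. Sketch only."""
    if granularity == "filter":            # slices V[a, :, :, :]
        norms = V.pow(2).sum(dim=(1, 2, 3))
        mask = torch.zeros_like(norms)
        mask[norms.topk(keep).indices] = 1.0
        return V * mask.view(-1, 1, 1, 1)
    if granularity == "channel":           # slices V[:, b, :, :]
        norms = V.pow(2).sum(dim=(0, 2, 3))
        mask = torch.zeros_like(norms)
        mask[norms.topk(keep).indices] = 1.0
        return V * mask.view(1, -1, 1, 1)
    if granularity == "kernel":            # slices V[a, b, :, :]
        norms = V.pow(2).sum(dim=(2, 3)).flatten()
        mask = torch.zeros_like(norms)
        mask[norms.topk(keep).indices] = 1.0
        return V * mask.view(V.shape[0], V.shape[1], 1, 1)
    raise ValueError("granularity must be 'filter', 'channel', or 'kernel'")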

Fig. 1: Bit-wise convolution using SOT-MRAM based PIM engine

II-C SOT-MRAM Processing-In-Memory Engine for DNN

The main purpose of the SOT-MRAM PIM engine is to convert and perform the MAC operations in convolutional layers in a bit-wise convolution format. A bit-wise convolution includes four main steps: parallel AND, bit-count, bit-shift, and accumulation. They can be formulated as Eqn. (5), where $M$ and $N$ stand for the bit-lengths of the inputs and weights, respectively, $\mathbf{I}^m$ represents the $m$-th bit of all inputs in $\mathbf{I}$, and $\mathbf{W}^n$ contains the $n$-th bit of all weights in $\mathbf{W}$.

Consider a CONV layer with a 2×2 kernel size, where $\mathbf{W}$ contains the 4 weights of a kernel and $\mathbf{I}$ contains the 4 current inputs covered by this kernel. We assume both weights and inputs use a 3-bit representation. From Figure 1 we can see that inputs and weights are mapped to two different sub-arrays. During the computation, two rows are selected, one from each sub-array, at a time, and the parallel AND results are obtained from the sense amplifiers [20]. Each row in one sub-array conducts a parallel AND operation with all the rows of the other sub-array, and every AND result goes through the bit-count and shift units and is then accumulated with the other results to obtain one bit-wise convolution result [11].

$$\mathbf{I} * \mathbf{W} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} 2^{m+n} \, \mathrm{bitcount}\big(\mathrm{AND}(\mathbf{I}^m, \mathbf{W}^n)\big) \qquad (5)$$
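To make Eqn. (5) concrete, the following Python sketch (illustrative only; it uses plain integers in software rather than the SOT-MRAM sense amplifiers) performs one bit-wise convolution between unsigned 3-bit inputs and weights via AND, bit-count, shift, and accumulation, and verifies it against the direct dot product.

def bitwise_conv(inputs, weights, m_bits=3, n_bits=3):
    """Bit-wise convolution per Eqn. (5): decompose unsigned inputs and
    weights into bit planes, AND matching planes, bit-count the result,
    shift by 2^(m+n), and accumulate."""
    result = 0
    for m in range(m_bits):
        i_plane = [(x >> m) & 1 for x in inputs]       # m-th bit of every input
        for n in range(n_bits):
            w_plane = [(w >> n) & 1 for w in weights]  # n-th bit of every weight
            bitcount = sum(a & b for a, b in zip(i_plane, w_plane))  # parallel AND + bit-count
            result += bitcount << (m + n)              # shift and accumulate
    return result

# Example: a 2x2 kernel (4 weights) over its 4 covered inputs, 3-bit values.
inputs, weights = [5, 3, 7, 1], [2, 6, 4, 3]
assert bitwise_conv(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))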
Fig. 2: Inputs and Weights mapping to the PIM Engine

II-D Framework Mapping

In our proposed framework, based on the architecture of the SOT-MRAM-based PIM engine, we incorporate structured pruning and quantization using ADMM to effectively reduce the PIM engine's area and power. Quantization can be integrated into the same ADMM-based framework with different constraints; we omit the details of ADMM quantization due to the space limit. Please note that our ADMM-based framework can achieve weight pruning and quantization simultaneously or separately. Moreover, the overall throughput of the SOT-MRAM based DNN computing system can be improved as well. The SOT-MRAM-based PIM engine contains several processing elements (PEs), and each PE consists of a column decoder, a row decoder, one computing set, and multiple SOT-MRAM sub-arrays, as shown in Figure 2(a), which also shows how we map the inputs and weights to the PIM engine. In each PE, the inputs are mapped onto one sub-array, and every other sub-array accommodates a different filter's weights from the same input channel. For example, one PE computes the convolution of the inputs in its input sub-array with all the weights stored in its remaining weight sub-arrays. The number of columns and the number of rows in each sub-array depend on the kernel size of the network and on the bit-lengths of the inputs and weights, respectively. All PEs are able to work in parallel and independently.
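As a rough sketch of this mapping (under the stated assumptions of one PE per input channel and one weight sub-array per filter in each PE; not a description of the actual controller), the sub-array sizes and counts for one CONV layer can be derived from the layer shape and bit-widths as follows.

def pim_layout(num_filters, num_channels, kernel_h, kernel_w,
               input_bits=8, weight_bits=8):
    """Estimate the PIM layout for one CONV layer, assuming one PE per
    input channel and one weight sub-array per filter inside each PE;
    sub-array columns follow the kernel size and rows the bit-length."""
    return {
        "PEs": num_channels,                          # one PE per input channel
        "weight_subarrays_per_PE": num_filters,       # one sub-array per filter
        "weight_subarray_rows_cols": (weight_bits, kernel_h * kernel_w),
        "input_subarray_rows_cols": (input_bits, kernel_h * kernel_w),
    }

# Example: a 3x3 CONV layer with 64 filters and 32 input channels, 8-bit data.
print(pim_layout(64, 32, 3, 3))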

In Figure 2(b), examples show the structured pruning types used in our proposed framework and how they facilitate the reduction of the PIM engine size. For filter pruning, all the sub-arrays on the same row (i.e., storing the weights of the same filter) can be removed, so the number of sub-arrays contained in each PE is reduced. For channel pruning, since the pruned channels are no longer needed, the number of required PEs can be reduced without decreasing the throughput. Since each sub-array stores the weights of one channel of one filter, which are also the weights of one convolution kernel, kernel pruning removes all the weights on a sub-array. When applying filter pruning and channel pruning, all the pruned sub-arrays or PEs can be physically removed directly. However, removing sub-arrays by kernel pruning may cause uneven sizes among PEs. But since all the sub-arrays have the same size and a sub-array is considered the basic computing unit of bit-wise convolution, the control overhead for handling uneven sub-array numbers across PEs is negligible. An alternative is to use a look-up table (LUT) to mark the pruned sub-arrays and skip them during computation instead of removing them physically.

Each pruning type has its own advantages. Filter pruning and channel pruning have a propagation property, because filter pruning (channel pruning) not only removes the pruned weights but also removes the corresponding output channels (input channels). Taking advantage of this, the corresponding channels (filters) in the next (previous) layer become redundant and can also be removed, as illustrated in the sketch below. Kernel pruning is especially tailored to the SOT-MRAM-based DNN computing system: compared to filter and channel pruning, it provides higher pruning flexibility, which makes it easier to maintain network accuracy under the same pruning rate. None of these pruning types incurs complicated control logic.
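The propagation property can be illustrated with a small PyTorch sketch (a hypothetical example with made-up layer shapes, not the authors' code): removing a filter in layer i makes the corresponding input channel of layer i+1 redundant, so it can be sliced away as well.

import torch

# Two consecutive CONV layers: layer i has 8 filters over 4 channels,
# and layer i+1 consumes those 8 output channels.
W_i   = torch.randn(8, 4, 3, 3)    # (filters, channels, h, w)
W_ip1 = torch.randn(16, 8, 3, 3)

pruned_filters = [2, 5]            # filters removed from layer i
keep = [a for a in range(W_i.shape[0]) if a not in pruned_filters]

W_i_pruned   = W_i[keep]           # filter pruning in layer i -> shape (6, 4, 3, 3)
W_ip1_pruned = W_ip1[:, keep]      # matching input channels of layer i+1 -> (16, 6, 3, 3)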

ADMM-based quantization is also used in our proposed framework. In each weight sub-array, the number of rows equals the bit-length used to represent the weights, so the number of rows in each sub-array can be evenly reduced by quantizing the weights to fewer bits.

Method                         Base Accuracy   Prune Accuracy   CONV Comp. Rate
MNIST (LeNet-5)
  Group Scissor [21]           99.15%          99.14%           4.2×
  Ours                         99.16%          99.12%           81.3×
CIFAR-10 (RNT-18)
  DCP [22]                     88.9%           87.6%            2.0×
  AMC [23]                     90.5%           90.2%            2.0×
  Ours                         94.1%           93.2%            59.8×
CIFAR-10 (VGG-16)
  Iterative Pruning [24, 25]   92.5%           92.2%            2.0×
  2PFPCE [26]                  92.9%           92.8%            4.0×
  Ours                         93.7%           93.3%            44.8×
ImageNet (AlexNet)
  Deep Compression [27]        57.2/82.2%      57.2/80.3%       2.7×
  NeST [28]                    57.2/82.2%      57.2/80.3%       4.2×
  Ours                         57.4/82.4%      57.3/82.2%       5.2×
ImageNet (RNT-18)
  Network Slimming [29]        68.9/88.7%      67.2/87.4%       1.4×
  DCP [22]                     69.6/88.9%      69.2/88.8%       3.3×
  Ours                         69.9/89.1%      69.1/88.4%       3.0×
ImageNet (RNT-50)
  Soft Filter Pruning [30]     76.1/92.8%      74.6/92.1%       1.7×
  ThiNet [31]                  72.9/91.1%      68.4/88.3%       3.3×
  Ours                         76.0/92.8%      75.5/92.3%       2.7×
TABLE I: Structured pruning results on MNIST, CIFAR-10, and ImageNet using LeNet-5, AlexNet, VGG-16, and ResNet-18/50 (RNT-18/50 in the table). Accuracy results for ImageNet are reported as Top-1/Top-5 accuracy.

III Experimental Results

In our experiments, the compressed models are generated based on four widely used network structures, LeNet-5 [32], AlexNet [1], VGG-16 [33], and ResNet-18/50 [34], and are trained on a server with eight NVIDIA RTX-2080Ti GPUs using PyTorch [35]. For the hardware results, we choose a 32nm CMOS technology for the peripheral circuits. Cacti 7 [36] is utilized to compute the energy and area of buffers and on-chip interconnects. The NVSim platform [37] with a modified SOT-MRAM configuration is used to model the SOT-MRAM sub-arrays. The power and area results of the ADC are taken from [38].

Several groups of experiments are performed, and for each dataset and network we show only the result that achieves the highest compression rate with minor accuracy degradation. Our results are based on 8-bit quantization, and we use a combined pruning scheme (i.e., all three pruning types are applied simultaneously). Table I shows that on the MNIST dataset using LeNet-5 we achieve 81.3× compression without accuracy degradation, which is 19.4× higher than Group Scissor [21]. On the CIFAR-10 dataset, our compression rates reach 59.8× and 44.8× on the ResNet-18 and VGG-16 networks with minor accuracy degradation. On ImageNet, our compression rates for AlexNet, ResNet-18, and ResNet-50 are 5.2×, 3.0×, and 2.7×, respectively.

By applying our framework, the power and area of the SOT-MRAM PIM engine can be significantly reduced and the overall system throughput can also be improved compared to the uncompressed design, as shown in Figure 3. From our observations, channel pruning usually contributes more power and area savings than filter and kernel pruning, since it can remove an entire PE along with its peripheral circuits. On the other hand, filter and kernel pruning reduce the computing iterations between sub-arrays, which improves the overall throughput.

Fig. 3: Power/area reduction and throughput improvement over uncompressed models on the MNIST and CIFAR-10 datasets.

IV Conclusion

In this paper, we propose an ultra energy efficient framework that applies model compression techniques, including weight pruning and quantization, at the algorithm level while accounting for the architecture of the SOT-MRAM PIM. We incorporate ADMM into the training phase to guarantee solution feasibility and satisfy SOT-MRAM hardware constraints. The experimental results show that the accuracy and pruning rate of our framework consistently outperform the baseline works. Consequently, the area and power consumption of the SOT-MRAM PIM can be significantly reduced, while the overall system throughput is also improved dramatically.

References