1 Introduction
It is well known that crossbars with analog-domain computing naturally boost the performance of vector-matrix multiplication (VMM), which is the major operation in existing neural networks (NNs) [1, 2, 3]. For this reason, crossbar architectures based on conventional memory devices (e.g. SRAM [4, 5, 6, 7] and Flash [8]) or emerging memory devices (e.g. RRAM [9, 10, 11, 1, 12, 13, 14], PCRAM [15, 16, 17], MRAM [18], etc.) are now widely used in neural network (NN) accelerators. A functional crossbar is a self-contained small NN with a weighted-connection crossbar and its peripheral units. Many functional crossbars are wired together by a 2D scalable routing infrastructure, which forms a massively parallel, so-called many-crossbar architecture. Based on a variety of inter-crossbar communication topologies, such as tree [19], 2D triangular toroidal mesh [20], or 2D XY mesh [5, 21], the many-crossbar architecture has demonstrated high performance on various NN benchmarks compared to traditional platforms.
Even though the many-crossbar architecture usually performs well on multi-layered perceptrons (MLPs) with dense VMM operations, its efficiency is compromised on convolutional neural networks (CNNs) due to the large amount of data reuse. To map CNNs onto the many-crossbar architecture, fully unfolding the reused data is a straightforward solution [22, 23]. Specifically, it unfolds the reused neuron activations and synaptic weights, and assigns independent resources for all these unfolded cells. It can achieve extremely high throughput compared to the fully-folded mapping [1] that reuses all the neurons and weights cycle by cycle; however, this scheme consumes a huge amount of crossbar resources. For example, more than 310 crossbars are needed to achieve comparable accuracy on the medium-scale CIFAR10 dataset [24] even when a network pruning approach is leveraged [22]; hundreds of thousands of crossbars [23] are often occupied for larger models [25, 26, 27] on the ImageNet dataset [28]. Although a compromised solution, termed semi-folded mapping [29], has emerged to strike a balance between execution throughput and resource overhead, the resource and resulting energy costs are still very high. This becomes a fatal blow for compact and energy-limited embedded devices.
Recently, many works on NN pruning seem promising for shrinking large models to reduce resource and energy consumption. In detail, various sparsity structures at the element grain [30, 31], element-group grain [32], vector grain [33, 34], and even channel grain [35, 36, 37, 38] have been proposed. On one side, although the sparse network obtained from channel-wise pruning can be reorganized as a compact dense model for any architecture, the accuracy loss is usually high due to the aggressive sparsity. On the other side, almost all the smaller-grain pruning works consider only the original logical computational graph or the execution graph on general-purpose platforms (e.g. CPU/GPU), rather than the execution graph after mapping onto the crossbar architecture. Although it is possible, and indeed widely used, to save energy by adding compute gating that benefits from the vast number of zeros in a sparse network, the crossbar itself still cannot be removed due to the residual irregular nonzeros, so its costly peripheral circuits must be retained. It is therefore difficult to efficiently leverage these fine-grained sparse models in practical crossbar devices. In this sense, reducing the number of crossbars as much as possible is the most effective way to save resource and energy costs.
With this guidance, previous work attempted crossbar-oriented pruning using an iterative clustering method [39], but it focused on the fully-connected (FC) layer. Effective pruning of Conv layers on crossbars remains a challenge: the Conv mapping is more complex than the intuitive FC mapping due to data reuse, and the pruning difficulty is increased by the lower redundancy.
Motivated by the above analysis, we try to answer an important yet untouched question: how many crossbars can we remove when mapping CNNs onto the crossbar architecture? To this end, a crossbar-aware pruning framework is proposed to minimize the resource overhead of CNNs on the crossbar architecture. Firstly, two crossbar-friendly sparsity grains, crossbar grain and column grain, are designed based on the semi-folded mapping method. The former sparsity is straightforward and easy to map, and the latter can be converted to crossbar-grain sparsity by recombining nonzero columns along the output dimension. Secondly, we formulate crossbar-aware pruning as an L0-norm constrained optimization problem, and propose an L0-norm constrained gradient descent (LGD) with a relaxant probabilistic projection (RPP) method to solve it. Thirdly, a reorder of input FMs is proposed to enhance the model accuracy. We conduct a comprehensive analysis of the resource-accuracy trade-off on various benchmarks, including the medium-scale CIFAR10 and large-scale ImageNet datasets with VGG and ResNet models. Overall, our crossbar-aware pruning framework is efficient for the crossbar architecture, reducing crossbar overhead by 44%-72% with acceptable accuracy degradation. This paper provides a new co-design solution for mapping CNNs onto various crossbar devices with significantly higher efficiency.
The contributions of this work are briefly summarized as follows:

We propose an effective pruning method for CNN implementation on crossbars, which can significantly reduce the resource overhead.

The LGD solver is able to control the resulting sparsity accurately, and the probabilistic projection helps improve the algorithm convergence.

The effectiveness of our method was evaluated on various datasets, including the largescale ImageNet.
The rest of this paper is organized as follows: Section 2 introduces the crossbar architecture and the mapping strategy; Section 3 explains the details of our crossbar-aware neural network pruning framework; technical analysis and performance evaluations are reported in Section 4; finally, Section 5 concludes the paper.
2 Crossbar architecture and CNN Mapping
Normally, the crossbar-based architecture is a hierarchical and scalable system. In such an architecture, a basic building block, called a functional crossbar (Func), consists of a crossbar and its peripheral circuits (e.g. read/write drivers, ADC/DAC, activation function and pooling circuits, timing scheduler, etc.) and can efficiently perform vector-matrix multiplication (VMM). As shown in Figure 1, the Func units are connected by a routing structure to communicate with each other.
2.1 Many-Crossbar Architecture
The key component in each Func is the crossbar array. The weight parameters are programmed into the conductances of the memory array. After this mapping initialization, the input neuronal activations are injected into the crossbar word lines, and an in-place vector-matrix multiplication (VMM) is performed by applying voltages to each row and reading currents from each column. Since crossbar computing is in the analog domain, many peripheral circuits are required for complete functionality. For example, write and read driving circuits are used to program the memory cells and read the output states, respectively. A digital-to-analog converter (DAC) and an analog-to-digital converter (ADC) switch the signal format between the crossbar's analog domain and the peripheral digital domain. In addition, extra digital circuits are required for the activation function, pooling operation, and timing schedule. Note that the memory cell has many variants, including conventional memory devices with specialized modification (e.g. SRAM [4, 5, 6, 7] and Flash [8]) and emerging non-volatile memory (NVM) devices with different material hierarchies and resistive mechanisms (e.g. RRAM [9, 10, 11, 1, 12, 13, 14], PCRAM [15, 16, 17], and MRAM [18]). Finally, each Func is equipped with a router for inter-crossbar communication, which guarantees scalability to a large-scale system.
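As a rough illustration of this dataflow, the following numpy sketch simulates one analog VMM with DAC/ADC quantization at the crossbar boundary. The bit widths, the unipolar input range, and the ideal (noise-free) device model are illustrative assumptions, not parameters from this paper.

```python
import numpy as np

def crossbar_vmm(x, G, dac_bits=8, adc_bits=8, v_max=1.0):
    """Simulate one analog VMM on a crossbar array.

    x: input activations (one per word line); G: conductance matrix
    (word lines x bit lines) programmed from the weights. The DAC/ADC
    steps model the conversions between the digital periphery and the
    analog crossbar core; device non-idealities are ignored.
    """
    # DAC: quantize inputs to 2^dac_bits levels over [0, v_max]
    levels = 2 ** dac_bits - 1
    v = np.round(np.clip(x, 0.0, v_max) / v_max * levels) / levels * v_max
    # Ohm's + Kirchhoff's laws: bit-line currents are I = G^T v
    i_out = G.T @ v
    # ADC: quantize the read-out currents back to the digital domain
    i_max = np.max(np.abs(i_out)) + 1e-12
    alevels = 2 ** adc_bits - 1
    return np.round(i_out / i_max * alevels) / alevels * i_max
```

With sufficiently high DAC/ADC resolution, the result approaches the exact product G^T x; lowering the bit widths exposes the quantization error that the peripheral converters introduce.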
Table 1. Variable definitions.
  N, i: number, index of input FMs
  G_in, u: number, index of input FM groups
  M, j: number, index of output FMs
  G_out, v: number, index of output FM groups
  S_u^in: set of the input FMs' indexes in the u-th input FM group
  K_in: number of FMs in each input FM group
  S_v^out: set of the output FMs' indexes in the v-th output FM group
  K_out: number of FMs in each output FM group
  W_ij: weight kernel connecting input FM i and output FM j
  β_uv: pruning mask value involving the u-th input and the v-th output FM group
  W_j: weight tensor of the j-th output FM
  α_v, β_v: pruning coefficient vector and mask vector involving the v-th output FM group
  X_i: the i-th input FM
  k, γ: L0 norm of β_v, relaxation factor in RPP
  C_ij: initial convolution result of X_i * W_ij
  W_v: set of the W_j involving the v-th output FM group
  P_uj: partial summation of the C_ij involving the u-th input FM group and the j-th output FM
  P_v: set of the P_uj involving the v-th output FM group
  Y_j: complete summation involving the j-th output FM
  T: number of image samples
  X̂: tensor representation of input FMs for linear regression
  X̂_v: concatenation of all tensors in X̂ involving the v-th output FM group
  Ŷ: tensor representation of output FMs for linear regression
  Ŷ_v: concatenation of all tensors in Ŷ involving the v-th output FM group
  *: convolution operation
  n_o: the size of an output FM
2.2 Semi-folded Mapping of Convolution Layer
Because the crossbar architecture can improve VMM performance, the FC layer benefits naturally. However, some neural network operations, such as convolution, cannot be directly deployed due to the large amount of data reuse. Conventionally, there are two schemes for convolution mapping on the crossbar architecture: fully-unfolded [22, 23] and fully-folded [1]. The former, widely used in the neuromorphic field, unfolds all the memory and computation and then transforms them into VMMs for crossbar execution. The latter assigns physical resources for only one sliding window and reuses these resources cycle by cycle until all sliding windows are finished. Overall, the fully-unfolded mapping consumes a large amount of resources to achieve high throughput, while the fully-folded mapping consumes plenty of clock cycles to achieve minimum resources; the resulting speed and overhead are greatly imbalanced. To address these issues, a semi-folded mapping [29] was recently proposed, which folds the physical resources along the row dimension of feature maps (FMs) for resource saving and simultaneously unfolds them along the column dimension to maintain parallelism. It can therefore balance performance and overhead to a great extent, but, as mentioned earlier, it still consumes a lot of resources. In this paper, we implement our pruning methodology on top of the semi-folded mapping scheme to further reduce the overhead.
Since the mapping of the FC layer is much easier, we focus on illustrating the Conv mapping in this section; note that it is quite easy to extend our method to the FC layer. The variable definitions are listed in Table 1. Figure 2(a) presents an example of a Conv layer where the numbers of input and output FMs are both four, i.e. N = M = 4. This is only an illustrative example; in fact the FM size can be much larger in semi-folded mapping. Specifically, the FM height has no limitation since the crossbar is reused along the FM row direction, and an FM with larger width can be split onto independent crossbars for execution parallelism. We use X_i (orange rectangles) and Y_j (green rectangles) to denote the input and output FM tensors, respectively. In a Conv layer, the weight W (blue rectangles) is a four-dimensional tensor. Each output FM Y_j is the convolution result between the input FM tensor and the weight filters W_j.
Figure 2(b) illustrates the semi-folded mapping of this example. We assume that the size of each crossbar is 12×4 (12 rows and 4 columns) for simplicity. According to the mapping results, we can divide the input and output FMs into two groups each, such that each input-output group pair occupies one crossbar. We use u and v to indicate the input and output group indexes (here G_in = G_out = 2). Each crossbar only generates the partial sums of its intra-group output FMs, and an additional accumulation of the results from the crossbars in one column yields the complete activations of the output FMs. In this example, four crossbars in total are required to implement the Conv layer.
Figure 2(c) shows the matrix representation of the crossbar mapping. Each crossbar realizes eight convolution sliding windows, where each sliding window covers a single input FM with one of its weight kernels, and each crossbar generates a set of partial sums corresponding to its output FM group. We present all the corresponding calculations involved in Figure 2 as follows:
P_uj = Σ_{i ∈ S_u^in} C_ij = Σ_{i ∈ S_u^in} X_i * W_ij    (1)
Y_j = Σ_{u=1}^{G_in} P_uj    (2)
P_v = {P_uj | j ∈ S_v^out},  Y_v = {Y_j | j ∈ S_v^out}    (3)
where P_uj accumulates the initial convolution results C_ij within the same input FM group, and the indexes of the input FMs in the u-th group are collected into the index set S_u^in. Furthermore, Y_j accumulates P_uj across all input FM groups to generate the complete sum corresponding to the j-th output FM. In short, the complete sum of each output FM is the summation of all partial sums from its corresponding crossbars. Note that the data sizes of P_uj and Y_j are identical and equal the output FM size n_o, which usually equals the FM height multiplied by the FM width. Based on the grouping of input FMs on crossbars, the initial convolution results C_ij shrink to the partial sums P_uj. Equation (3) organizes the partial-sum and complete-sum tensors of the v-th output FM group as tensor sets, where P_v represents all the partial sums from the (u, v)-th crossbars and Y_v the complete sums involving the v-th output FM group. The index set of the output FMs in the v-th output FM group is denoted as S_v^out.
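To make the grouping concrete, here is a minimal numpy sketch of this dataflow: per-group partial sums that are then accumulated into complete sums. Variable names follow our reading of the notation above, and the convolution is the usual CNN cross-correlation; it is an illustration, not the paper's mapping compiler.

```python
import numpy as np

def conv2d(x, w):
    """Plain 'valid' 2-D cross-correlation, as used in CNN Conv layers."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out

def partial_and_complete_sums(X, W, in_groups):
    """X: list of N input FMs; W[i][j]: kernel from input i to output j;
    in_groups: list of index sets S_u (one input FM group per crossbar).
    Returns P[u][j] (partial sums, one per crossbar) and Y[j] (complete
    sums accumulated over all input groups)."""
    M = len(W[0])
    P = [[sum(conv2d(X[i], W[i][j]) for i in S_u) for j in range(M)]
         for S_u in in_groups]
    Y = [sum(P[u][j] for u in range(len(in_groups))) for j in range(M)]
    return P, Y
```

By construction, summing the partial sums over all input groups reproduces exactly the dense convolution result for every output FM.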
3 Crossbar-aware Pruning Framework
In this section, we present a comprehensive analysis of how to explore crossbar-aware sparsity with two grain levels: crossbar-grain and column-grain. Then we formulate our pruning problem as an L0-norm constrained optimization problem and propose an L0-norm constrained gradient descent (LGD) to solve it. Finally, we introduce our input FM reorder method to improve the model accuracy of sparse networks.
3.1 Crossbar-grain and Column-grain Sparsity
In a Conv layer, the complete sum of one output FM is the summation of the partial sums produced by the convolutions between all input FMs and the corresponding weight filters. The sparsity analysis identifies which partial sums contribute less and prunes them; the remaining partial sums are then used to approximate the original output FMs. In this work, we design two pruning grains, crossbar-grain and column-grain sparsity, which provide usable sparsity for the crossbar architecture.
In the crossbar-grain sparsity, our goal is to eliminate crossbars whose partial sums contribute little to their output FM group. We design a binary pruning mask β_uv to indicate whether the (u, v)-th crossbar will be pruned (β_uv = 0) or not (β_uv = 1). Figure 3(a) shows an example of crossbar-grain pruning, in which the top-right crossbar and the bottom-left crossbar are pruned (marked as shadow). In detail, we initially have
Y_j = Σ_{u=1}^{G_in} P_uj    (4)
After pruning, each output FM is calculated from only the remaining input FM groups, i.e.
Y_j ≈ Σ_{u: β_uv = 1} P_uj,  j ∈ S_v^out    (5)
Then we can formulate the approximated function after pruning as follows
Ỹ_j = Σ_{u=1}^{G_in} β_uv P_uj,  j ∈ S_v^out    (6)
In contrast to the original dense accumulation, the accumulation of the partial sum from each crossbar is now controlled by the binary pruning mask β_uv. Each β_uv is shared by all partial sums in the same v-th output FM group, which usually resides in the same crossbar.
Although crossbar-grain sparsity can intuitively eliminate redundant crossbars, the output FM group after pruning may stray far from the original one if the dependency among the output FMs in the group is poor. Considering the resource-accuracy trade-off, a straightforward remedy is to shrink the output group size K_out, which in crossbar-grain sparsity is determined by the crossbar size. By reducing the number of output FMs in each output FM group, the dependency requirement among output FMs in the same group is mitigated. Thus, we further propose column-grain pruning to decrease the sparsifying grain for error reduction, as shown in Figure 3(b). Now each output FM group contains only one output FM, i.e. K_out = 1 and G_out = M. After pruning, each crossbar usually retains nonzero columns rather than being fully pruned as in crossbar-grain pruning. In the example, the first input FM group contributes to the first and the fourth output FM groups, so the first half of the columns in the top-left crossbar can be recombined with the last half of the columns in the top-right crossbar to form a new crossbar. The input of this new crossbar is still the first input FM group, whereas after recombination its outputs are the first and the fourth output FM groups. The crossbars involving the second input FM group can be shrunk by the same pruning and recombination method. Furthermore, if an output FM has multiple non-pruned crossbars receiving its input FMs, the inter-crossbar accumulation of partial sums is still required (Figure 3 omits this case for clarity). Usually, column-grain pruning achieves similar sparsity to crossbar-grain pruning but with significantly higher accuracy, since the pruning is more elaborate.
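The resource saving from column recombination can be sketched with a simple counting model. The crossbar geometry and the columns-per-output-FM figure are illustrative assumptions chosen to mimic the example above, not parameters fixed by the paper.

```python
import math

def crossbar_count(mask, xbar_cols, cols_per_fm):
    """mask[u][j] = 1 iff the crossbar columns computing output FM j from
    input FM group u survive column-grain pruning. Surviving column
    blocks that share the same input group u read the same rows, so they
    can be recombined and packed into full crossbars.
    Returns (dense_count, pruned_count)."""
    G_in, M = len(mask), len(mask[0])
    dense = G_in * math.ceil(M * cols_per_fm / xbar_cols)
    pruned = sum(math.ceil(sum(row) * cols_per_fm / xbar_cols)
                 for row in mask)
    return dense, pruned
```

With a Figure 3(b)-style setting (two input groups, four output FMs, 4-column crossbars, two columns per output FM), pruning half of the column blocks and recombining the survivors halves the crossbar count from four to two.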
3.2 L0-norm Constrained Optimization Problem
In the previous section we analyzed the sparsity for the crossbar architecture, in which we expect the summation of the remaining partial sums to approximate the original complete sums as closely as possible. For convenience, we concatenate all the complete sums in the v-th output FM group into a tensor, reshape it into a vector, and denote it as Y_v. P_v is obtained by similarly concatenating and reshaping all the partial-sum sets involving the v-th output FM group. In this section we set the number of image samples to 1 for clarity, i.e. T = 1. The objective function of the pruning optimization problem for the v-th output FM group can be described as
min_{β_v} ‖Y_v − P_v β_v‖_2^2,  s.t. ‖β_v‖_0 = k    (7)
where ‖β_v‖_0 is the L0 norm (the number of nonzero elements) of the binary pruning mask β_v, which determines the sparsity. The loss is the squared Euclidean distance between the complete sums Y_v and the sparsified partial sums P_v β_v. After transforming the pruning issue into an L0-norm constrained optimization problem, the objective is to find a binary pruning mask that satisfies the L0-norm constraint and minimizes the distance. However, the L0-norm constrained problem is NP-hard. In this work we propose an L0-norm constrained gradient descent (LGD) to solve it.
Before introducing the whole flow of LGD, we first discuss the relaxant probabilistic projection (RPP) method. RPP transfers a pruning coefficient vector α_v to a corresponding pruning mask β_v whose L0 norm is k (β_v is not binary at this stage). Naturally, the k largest elements in α_v indicate the largest contributions. However, this intuitive, deterministic selection is so dictatorial that it completely cuts off the opportunity of elements that fall outside the k largest but are still large enough. Therefore, we introduce a relaxation factor γ and a probabilistic projection method to determine the nonzero elements, inspired by [40]. The detailed process is shown in Algorithm 1. The initial candidate set contains more than k elements (its size is controlled by γ) instead of exactly k. RPP then iteratively selects elements from the candidates through probabilistic sampling until the L0 norm of β_v reaches k, with sampling probability proportional to the absolute value of α_v (i.e. |α_v|). At each iteration, the selected elements are removed from the candidate set.
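A minimal sketch of this projection step follows. Here the candidate-set size n_cand stands in for the effect of the relaxation factor γ; the exact parameterization and sampling schedule in Algorithm 1 may differ.

```python
import numpy as np

def rpp(alpha, k, n_cand, rng=None):
    """Relaxant probabilistic projection (sketch).

    Keeps exactly k nonzeros of the coefficient vector alpha, sampled
    one at a time from the n_cand (>= k) largest-|alpha| candidates with
    probability proportional to |alpha|; each selected element is
    removed from the candidate set. The result is k-sparse but not yet
    binary: the surviving entries keep their coefficient values.
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = list(np.argsort(-np.abs(alpha))[:n_cand])
    chosen = []
    while len(chosen) < k:
        p = np.abs(alpha)[candidates]
        p = p / p.sum()
        idx = rng.choice(len(candidates), p=p)
        chosen.append(candidates.pop(idx))
    beta = np.zeros_like(alpha)
    beta[chosen] = alpha[chosen]
    return beta
</n```

Because the sampling is probabilistic, an element outside the k largest (but inside the relaxed candidate set) can occasionally survive, which is exactly the relaxation discussed above.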
Based on RPP, we now explain the LGD for solving the L0-norm constrained optimization problem in Equation (7). The overall idea is to integrate RPP into normal gradient descent at every iteration. Algorithm 2 shows the whole process. The gradient descent is governed by
α_v^(t+1) = β_v^(t) − η ∇_{β_v} ‖Y_v − P_v β_v^(t)‖_2^2,  β_v^(t+1) = RPP(α_v^(t+1))    (8)
which is a modified version that frequently switches between the full space of α_v and the L0-norm constrained space of β_v. The space switching is implemented by the aforementioned RPP. Note that at each LGD iteration, β_v is scaled by a factor, obtained through linear regression, that minimizes the loss function in Equation (7). In this work, the number of LGD iterations is set to 50. At the end, we binarize all the elements in β_v, which generates the final binary pruning mask; β_uv = 0 means the crossbar (crossbar-grain) or the crossbar columns (column-grain) connecting the u-th input FM group and the v-th output FM group can be removed.
Figure 4 illustrates the LGD work flow and shows an RPP example. LGD starts with a randomly initialized β_v. In each outer loop, the pruning coefficient vector α_v is first calculated through gradient descent, and then the L0-norm constrained vector β_v is updated through RPP. The inner loop demonstrates how RPP works. Suppose the length of the coefficient vector is 6, k = 3, and the relaxed candidate set holds 4 elements. The rectangles with darker shade indicate larger absolute values. The two elements with the smallest absolute values are screened out at the beginning, and the remaining four elements form the candidate set. Through the probabilistic sampling strategy in RPP, two elements are sampled and removed from the candidate set at RPP iteration 0, and a third is selected from the remaining candidates at the next iteration. Because the L0 norm of β_v has then reached k = 3, RPP ends after two iterations in this example. Interestingly, an element can be finally selected even though an unselected candidate has a larger absolute value. This indicates that RPP gives the sub-largest elements some opportunity, which provides better convergence.
In a nutshell, the proposed LGD with RPP has several advantages: 1) it converts the NP-hard L0-norm constrained optimization problem into an approximately solvable one; 2) it accurately controls the final sparsity by tuning the value of k; 3) the relaxation factor γ provides better convergence by introducing probabilistic projection.
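The outer loop can be sketched as below. To keep the block self-contained, a deterministic top-k projection stands in for RPP (which would replace it with probabilistic sampling), and the step size, iteration count, and rescaling step are illustrative rather than the paper's exact Algorithm 2.

```python
import numpy as np

def lgd(P, y, k, iters=50, lr=None, seed=0):
    """Sketch of an L0-norm constrained gradient descent (LGD) loop.

    P: (n, G) matrix whose columns are flattened partial sums; y: the
    complete sum. Alternates a gradient step on the full-space
    coefficients alpha with a projection back to the k-sparse space
    (deterministic top-k here, in place of RPP), then rescales beta by
    a 1-D least-squares fit to y. Returns the binarized mask.
    """
    n, G = P.shape
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(G)
    # default step: inverse squared spectral norm, a safe gradient step
    lr = lr or 1.0 / (np.linalg.norm(P, 2) ** 2 + 1e-12)
    for _ in range(iters):
        grad = P.T @ (P @ beta - y)        # gradient of 0.5*||y - P b||^2
        alpha = beta - lr * grad           # full-space step
        keep = np.argsort(-np.abs(alpha))[:k]
        beta = np.zeros(G)
        beta[keep] = alpha[keep]           # projection to the k-sparse space
        z = P @ beta
        scale = (z @ y) / (z @ z + 1e-12)  # least-squares rescaling
        beta *= scale
    return (beta != 0).astype(int)         # binarize to the final mask
```

With the deterministic projection this reduces to iterative hard thresholding; on a well-conditioned problem it recovers the support of the contributing partial sums exactly.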
3.3 Input FMs Reorder
The crossbar-aware pruning considers the grouping effect of crossbars, but the above analysis does not consider how to group the input or output FMs. Recall that in Figure 3, the order of output FMs has no influence on the final results because they are independent. However, the order of input FMs matters because all the crossbar columns share the same input rows. In the sections above, we used the original order when mapping the input FMs. In this section, we design a reorder strategy that tunes the input FM order to reduce the pruning error.
The reorder of input FMs tries to increase the possibility that the more important input FMs are clustered into the same group. In this way, the impact on model accuracy of pruning the other, less important groups can be reduced. Usually, a larger partial sum indicates a larger contribution to the final complete sum of an output FM. The importance of each input FM X_i is identified by summing all the C_ij with the same i, which is calculated by
R_i = Σ_{j=1}^{M} Σ_{p,q} [C_ij]_{p,q}    (9)
Next, we reorder the input FMs according to these importance values. The subsequent pruning process is the same as that without reorder. Note that we do not take the absolute value of C_ij: if the largest positive and the smallest negative values (with similar absolute value) fall into the same crossbar, the crossbar output tends to be zero, i.e. little contribution, and using absolute values would make such large and small partial sums less distinguishable.
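The reorder step itself is a one-line sort on the signed importance scores; the sketch below assumes the C_ij results have already been collected from the pre-trained model.

```python
import numpy as np

def reorder_input_fms(C):
    """C[i][j]: initial convolution result (2-D array) of input FM i with
    kernel (i, j). The importance of input FM i sums all elements of all
    C[i][j] sharing the same i, keeping the sign as discussed above.
    Returns the input FM indexes in descending order of importance."""
    importance = np.array([sum(c.sum() for c in row) for row in C])
    return np.argsort(-importance)
```

Grouping then simply takes consecutive runs of the returned index order, so the most important FMs land in the same groups.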
The input FM reorder works well for the front layers; however, the deeper layers cannot benefit much from this technique. This might be because, in deep layers, each input FM has extracted high-level abstractions and has 'equal' importance. We provide a detailed discussion in Section 4.
3.4 Crossbar-aware Single-layer Level Pruning
Our pruning framework runs on a pre-trained model. For the single-layer level pruning in this section, all layers except the layer being pruned remain unchanged. Before applying the pruning methodology presented in Algorithm 3, the order of the input FM groups should be fixed (with or without reorder). The layer pruning is conducted group by group along the output FM dimension (from v = 1 to G_out). Suppose T images are sampled for pruning, and let P̄_v and Ȳ_v be the concatenated tensor representations over these samples of the partial sums and complete sums, respectively, involving the v-th output FM group. For each output FM group, we first reshape P̄_v and Ȳ_v into matrix and vector format, respectively. Then we run the LGD to generate the binary pruning mask for pruning the crossbars (crossbar-grain) or crossbar columns (column-grain). In the latter case, we need to recombine the nonzero crossbar columns to assemble new crossbars, as in Figure 3. Note that, as in many pruning works, a final linear regression is required to tune the weights [38, 35]. In particular, according to the pruning mask, we collect the required input and output data for linear regression as X̂_v and Ŷ_v, respectively. Then the remaining weights involving the v-th output FM group are tuned by
W_v = argmin_W ‖Ŷ_v − X̂_v * W‖_2^2    (10)
Note that the kernel size used in Algorithm 3 is the size of each weight kernel, e.g. 3×3.
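The final regression step is an ordinary least-squares fit of the surviving weights to the dense model's outputs. A minimal sketch, with an illustrative design-matrix layout:

```python
import numpy as np

def tune_weights(X_v, y_v):
    """Least-squares weight tuning after pruning (sketch).

    X_v: (samples, surviving_weights) design matrix collected from the
    retained inputs; y_v: complete sums recorded from the dense model.
    Solves min_w ||y_v - X_v w||^2, i.e. the regression that tunes the
    remaining weights of one output FM group. Names are illustrative.
    """
    w, *_ = np.linalg.lstsq(X_v, y_v, rcond=None)
    return w
```

When the retained inputs can still express the dense output exactly, the fit recovers the matching weights; otherwise it returns the best least-squares compensation for the pruned contributions.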
For clarity, Figure 5 compares the sparse patterns resulting from five different pruning strategies. In Figure 5(a)-(c), the matrix corresponds to the GEMM weight matrix after transforming the Conv operation into matrix-matrix multiplication [41, 42], rather than the crossbar grids in Figure 5(d)-(f). Although element-wise pruning [30, 31] and vector-wise pruning [33, 34] can produce unstructured and structured sparsity, respectively, they only consider the original logical computational graph or the execution graph on general-purpose platforms (e.g. CPU/GPU); therefore, they cannot be directly used by crossbar devices. The crossbar-grain pruning is straightforward for the crossbar architecture to use, but it suffers from larger accuracy degradation. The column-grain pruning produces sparsity in crossbar columns (not the matrix columns of vector-wise pruning), which improves the accuracy through its smaller pruning grain. By recombining the remaining nonzero crossbar columns, column-grain pruning is able to remove many redundant crossbars.
Note that the FC layer can also adopt our pruning framework to reduce crossbar overhead; an FC layer corresponds to a Conv layer with 1×1 FMs. Since the number of output neurons in an FC layer is large (e.g. 2048), we set a larger output group size K_out in this case instead of directly employing the column-grain sparsity used for Conv layers. The recombination of the nonzero columns is still used in FC layer pruning.
3.5 Crossbar-aware Network Level Pruning
A key principle in neural network pruning is to balance model sparsity against accuracy loss [36]. Distinguishing our work from previous efforts, we consider the trade-off between crossbar overhead and accuracy loss. Since each layer's tolerance to sparsity varies, different layers should adopt different pruning ratios (the sparsity of β) to remove as many redundant crossbars as possible while maintaining the model accuracy.
In network pruning we first try single-layer pruning (each time only one layer is pruned) under different pruning ratios (i.e. from 20% to 70% in 5% intervals) based on Algorithm 3. Each layer's tolerance to sparsity is then reflected in its accuracy-drop curve, from which we determine how many crossbars can be pruned according to a predefined accuracy-drop threshold. Specifically, for each layer we choose the pruning ratio whose accuracy drop is closest to the threshold (but larger than it).
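This per-layer initialization can be sketched as a simple selection over the measured sensitivity curves. The function name, data layout, and tie-breaking fallback below are illustrative assumptions, not the paper's exact procedure.

```python
def pick_ratios(drops, threshold):
    """Choose an initial per-layer pruning ratio from single-layer
    sensitivity curves.

    drops: {layer: {ratio: accuracy_drop}} measured by pruning one layer
    at a time; threshold: the predefined accuracy-drop threshold. For
    each layer, pick the ratio whose drop is closest to the threshold
    from above; if no ratio exceeds the threshold, every tested ratio is
    tolerable and the most aggressive one is kept.
    """
    chosen = {}
    for layer, curve in drops.items():
        above = {r: d for r, d in curve.items() if d > threshold}
        if above:
            chosen[layer] = min(above, key=lambda r: above[r] - threshold)
        else:
            chosen[layer] = max(curve)
    return chosen
```

The chosen ratios are only a starting point; the stopping conditions described next then refine them layer by layer.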
Based on the initial pruning ratios, we further design three conditions that finally stop the single-layer pruning and determine the pruning ratio for every layer: 1) the accuracy drop exceeds a threshold; 2) the pruning ratio exceeds a threshold; 3) the number of remaining crossbars falls below a threshold. Once any of them is satisfied, the single-layer pruning stops and the pruning ratio for that layer is fixed. The first condition is designed to maintain the final accuracy after pruning, while the latter two control the resource overhead of each layer: there is no need to continue pruning a layer whose remaining portion is already small. In a large network like VGG16, the crossbar overhead varies greatly among layers due to different layer sizes, so the pruning ratio alone cannot accurately reflect the crossbar overhead; we therefore introduce the additional crossbar-count threshold to jointly control the resource consumption. A detailed example can be found in Figure 10. We then prune the whole network layer by layer according to the finally determined pruning ratios of all layers. A fine-tuning step is required at the end to restore the network accuracy, as in previous work [30, 31].
4 Experimental Results
4.1 Experiment Setup
CNN Model. We validate our experiments on the CIFAR10 [24] and ImageNet [28] datasets, using three CNN models for crossbar-aware pruning: an eight-layer network (VGG8 [43, 44]) on CIFAR10, and VGG16 [26] and ResNet18 [27] on ImageNet.
Crossbar Platform. For a one-to-one comparison, we adopt the same semi-folded mapping compiler as the neural network chip in [29]. To reduce the development period and save fabrication cost, we use an off-the-shelf SRAM array with extra multipliers and accumulators to simulate the crossbar-based VMM engine, as in [5] and [6]. Please note that our crossbar-aware pruning is a general framework for any crossbar architecture.
Pruning Configuration. In the following experiments, 5000 images are sampled from the training dataset, with the same number of images per class. For each image, 10 points are sampled from each FM for the LGD, and this number is changed from 10 to 2 for the final linear regression.
4.2 Analysis of Singlelayer Level Pruning
In this section we use single-layer pruning results to analyze the influence of the group sizes K_out and K_in, the pruning ratio, and the input FM order. The conclusions drawn here guide the subsequent network pruning. Because the accuracy gap among different configurations could be narrowed by the fine-tuning technique, which would impede the sensitivity study, we omit the final fine-tuning step in this section.
Figure 6 shows the accuracy drop for two layers in VGG16 (Conv2_2 and Conv4_2) under different K_out and K_in settings and pruning ratios (input FMs are grouped with reorder). As analyzed in Section 3.1, decreasing K_out mitigates the dependency requirement among the output FMs in the same output FM group and thus reduces the pruning error. Apparently, the smallest accuracy drop in both Conv2_2 and Conv4_2 is produced by the column-grain sparsity (K_out = 1). This is consistent with our prediction in Section 3.1 that column-grain sparsity achieves higher accuracy than crossbar-grain sparsity (K_out > 1). For the smaller pruning ratios (50% and 60%), the accuracy difference between the settings is not obvious; in contrast, for the large pruning ratio (70%), larger K_out performs worse overall. This indicates that the influence of K_out is amplified at large pruning ratios. Generally speaking, smaller K_out often brings better accuracy. Similar to K_out, K_in also impacts the model accuracy: larger K_in usually suffers a larger accuracy drop due to the coarser pruning grain. However, different K_in brings very different resource overheads in the semi-folded mapping framework [29] and is deeply coupled with the mapping implementation. To simplify the problem and focus on our pruning framework, we initially determine K_in for each layer so as to minimize the crossbar overhead when mapping the dense model onto crossbars, and fix it during the subsequent pruning process. The configuration details can be found in previous work [29].
Since column-grain pruning clearly outperforms crossbar-grain pruning in model accuracy at the same sparsity, we focus on column-grain pruning in the following experiments; in other words, we set K_out = 1.
Next we study the influence of the input FM reorder. We select five layers in VGG16 to present the accuracy change brought by the reorder operation. The index reorder follows the descending order of input FM importance calculated by Equation (9). As shown in Figure 7, the input FM reorder significantly improves the accuracy in the front layers, especially for the large pruning ratio (70%), while in deep layers the improvement is degraded. As mentioned in Section 3.3, the reason might be that in deep layers each input FM has extracted high-level abstractions and has 'equal' importance.
In order to clearly visualize how the input FM reorder works, Figure 8 shows the contribution rate of input FM groups from three different layers in VGG16, with and without the reorder. Here the contribution rate of an input FM group is the percentage of output FM groups that use this input FM group, i.e. that keep at least one connection to it. The global pruning rate in this figure is fixed at 50%. For the original index order, the contribution rate of the input FM groups fluctuates around an even distribution, which means that no input FM group is clearly important or unimportant; this causes a larger pruning error. In stark contrast, the reorder strategy clusters the trivial and non-trivial FMs into different FM groups. This phenomenon is quite clear in the front layers. For example, in Conv1_2 the contribution rates of the middle input FM groups are much lower than those at the head and tail, which indicates that we successfully clustered the non-trivial input FMs at the two ends, so pruning the trivial middle input FM groups can still maintain the model accuracy well. However, as the layer goes deeper, most input FM groups tend to be equally important, even though the ones at the head and tail still have slightly higher contribution rates. As explained earlier, each FM in deep layers extracts a high-level abstraction, so all of them probably make a necessary contribution to the final recognition.
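The verbal definition above can be sketched directly in code (an illustration under assumed group counts, not the paper's implementation): given a binary matrix recording which output FM groups still use which input FM groups after pruning, the contribution rate is a column mean.

```python
import numpy as np

rng = np.random.default_rng(1)
N_OUT_GROUPS, N_IN_GROUPS = 8, 8   # hypothetical group counts for one layer

# use[i, j] is True when output FM group i keeps (uses) input FM group j
# after pruning, i.e. at least one connection between them survives.
use = rng.random((N_OUT_GROUPS, N_IN_GROUPS)) < 0.5

# Contribution rate of input FM group j: fraction of output FM groups using it.
contribution = use.mean(axis=0)

# A group with contribution 0 is used by no output group, so its crossbar
# columns can be removed entirely; a flat, even profile (as in deep layers)
# means every input FM group still matters to some output groups.
assert contribution.shape == (N_IN_GROUPS,)
```

A successful reorder would show low contribution rates concentrated in a few groups, which is exactly what Figure 8 exhibits for the front layers.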
4.3 Analysis of Network Level Pruning
In this section we conduct the pruning of the whole network. In our network pruning, the first Conv layer and the last FC layer are not pruned. These two layers serve as the first feature-extraction layer and the last classification layer, and pruning them renders a significant accuracy drop. Additionally, the resource cost of these two layers is negligible, so there is no need to prune them.
Before stepping into the network pruning strategy described in Section 3.5, we first discuss the variance of the sparsity tolerance across layers, as well as the relation between accuracy drop and crossbar overhead. Figure 9(a) shows the accuracy drop when separately pruning each layer of VGG8 under different pruning ratios. When the pruning ratio is below 70%, all layers keep the accuracy drop within 3%; the drop increases significantly when the pruning becomes more aggressive. Interestingly, different layers present different tolerance to sparsity. For example, Conv2, Conv6, and FC1 seem more robust than the other layers. Figure 9(b) depicts the accuracy drop and resource overhead when pruning the whole network under different pruning ratios. For simplicity, here we prune all layers with the same pruning ratio. The accuracy drop is controlled within 0.5% when the pruning ratio is below 50%; again, more aggressive pruning causes significantly increasing accuracy degradation. For the crossbar overhead, nearly linear resource saving is observed as the pruning ratio increases. Only the compute crossbars are reduced; the crossbars for input buffering and output reduction cannot be saved. More details can be found in the original semi-folded mapping paper [29]. Note that Figure 9(a) does not use the fine-tuning technique but Figure 9(b) does: the single-layer pruning aims at sensitivity analysis, so we expect a larger accuracy gap, while the network pruning always pursues higher accuracy for practical implementation.
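The per-layer scan behind Figure 9(a) can be sketched as a loop over layers and pruning ratios. The snippet below is a toy stand-in: it scores each layer by the L2 energy retained after column pruning rather than by the actual validation accuracy, and the layer names, shapes, and metric are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy stand-in for a trained network: one weight matrix per layer.
layers = {name: rng.standard_normal((64, 32)) for name in ["conv2", "conv6", "fc1"]}

def retained_energy(W, ratio):
    """Proxy for single-layer sensitivity: fraction of L2 energy kept after
    pruning the `ratio` fraction of columns with the smallest norms.
    (The real scan prunes the layer in the trained model and measures
    validation accuracy without fine-tuning, as in Figure 9(a).)"""
    norms = np.sort(np.linalg.norm(W, axis=0) ** 2)
    drop = int(len(norms) * ratio)
    return norms[drop:].sum() / norms.sum()

# Scan each layer separately under increasing pruning ratios.
scan = {name: [retained_energy(W, r) for r in (0.5, 0.7, 0.9)]
        for name, W in layers.items()}

for name, curve in scan.items():
    # More aggressive pruning removes monotonically more energy, mirroring
    # the growing accuracy drop in the sensitivity analysis.
    assert curve[0] >= curve[1] >= curve[2]
```

Layers whose proxy (or, in the real scan, accuracy) degrades slowly are the robust ones that can absorb a higher pruning ratio later.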
Table 2: Crossbar overhead of each layer in VGG8 under different pruning ratios (T: total crossbars, C: compute crossbars).

Pruning ratio    0% (dense)      10%             50%             90%
                 T      C        T      C        T      C        T      C
Conv2            176    128      164    116      124    76       72     32
Conv3            224    160      208    152      156    100      88     36
Conv4            604    512      570    478      376    284      168    76
Conv5            364    304      352    296      222    166      94     38
Conv6            698    592      682    576      424    318      182    76
Fc1              167    128      161    122      120    81       71     32
Total T          2484            2388            1673            926
Total C          1954            1870            1155            420
Accuracy         94.29%          94.27%          93.84%          90.18%
Table 3: Total and compute crossbar overhead of the whole-network pruning under different pruning configurations.

Network (dataset)     Total crossbars                        Compute crossbars
                      dense   case1   case2   optimized      dense   case1   case2   optimized
VGG16 (ImageNet)      25801   16428   14319   12749          21720   12403   10294   8724
ResNet18 (ImageNet)   7300    5148    5815    5107           4724    3718    3017    2634
VGG8 (CIFAR10)        2484    1320    1092    1056           1954    802     582     546
In addition to maintaining the model accuracy as high as possible after pruning, we also want to maximize the resource reduction. In Table 2, we select three pruning ratios to show the crossbar overhead of each layer. Conv6 can reduce 516 compute crossbars when the pruning ratio reaches 90%, whereas under the same pruning ratio Conv3 only saves 124 compute crossbars. From both the accuracy-drop (Figure 9(a)) and the resource-saving perspectives, network pruning can benefit more from Conv6. Instead of pruning each layer with the same pruning ratio, a higher pruning ratio in Conv6 and a lower one in Conv3 may achieve the same accuracy after pruning while reducing more resources.
Motivated by the above analysis, we introduce three stop conditions to determine the pruning ratio of each layer: a threshold on the accuracy drop, a cap on the pruning ratio, and a threshold on the marginal crossbar saving, as defined in Section 3.5. Figure 10 shows three example layers in VGG16 to explain how the selection works. For Conv3_1, the final pruning ratio is 60%: although neither the accuracy-drop threshold nor the pruning-ratio cap is exceeded, the crossbar-overhead threshold is reached, and continuing to prune would not gain significantly more resource reduction. Similarly, Conv4_2 selects a pruning ratio of 65% and Conv5_2 selects 55%, each stopped by one of the other two conditions.
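The per-layer selection described above can be sketched as a loop that steps the pruning ratio upward until one condition fires. The threshold names and values below are illustrative placeholders, not the paper's settings, and the accuracy/overhead lookups stand in for real single-layer measurements.

```python
def choose_ratio(acc_drop_at, xbars_at, ratios,
                 max_drop=0.02, max_ratio=0.7, min_xbar_gain=2):
    """Pick a per-layer pruning ratio by stepping up until one of three
    stop conditions fires (all thresholds here are hypothetical):
      1. the single-layer accuracy drop exceeds max_drop;
      2. the pruning ratio exceeds the cap max_ratio;
      3. one more step saves fewer than min_xbar_gain crossbars."""
    chosen = ratios[0]
    for prev, cur in zip(ratios, ratios[1:]):
        if acc_drop_at(cur) > max_drop:                     # condition 1
            break
        if cur > max_ratio:                                 # condition 2
            break
        if xbars_at(prev) - xbars_at(cur) < min_xbar_gain:  # condition 3
            break
        chosen = cur
    return chosen

# Hypothetical layer whose crossbar savings flatten out after 60%:
ratios = [0.5, 0.55, 0.6, 0.65, 0.7]
acc = {0.5: 0.003, 0.55: 0.005, 0.6: 0.008, 0.65: 0.012, 0.7: 0.03}
xb  = {0.5: 40, 0.55: 36, 0.6: 34, 0.65: 33, 0.7: 32}
r = choose_ratio(acc.get, xb.get, ratios)   # stops at 0.6 via condition 3
```

In this toy setting the layer stops at 60% because the step from 60% to 65% saves only one crossbar, mirroring how Conv3_1 is stopped by the crossbar-overhead condition.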
Figure 11 and Table 3 show the final results of the whole-network pruning with the optimized pruning configuration. Here 'case1' selects for each layer the pruning ratio whose accuracy drop in single-layer pruning stays within a tight threshold, and 'case2' uses a looser accuracy-drop threshold (set differently for the ImageNet networks and for VGG8). The 'optimized case' is the final pruning configuration determined by the three stop conditions for the accuracy-resource balance. In VGG8 (CIFAR10), since the network size is much smaller than that of the ImageNet models, there is no need to design the crossbar-overhead threshold. Thirty extra epochs are used to fine-tune the models on ImageNet (with the learning rate decayed once after the first 20 epochs), and 60 epochs are used for VGG8 on CIFAR10 (with the learning rate decayed after the first 30 epochs). In general, the optimized case reduces more resources, at the cost of more accuracy degradation. Finally, we save 59.8%, 44.2%, and 72.1% of the compute crossbars in VGG16, ResNet18, and VGG8, respectively. Furthermore, we find that ResNet18 is more sensitive to pruning than the VGG networks due to its compact structure.

5 Conclusion and Discussion
In this work we propose a crossbar-aware pruning framework to introduce usable sparsity in CNNs for crossbar-based architectures. We formulate the issue as a constrained optimization problem and design an L0-norm constrained gradient descent (LGD) algorithm to solve it. In LGD, a relaxant probabilistic projection (RPP) is heuristically leveraged to switch between the full space and the sparse space for producing the pruning mask. LGD can accurately control the sparsity and brings better convergence. In this way, we are able to achieve two pruning grains: crossbar grain and column grain. The former is straightforward and easy to use, while the latter obtains better accuracy at the same level of sparsity through a recombination of the nonzero crossbar columns. Furthermore, a reorder strategy for the input FMs is utilized to reduce the pruning error by clustering FMs with similar importance. Based on an optimized pruning configuration with three elaborate stop conditions, 44%-72% of the compute crossbars can finally be saved on three benchmarking networks over CIFAR10 and ImageNet. This work provides a new co-design solution for mapping CNNs onto various crossbar devices with significantly less resource overhead and energy consumption.
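One plausible sketch of the LGD step summarized above is shown below. The relaxation rule, pool size, and all hyperparameters are assumptions for illustration, not the paper's exact RPP formulation: the idea is simply a full-space gradient update followed by a probabilistically relaxed projection onto the k-sparse space.

```python
import numpy as np

rng = np.random.default_rng(3)

def rpp_mask(w, k, relax=0.1):
    """Relaxed L0 projection (illustrative, not the paper's exact RPP rule):
    keep k large-magnitude entries, but sample the survivors from a slightly
    larger candidate pool, so entries near the threshold can probabilistically
    re-enter the sparse space during training."""
    pool = min(len(w), int(k * (1 + relax)))        # relaxed candidate pool
    cand = np.argsort(np.abs(w))[-pool:]            # top-|pool| magnitudes
    keep = rng.choice(cand, size=k, replace=False)  # probabilistic selection
    mask = np.zeros_like(w, dtype=bool)
    mask[keep] = True
    return mask

# One LGD-style iteration: gradient descent in the full space, then project
# the weights back to the k-sparse space with the relaxed mask.
w = rng.standard_normal(64)
grad = rng.standard_normal(64)
w_full = w - 0.1 * grad             # full-space update
mask = rpp_mask(w_full, k=16)
w_sparse = np.where(mask, w_full, 0.0)
```

Alternating between the full-space update and this projection lets masked entries recover, which is why such schemes can control sparsity exactly while converging better than a one-shot hard prune.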
Although our crossbar-aware pruning framework is effective on all testing benchmarks, some limitations deserve further investigation. (1) The accuracy drop resulting from pruning deep layers cannot be significantly improved by the proposed reorder strategy; a more efficient reorder method is preferred to improve the tolerance to pruning. (2) We observe many additional crossbars that serve not for compute but for data buffering or activation reduction, which cannot be reduced in the current framework. In the future, joint optimization of the mapping and pruning frameworks is required for a better trade-off between resource overhead and model accuracy. (3) Although our method is applicable to any crossbar architecture, real device defects should be sufficiently considered for memristor-based devices (e.g. RRAM or PCRAM). For example, each device usually has a limited number of resistance states, which causes quantization error, and has state variation, which causes noise error. Figure 12 shows an additional analysis of the influence of quantization and variation on a pruned model, here VGG8 on CIFAR10 with 50% sparsity. The weight with noise is governed by w' = w·(1 + δ), where δ is a random variable obeying a zero-mean normal distribution whose standard deviation is the 'device variation' in the figure. Obviously, the accuracy degrades with larger variation and fewer quantized levels; in this case, a 0.1 variation or a 5-bit quantization already causes obvious accuracy loss. The accuracy can be recovered through fabrication improvement, as well as the design of the programming strategy and model retraining [45].

References
 [1] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: a novel processing-in-memory architecture for neural network computation in reram-based main memory,” in ACM SIGARCH Computer Architecture News, vol. 44, pp. 27–39, IEEE Press, 2016.
 [2] B. Li, Y. Shan, M. Hu, Y. Wang, Y. Chen, and H. Yang, “Memristor-based approximated computation,” in Proceedings of the 2013 International Symposium on Low Power Electronics and Design, pp. 242–247, IEEE Press, 2013.
 [3] M. Hu, H. Li, Q. Wu, and G. S. Rose, “Hardware realization of bsb recall function using memristor crossbar arrays,” in Proceedings of the 49th Annual Design Automation Conference, pp. 498–503, ACM, 2012.

 [4] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6t sram array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
 [5] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
 [6] L. Shi, J. Pei, N. Deng, D. Wang, L. Deng, Y. Wang, Y. Zhang, F. Chen, M. Zhao, S. Song, F. Zeng, G. Li, H. Li, and C. Ma, “Development of a neuromorphic computing system,” in 2015 IEEE International Electron Devices Meeting (IEDM), pp. 4.3.1–4.3.4, Dec 2015.
 [7] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
 [8] X. Guo, F. M. Bayat, M. Bavandpour, M. Klachko, M. Mahmoodi, M. Prezioso, K. Likharev, and D. Strukov, “Fast, energy-efficient, robust, and reproducible mixed-signal neuromorphic classifier based on embedded nor flash memory technology,” in Electron Devices Meeting (IEDM), 2017 IEEE International, pp. 6–5, IEEE, 2017.

 [9] D. Garbin, O. Bichler, E. Vianello, Q. Rafhay, C. Gamrat, L. Perniola, G. Ghibaudo, and B. DeSalvo, “Variability-tolerant convolutional neural network for pattern recognition applications based on oxram synapses,” in Electron Devices Meeting (IEDM), 2014 IEEE International, pp. 28–4, IEEE, 2014.
 [10] L. Deng, G. Li, N. Deng, D. Wang, Z. Zhang, W. He, H. Li, J. Pei, and L. Shi, “Complex learning in bio-plausible memristive networks,” Scientific Reports, vol. 5, p. 10684, 2015.

 [11] Y. Long, E. M. Jung, J. Kung, and S. Mukhopadhyay, “Reram crossbar based recurrent neural network for human activity detection,” in Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 939–946, IEEE, 2016.
 [12] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.
 [13] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, “Binary convolutional neural network on rram,” in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pp. 782–787, IEEE, 2017.

 [14] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pp. 541–552, IEEE, 2017.
 [15] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. Farinha, et al., “Equivalent-accuracy accelerated neural-network training using analogue memory,” Nature, vol. 558, no. 7708, p. 60, 2018.
 [16] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, et al., “Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element,” IEEE Transactions on Electron Devices, vol. 62, no. 11, pp. 3498–3507, 2015.
 [17] O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo, and C. Gamrat, “Visual pattern extraction using energy-efficient “2-pcm synapse” neuromorphic architecture,” IEEE Transactions on Electron Devices, vol. 59, no. 8, pp. 2206–2214, 2012.
 [18] D. Fan and S. Angizi, “Energy efficient in-memory binary deep neural network accelerator with dual-mode sot-mram,” in 2017 IEEE 35th International Conference on Computer Design (ICCD), pp. 609–612, IEEE, 2017.
 [19] P. Merolla, J. Arthur, R. Alvarez, J.-M. Bussat, and K. Boahen, “A multicast tree router for multichip neuromorphic systems,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 3, pp. 820–833, 2014.
 [20] E. Painkras, L. A. Plana, J. Garside, S. Temple, F. Galluppi, C. Patterson, D. R. Lester, A. D. Brown, and S. B. Furber, “Spinnaker: A 1-w 18-core system-on-chip for massively-parallel neural network simulation,” IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1943–1953, 2013.
 [21] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, et al., “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
 [22] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences, vol. 113, no. 41, pp. 11441–11446, 2016.
 [23] Y. Ji, Y. Zhang, W. Chen, and Y. Xie, “Bridge the gap between neural networks and neuromorphic hardware with a neural network compiler,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 448–460, ACM, 2018.
 [24] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
 [26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

 [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
 [29] L. Deng, L. Liang, G. Wang, L. Chang, X. Hu, X. Ma, L. Liu, J. Pei, G. Li, and Y. Xie, “Semimap: a semi-folded convolution mapping for speed-overhead balance on crossbars,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Submitted 2018.
 [30] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
 [31] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [32] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing dnn pruning to the underlying hardware parallelism,” in ACM SIGARCH Computer Architecture News, vol. 45, pp. 548–560, ACM, 2017.
 [33] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 [34] W. Wen, Y. He, S. Rajbhandari, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li, “Learning intrinsic sparse structures within long short-term memory,” arXiv preprint arXiv:1709.05027, 2017.
 [35] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in International Conference on Computer Vision (ICCV), vol. 2, 2017.
 [36] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
 [37] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.
 [38] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
 [39] A. Ankit, A. Sengupta, and K. Roy, “Trannsformer: Neural network transformation for memristive crossbar based neuromorphic system design,” in Proceedings of the 36th International Conference on ComputerAided Design, pp. 533–540, IEEE Press, 2017.
 [40] L. Deng, G. Li, J. Pei, and J. Huang, “L0 norm constraint based external control source allocation for the minimum cost control of directed networks,” ISA transactions, vol. 76, pp. 88–96, 2018.
 [41] M. Cho and D. Brand, “Mec: memory-efficient convolution for deep neural network,” arXiv preprint arXiv:1706.06873, 2017.
 [42] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al., “Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408, ACM, 2017.
 [43] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
 [44] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, “Gxnor-net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework,” Neural Networks, vol. 100, pp. 49–58, 2018.
 [45] A. Mohanty, X. Du, P.-Y. Chen, J.-S. Seo, S. Yu, and Y. Cao, “Random sparse adaptation for accurate inference with inaccurate multi-level rram arrays,” in Electron Devices Meeting (IEDM), 2017 IEEE International, pp. 6–3, IEEE, 2017.