Structured Compression by Unstructured Pruning for Sparse Quantized Neural Networks

05/24/2019
by   Se Jung Kwon, et al.

Model compression techniques, such as pruning and quantization, are becoming increasingly important to reduce memory footprints and the amount of computation. Despite the reduction in model size, however, achieving performance gains on devices is still challenging, mainly because of the irregular representations of sparse matrix formats. This paper proposes a new representation to encode the weights of Sparse Quantized Neural Networks, specifically those reduced by fine-grained, unstructured pruning methods. The representation is encoded in a structured, regular format that can be efficiently decoded through XOR gates during inference in a parallel manner. We demonstrate that various deep learning models can be compressed and represented by our proposed format with a fixed and high compression ratio. For example, for the fully-connected layers of AlexNet on the ImageNet dataset, we can represent the sparse weights by only 0.09 bits/weight for 1-bit quantization and a 91% pruning rate, with a fixed decoding rate and full memory bandwidth usage.

1 Introduction

Deep neural networks (DNNs) are evolving to solve increasingly complex and varied tasks with dramatically growing data sizes Goodfellow et al. [2016]. As a result, the growing model sizes of recent DNNs lead to slower response times and higher power consumption during inference Han et al. [2016a]. To mitigate such concerns, model compression techniques have been introduced to significantly reduce the model size of DNNs while maintaining reasonable model accuracy Goodfellow et al. [2016].

It is well known that DNNs are designed to be over-parameterized in order to ease local minima exploration Denil et al. [2013], Frankle and Carbin [2019]. Thus, various model compression techniques have been proposed for high-performance and/or low-power inference. For example, pruning techniques set redundant weights to zero without compromising accuracy LeCun et al. [1990], in order to achieve memory and computation reductions on devices Han et al. [2015], Molchanov et al. [2017], Zhu and Gupta [2017], Lee et al. [2018b]. As another model compression technique, non-zero weights can be quantized to fewer bits while maintaining accuracy comparable to full-precision parameters, as discussed in Courbariaux et al. [2015], Rastegari et al. [2016], Hubara et al. [2016], Xu et al. [2018].

To achieve even higher compression ratios, pruning and quantization can be combined to form Sparse Quantized Neural Networks (SQNNs). Intuitively, quantization can leverage parameter pruning: pruning reduces the number of weights to be quantized, and quantization loss decreases accordingly Lee and Kim [2018]. Deep compression Han et al. [2016b], ternary weight networks (TWN) Li and Liu [2016], trained ternary quantization (TTQ) Zhu et al. [2017], and Viterbi-based compression Ahn et al. [2019], Lee et al. [2018a] represent recent efforts to synergistically combine pruning and quantization.

To benefit from sparsity, it is important to (1) represent pruned models in a format with a small memory footprint and (2) implement fast computations on the resulting sparse matrices. Even if reduced SQNNs can be generated with a high pruning rate, it is challenging to gain a performance enhancement without an inherently parallel sparse-matrix decoding process during inference. Structured and block-based pruning techniques Li et al. [2017], Anwar et al. [2017], Yu et al. [2017], He et al. [2017], Ye et al. [2018] have been proposed to accelerate the decoding of sparse matrices using a reduced indexing space, as Figure 1 shows. However, the coarse-grained pruning associated with a reduced indexing space exhibits relatively lower pruning rates than unstructured pruning Mao et al. [2017], which masks weights at fine granularity in seemingly random locations. In conventional sparse matrix formats, because the locations of pruned weights are random, decoding time can vastly differ when different blocks are decoded simultaneously, as shown in the conventional approach of Figure 2.

To enable inherently parallel computations using sparse matrices, this paper proposes a new sparse format. Our main objective is to remove all pruned weights such that the resulting compression ratio tracks the pruning rate, while maintaining a regular format. Interestingly, proposals for test-data compression in VLSI testing have been developed from similar observations: there are many don't care bits (analogous to pruned weights in model compression) and their locations appear random Touba [2006], just as the locations of unstructurally pruned weights do. We adopt XOR gates, previously used for test-data compression, to decode the compressed bits at a fixed rate during inference, as shown in Figure 2. XOR gates are small enough that we can embed multiple XOR gates to fully utilize memory bandwidth and decode many sparse blocks concurrently. Correspondingly, we propose an algorithm to find the encoded and compressed data to be fed into the XOR gates as inputs.

Figure 1: Several types of pruning granularity. In the conventional sparse formats, as a sparse matrix becomes more structured to gain parallelism in decoding, pruning rate becomes lower in general.

Figure 2: Comparison between the conventional and proposed sparse matrix decoding procedures given a pruning mask. In the conventional approach, the number of decoding steps for each row can differ (i.e., degraded row-wise parallelism). In contrast, the proposed approach uses XOR-gate decompressors and decodes each row in a single step.

2 Compressed and Structural Representation

Test-data compression usually generates random numbers as outputs using the input data as seed values. The outputs (test data containing don't care bits) can be compressed successfully if they can be generated by the random number generator from at least one particular input seed (which becomes the compressed test data). It is well known that the memory reduction can be as high as the portion of don't care bits Bayraktaroglu and Orailoglu [2001], Touba [2006] if the randomness is good enough. Test-data compression and SQNNs with fine-grained pruning share the following properties: 1) parameter pruning induces don't care values in proportion to the pruning rate, and 2) if a weight is unpruned, then each quantization bit is assigned to 0 or 1 with equal probability Ahn et al. [2019].

2.1 Compression and Decompression with XOR gates

We use an XOR-gate network as a random number generator due to its simple design and strong compression capability (such a generator is not desirable for test-data compression because it requires too many input bits). Suppose that a real-number weight matrix is quantized into q binary matrices B_1, ..., B_q, with q as the number of bits for quantization. As the first step of our compression algorithm, we reshape each binary matrix into a 1D vector, which is then evenly divided into smaller vectors w of size n. Then, each of the evenly divided vectors w, including don't care bits, is encoded into a small vector x (of size k < n) without any don't care bits. Through the XOR gates, each compressed vector x is decoded into an n-bit vector consisting of correct care bits and randomly filled don't care bits with respect to w. The structure of the XOR gates is fixed during the entire process and, as depicted in Figure 3, can be described as a binary matrix M over the Galois field with two elements, GF(2), using the connectivity information between the input vector x (compressed weights) and the output vector. Note that M is pre-determined and simply designed so that each of its elements is randomly assigned to 0 or 1 with equal probability.

Figure 3: Given a fixed matrix M representing the structure of the XOR gates, compressing w is to solve Mx = w over GF(2), and decompression is to calculate the product of M and x over GF(2). To find x satisfying Mx = w, we solve reduced linear equations after removing the equations associated with don't care bits. Once x is obtained, don't care bits are randomly filled by the XOR gates during decompression as byproducts.

We intend to generate a random output vector Mx using a seed vector x while matching as many care bits of w as possible. In order to increase the number of successfully matched care bits, the XOR gates should be able to generate various random outputs. In other words, when the sizes of a seed vector and an output vector are given as k and n respectively, all possible outputs need to be well distributed in the output space.

Before discussing how to choose n and k, let us first study how to find a seed vector x, given M and w. As shown in Figure 3, the overall operation can be expressed by linear equations over GF(2). Note that the linear equations associated with don't care bits in w can be ignored, because the XOR decompressor may produce any bits in the locations of don't care bits. By deleting unnecessary linear equations, the original system can be simplified (e.g., with only 4 care bits on the right side of Figure 3). Given the pruning rate p, w contains n(1-p) care bits on average. Assuming that the equations are independent and non-trivial, the required number of seed inputs (k) can be as small as n(1-p), wherein the compression ratio becomes n/k = 1/(1-p). As a result, higher pruning rates lead to higher compression ratios. However, note that the linear equations may not have a solution when there are too many 'local' care bits or when there are conflicting equations for a given vector w.
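
To make the seed-finding step concrete, the short sketch below builds the reduced system for a single vector (keeping only the care-bit equations), solves it over GF(2) with Gaussian elimination, and checks that the decoded vector Mx reproduces every care bit. The sizes n = 64 and k = 16, the helper names, and the use of NumPy are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gf2_solve(A, b):
    """Return one solution of A x = b over GF(2), or None if the system is inconsistent."""
    A, b = A.copy() % 2, b.copy() % 2
    rows, cols = A.shape
    pivots, r = [], 0
    for c in range(cols):
        p = next((i for i in range(r, rows) if A[i, c]), None)
        if p is None:
            continue
        A[[r, p]], b[[r, p]] = A[[p, r]], b[[p, r]]   # move the pivot row up
        for i in range(rows):
            if i != r and A[i, c]:
                A[i] ^= A[r]                          # eliminate column c everywhere else
                b[i] ^= b[r]
        pivots.append(c)
        r += 1
    if any(b[r:]):                                    # zero row with nonzero right-hand side
        return None
    x = np.zeros(cols, dtype=np.uint8)
    x[pivots] = b[:r]                                 # free variables are left at 0
    return x

rng = np.random.default_rng(0)
n, k, p = 64, 16, 0.9                                 # output size, seed size, pruning rate (assumed)
M = rng.integers(0, 2, size=(n, k), dtype=np.uint8)   # fixed random XOR connectivity
w = rng.integers(0, 2, size=n, dtype=np.uint8)        # quantized bits of one vector
care = rng.random(n) > p                              # False = pruned (don't care)

x = gf2_solve(M[care], w[care])                       # keep only the care-bit equations
if x is not None:
    w_hat = (M @ x) % 2                               # decoding is a GF(2) matrix-vector product
    assert np.array_equal(w_hat[care], w[care])       # every care bit is reproduced
```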

2.2 Extra Patches for Lossless Compression

In order to keep our proposed SQNN representation lossless, we add extra bits to correct unavoidable errors, i.e., patching. An unsolvable system implies that the XOR random number generator cannot produce one or more matched care bits of w. To resolve such an occasion, we can replace one or more care bits of w with don't care bits to remove the conflicting linear equations, as depicted in Figure 4. We record the locations of these replacements as patch, which can be used to recover the original care bits of w by flipping the corresponding bits during decompression. For every x, num_patch indicates the number of replacements for the corresponding w. Since num_patch is always scanned prior to decompressing x, the same number of bits is reserved to represent num_patch for all compressed vectors in order to maintain a regular compressed format. On the other hand, the size of patch can differ for each x (overall parallelism is not disrupted by the different sizes because flipping occurs infrequently).
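
On the decoder side, applying the patch data is just a bit flip at the recorded positions after the XOR product; a minimal sketch (with an assumed toy size and hypothetical names) is shown below.

```python
import numpy as np

def decompress(M, x, patch):
    """Decode one vector: XOR-gate product over GF(2), then flip the patched positions."""
    w_hat = (M @ x) % 2          # fixed-rate decode through the XOR network
    w_hat[patch] ^= 1            # recover the care bits the XOR network could not match
    return w_hat

# Toy example: an 8-bit vector decoded from a 3-bit seed with one patch at position 5.
rng = np.random.default_rng(2)
M = rng.integers(0, 2, size=(8, 3), dtype=np.uint8)
x = np.array([1, 0, 1], dtype=np.uint8)
print(decompress(M, x, patch=[5]))
```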

Figure 4: For every x, additional num_patch and patch data are attached to match all care bits. In the example of Figure 3, M is only a small matrix, and the patch data sequence is longer than the XOR input sequence. As explained later in detail, however, the data required for patches decreases at a faster rate as the size of k increases.

At the expense of num_patch and patch, our compression technique can reproduce all care bits of w and, therefore, does not affect the accuracy of DNN models and obviates retraining. In sum, the compressed format includes 1) the vectors x_i (1 ≤ i ≤ N) compressed from a weight matrix through M, 2) num_patch for each x_i, and 3) patch for each x_i. Hence, the resulting compression ratio is

C = \frac{N \cdot n}{N\left(k + \lceil \log_2(\max_i \mathit{num\_patch}_i + 1) \rceil\right) + \lceil \log_2 n \rceil \sum_{i=1}^{N} \mathit{num\_patch}_i}    (1)

where num_patch_i is the number of patches for the i-th vector and N is the number of compressed vectors. Improving C is enabled by increasing n/k and decreasing the amount of patches. We introduce a heuristic patch-searching algorithm to reduce the number of patches while also optimizing n and k.
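
To make the bookkeeping in Eq. (1) concrete, the sketch below tallies the compressed size from per-vector patch counts, assuming a fixed-width num_patch field sized by the worst-case count and one log2(n)-bit index per patched position; the function name and example numbers are illustrative.

```python
import math

def compression_ratio(n, k, patch_counts):
    """Estimate the compression ratio C of one binary matrix under the format above.

    n            -- length of each decompressed vector w
    k            -- length of each compressed seed vector x
    patch_counts -- number of patched (flipped) bits for each of the N vectors
    """
    N = len(patch_counts)
    original_bits = N * n
    num_field = math.ceil(math.log2(max(patch_counts) + 1))  # fixed-width num_patch field
    loc_field = math.ceil(math.log2(n))                      # bits to address one patched position
    compressed_bits = N * (k + num_field) + loc_field * sum(patch_counts)
    return original_bits / compressed_bits

# Example: 500 vectors of length n = 200 with k = 20 and a few patches per vector.
print(compression_ratio(n=200, k=20, patch_counts=[2, 0, 1, 3] * 125))
```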

2.3 Experiments Using Synthetic Data

input : masking vector m, weight vector w, XOR matrix M
output : x, num_patch, patch
1:  E = empty matrix over GF(2) whose column size is k+1
2:  for i = 1 to n do
3:     if m_i is 1 then // If a parameter is not pruned
4:        Append a row (M_i, w_i) to E
5:        E' = make_rref(E)
6:        if E'.is_solved() is False then
7:           Remove the last row (M_i, w_i) from E
8:        end if
9:     end if
10:  end for
11:  Solve the linear equations in E to find x
12:  Compute w' = Mx
13:  Compare w' with w to produce num_patch and patch
14:  Return x, num_patch, patch
Algorithm 1 Patch-Searching Algorithm
(a) Memory reduction by applying Algorithm 1 to a random weight matrix with 10,000 elements for various n (pruning rate p = 0.9, k = 20)

(b) Memory reduction using various k. Pruning rate is 0.90 and k ranges from 12 to 60. Each line is stopped when the memory reduction begins to fall.

(c) Graphs of memory reduction using k = 20 (red line) and the ideal bound 1/(1-p) (blue line). The gap between the two graphs is reduced at higher pruning rates.
Figure 5: Experimental results using synthetic data and Algorithm 1

An exhaustive search of patches requires exponential-time complexity even though such a method minimizes the number of patches. Algorithm 1 is a heuristic patch-searching algorithm that uses the masking information indicating which bits of w are don't cares. The algorithm incrementally enlarges the system of linear equations by including a care-bit equation only when the enlarged system is still solvable. Note that make_rref() in Algorithm 1 generates a reduced row-echelon form to quickly verify that the linear equations are solvable. If adding a certain care bit makes the equations unsolvable, then a don't care bit takes its place, and num_patch and patch are updated accordingly. Although Algorithm 1 yields more replacements of care bits than an exhaustive search (by up to 10% in our extensive experiments), our simple algorithm runs in polynomial time and is much faster.
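
A minimal Python sketch of Algorithm 1 is given below. It keeps the accepted care-bit equations in reduced row-echelon form over GF(2) (playing the role of make_rref), accepts a new equation only if it does not introduce a conflict, reads the seed x off the reduced system, and reports the unmatched care bits as patches. Names, sizes, and the NumPy-based implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def patch_search(mask, w, M):
    """Greedy Algorithm-1-style search over GF(2): returns x, num_patch, patch."""
    n, k = M.shape
    rref = []                                   # accepted augmented rows [a | b], kept reduced
    pivots = []                                 # pivot column of each stored row
    for i in range(n):
        if not mask[i]:                         # pruned weight -> don't care, no equation
            continue
        row = np.append(M[i], w[i]).astype(np.uint8)
        for r, c in zip(rref, pivots):          # reduce the new row against the current RREF
            if row[c]:
                row ^= r
        nz = np.flatnonzero(row[:k])
        if nz.size:                             # independent, consistent equation: accept it
            c = int(nz[0])
            for r in rref:                      # keep earlier rows fully reduced as well
                if r[c]:
                    r ^= row
            rref.append(row)
            pivots.append(c)
        # else: row[:k] is all zero; if row[k] == 1 the equation conflicts and this
        # care bit is dropped here (it will re-appear as a patch below).
    x = np.zeros(k, dtype=np.uint8)
    for r, c in zip(rref, pivots):              # read the solution off the RREF (free vars = 0)
        x[c] = r[k]
    w_hat = (M @ x) % 2
    patch = [i for i in range(n) if mask[i] and w_hat[i] != w[i]]   # unmatched care bits
    return x, len(patch), patch

rng = np.random.default_rng(7)
n, k, p = 200, 20, 0.9                          # sizes assumed to match the synthetic experiments
M = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
w = rng.integers(0, 2, size=n, dtype=np.uint8)
mask = (rng.random(n) > p).astype(np.uint8)     # 1 = care bit (unpruned)
x, num_patch, patch = patch_search(mask, w, M)
```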

To investigate the compression capability of our proposed scheme, we evaluate a large random weight matrix of 10,000 elements in which each element becomes a don't care bit with probability 0.9 (= sparsity or pruning rate). If an element is a care bit, then 0 or 1 is assigned with equal probability. Notice that the random locations of don't care bits are a feature of fine-grained pruning methods, and assigning 0 or 1 to weights with equal probability is obtainable using well-balanced quantization techniques Ahn et al. [2019], Lee et al. [2018a].

When k is fixed, the optimal n maximizes the memory reduction offered by Algorithm 1. Figure 5(a) plots the corresponding memory reduction (= C in Eq. (1)) on the right axis and the amounts of x and num_patch + patch data on the left axis across a range of n values when k = 20. From Figure 5(a), it is clear that there exists a trade-off between the size of x and the sizes of num_patch and patch. Increasing n rapidly reduces the x data while num_patch and patch grow gradually. The highest memory reduction is achieved when n is almost 200, which agrees with the observation that maximum compression is constrained by the relative number of care bits Touba [2006]. Consequently, the resulting compression ratio approaches 1/(1-s), where s is the sparsity.

Given the relationship above, we can now optimize k. Figure 5(b) compares memory reduction for various n across different values of k. The resulting trend suggests that a higher k yields more memory reduction. This is because increasing the number of bits used as seed values for the XOR-gate random number generator enables a larger solution space and, as a result, fewer patches are needed as n increases. The large solution space is especially useful when don't care bits are not evenly distributed throughout w. Lastly, k is constrained by the maximum computation time available to run Algorithm 1.

Figure 5(c) presents the relationship between pruning rate and memory reduction when k = 20, sweeping n. Since high pruning rates translate to fewer care bits and relatively fewer patches, Figure 5(c) confirms that the memory reduction approaches 1/(1-p) as the pruning rate p increases. In other words, maximizing the pruning rate is key to compressing quantized weights with a high compression ratio. In comparison, ternary quantization usually induces a lower pruning rate Zhu et al. [2017], Li and Liu [2016]. Our proposed representation is best implemented by pruning first to maximize the pruning rate and then quantizing the weights.

3 Experiments on various SQNNs

In this section, we show experimental results of the proposed representation on four popular datasets: MNIST, ImageNet [Russakovsky et al., 2015], CIFAR10 [Krizhevsky, 2009], and Penn Tree Bank [Marcus et al., 1994]. Though the compression ratio ideally reaches 1/(1-p), the actual results may fall short, because don't care bits can be less evenly distributed than in the synthetic data used in Section 2. Hence, we suggest several additional techniques in this section to handle uneven distributions.

3.1 Experimental Results

Weights are pruned using the mask layer generated by binary-index matrix factorization Lee et al. [2019] after pre-training, and the network is then retrained. Quantization is performed by following the technique proposed in Lee et al. [2018b] and Kapoor et al. [2019], where quantization-aware optimization is performed based on the quantization method from Xu et al. [2018]. The number of bits per weight required by our method is compared with the case of q-bit quantization plus an additional 1 bit indicating the pruning index (e.g., ternary quantization consists of 1-bit quantization and a 1-bit pruning indication per weight).
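
To illustrate how pruning masks and multi-bit quantization produce the binary matrices B_i that our compressor consumes, the sketch below applies an unstructured mask and then performs a simplified greedy residual binarization; this stands in for, but is not, the alternating multi-bit quantization of Xu et al. [2018] or the factorized binary index of Lee et al. [2019].

```python
import numpy as np

def prune_and_quantize(W, mask, q):
    """Zero out masked weights, then greedily fit q binary bit-planes with scales.

    Simplified greedy residual binarization used only to illustrate the data layout
    (q binary matrices B_i plus scales); not the alternating optimization of Xu et al.
    """
    W = W * mask                                      # unstructured mask: 1 = keep, 0 = prune
    residual = W.copy()
    bit_planes, scales = [], []
    for _ in range(q):
        B = np.where(residual >= 0, 1, -1)            # sign bits of the current residual
        alpha = np.abs(residual[mask == 1]).mean()    # scale fitted on unpruned weights only
        bit_planes.append(((B + 1) // 2).astype(np.uint8))  # store as {0,1} bit-planes;
        scales.append(alpha)                                # pruned positions are don't cares
        residual = residual - alpha * B * mask
    return bit_planes, scales

rng = np.random.default_rng(3)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = (rng.random((64, 64)) > 0.9).astype(np.float32)   # ~90% unstructured pruning
bit_planes, scales = prune_and_quantize(W, mask, q=2)    # the B_i fed to the XOR compressor
```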

 

Model              | Dataset  | Size                     | Pre-trained Acc.       | Pruning Rate | Quant. Bits | Acc. after Pruning and Quantization
LeNet5 (FC1)       | MNIST    | 800×500                  | 99.1%                  | 0.95         | 1-bit       | 99.1%
AlexNet (FC5, FC6) | ImageNet | 9K×4K (FC5), 4K×4K (FC6) | 80.3% (T5), 57.6% (T1) | 0.91         | 1-bit       | 79.6% (T5), 55.9% (T1)
ResNet32           | CIFAR10  | 460.76K                  | 92.5%                  | 0.70         | 2-bit       | 91.6%
LSTM               | PTB      | 6.41M                    | 89.6 PPW               | 0.60         | 2-bit       | 93.9 PPW
 

Table 1: Descriptions of the models compressed by our proposed method. The model accuracy after parameter pruning and quantization is obtained using binary-index matrix factorization Lee et al. [2019] and alternating multi-bit quantization Xu et al. [2018].

Figure 6: The number of bits required to represent each weight for the models in Table 1 using our proposed SQNN format. (A) denotes the number of bits for the pruning index (compressed by the binary-index matrix factorization introduced in Lee et al. [2019]). (B) indicates the number of bits for the quantized weights under our proposed compression technique. Overall, we gain an additional 2-11× memory footprint reduction depending on sparsity.

We first tested our representation using the LeNet-5 model on MNIST. LeNet-5 consists of two convolutional layers and two fully-connected layers Han et al. [2015], Lee et al. [2018b]. Since the FC1 layer dominates the memory footprint (93%), we focus only on the FC1 layer, whose parameters can be pruned by 95%. With our proposed method, the FC1 layer is effectively represented by only 0.19 bits per weight, which is far smaller than what ternary quantization requires, as Figure 6 shows. We also tested our proposed compression techniques on a large-scale model and dataset, namely, AlexNet [Krizhevsky et al., 2012] on the ImageNet dataset. We focused on compressing the FC5 and FC6 fully-connected layers, which occupy 90% of the total model size of AlexNet. Both layers are pruned at a pruning rate of 91% [Han et al., 2015] using binary-index matrix factorization Lee et al. [2019] and compressed with 1-bit quantization. The high pruning rate lets us compress the quantized weights down to about 0.09 bits per weight. Overall, the FC5 and FC6 layers of AlexNet require only 0.28 bits per weight, which is substantially less than the 2 bits per weight required by ternary quantization.
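
As a rough consistency check for the quantized-weight portion (assuming the patch overhead is negligible), the ideal bound from Section 2 matches the figure reported in the abstract:

\[
\text{bits per weight} \;\approx\; q\,(1 - p) \;=\; 1 \times (1 - 0.91) \;=\; 0.09 .
\]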

We further verify our compression techniques using ResNet32 He et al. [2016] on the CIFAR10 dataset with a baseline accuracy of 92.5%. The model is pruned and quantized to 2 bits, reaching 91.6% accuracy after retraining. Further compression with our proposed SQNN format yields 1.22 bits per weight, while 3 bits would be required without our proposed compression techniques.

An RNN model with one LSTM layer of size 300 Xu et al. [2018] on the PTB dataset, with performance measured by perplexity per word (PPW), is compressed by our representation. Note that the embedding and softmax matrices usually take up a major portion of the memory footprint because of the growing vocabulary size in a neural language model, and these two matrices have several distinguishing properties compared with general weight matrices Chen et al. [2018]. In particular, skewed word frequencies in the vocabulary cause inconsistently distributed don't care bits inside the embedding and softmax matrices. In this experiment, we compress the embedding and softmax matrices by leveraging the randomness in the LSTM layer, shuffling the quantized weights using an interleaving algorithm. The resulting compressed model, with pruning and 2-bit quantization, requires only 1.67 bits per weight.

For various types of layers, our proposed technique, supported by weight sparsity, provides additional compression over ternary quantization. Compression ratios can be further improved by using more advanced pruning and quantization methods (e.g., Wang et al. [2018], Guo et al. [2016]) since the principles of our compression methods do not rely on the specific pruning and quantization methods used.

3.2 Techniques to Reduce num_patch

If k is large enough, patching overhead should ideally not disrupt parallel decoding. However, even for large k, num_patch may increase considerably when the pruning rate is nonuniform over a wide range within a matrix, especially at lower pruning rates. A large num_patch then leads not only to a degraded compression ratio compared with the synthetic-data experiments, but also to deteriorated parallelism in the decoding process. The following techniques can be considered to reduce num_patch.

Blocked Assignment: The compression ratio in Eq. (1) of Section 2.2 is affected by the maximum of num_patch. Note that a particular vector w may have an exceptionally large number of care bits. In such a case, even if a quantized matrix consists mostly of don't care bits and few patches are needed, all of the compressed vectors must employ a large number of bits to track the number of patches. To mitigate this problem and enhance the compression ratio C, we divide a binary matrix into several blocks, and the maximum num_patch is then obtained in each block independently. A different bit-width for num_patch is assigned to each block to reduce the overall data size.
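
A small sketch of the blocked bookkeeping is shown below: each block reserves only as many header bits for num_patch as its own worst-case count requires, so one outlier vector no longer inflates the header width of the entire matrix (the block size and counts are made up for illustration).

```python
import math

def num_patch_header_bits(patch_counts, block_size):
    """Total bits spent on the per-vector num_patch field, with one width per block."""
    total = 0
    for start in range(0, len(patch_counts), block_size):
        block = patch_counts[start:start + block_size]
        width = math.ceil(math.log2(max(block) + 1))   # width driven by the block's own maximum
        total += width * len(block)
    return total

# One outlier vector with 40 patches forces a wide field everywhere if the matrix
# is treated as a single block, but only within its own block otherwise.
counts = [1, 0, 2, 1] * 256
counts[100] = 40
print(num_patch_header_bits(counts, block_size=len(counts)))  # one global width
print(num_patch_header_bits(counts, block_size=64))           # per-block widths
```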

Minimizing num_patch for Small k: One simple patch-minimizing algorithm is to list all 2^k possible inputs x and the corresponding outputs through M in memory and to find the particular x that minimizes the number of patches. At the cost of high space complexity and memory consumption, such an exhaustive optimization guarantees a minimized num_patch. A k below 30 is a practical range.
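
For small k, the exhaustive option can be written directly, as in the brute-force sketch below: enumerate all 2^k seeds, decode each through M, and keep the seed with the fewest unmatched care bits (sizes and names are illustrative, and the approach is practical only for small k).

```python
import numpy as np
from itertools import product

def best_seed_exhaustive(M, w, mask):
    """Try every possible seed x and return the one with the fewest unmatched care bits."""
    n, k = M.shape
    best_x, best_patches = None, n + 1
    for bits in product((0, 1), repeat=k):                 # all 2^k candidate seeds
        x = np.array(bits, dtype=np.uint8)
        mismatches = int(np.sum(((M @ x) % 2 != w) & (mask == 1)))
        if mismatches < best_patches:
            best_x, best_patches = x, mismatches
    return best_x, best_patches

rng = np.random.default_rng(4)
n, k = 32, 12
M = rng.integers(0, 2, size=(n, k), dtype=np.uint8)
w = rng.integers(0, 2, size=n, dtype=np.uint8)
mask = (rng.random(n) > 0.9).astype(np.uint8)              # 1 = care bit
x, num_patch = best_seed_exhaustive(M, w, mask)
```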

Interleaver: We can occasionally observe that a large group of weights is pruned or unpruned altogether because of the unique properties of embedding and softmax layers compared with typical layers. Interleavers and deinterleavers, widely used in digital communications Morelos-Zaragoza [2006], are useful for such uneven distributions of don't care bits. An interleaver intermixes weights in a random manner and a deinterleaver recovers the original locations of the weights. Designing interleavers and deinterleavers with low encoding/decoding complexity for model compression would be an interesting research topic.
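
A minimal interleaver can be a fixed pseudorandom permutation shared by the compressor and the decompressor, as sketched below; practical interleavers from coding theory (e.g., block interleavers) are cheaper in hardware, and the seed-based permutation here is only an illustrative assumption.

```python
import numpy as np

def interleave(bits, seed=0):
    """Shuffle bit positions with a fixed pseudorandom permutation."""
    perm = np.random.default_rng(seed).permutation(bits.size)
    return bits[perm], perm

def deinterleave(shuffled, perm):
    """Restore the original bit order."""
    original = np.empty_like(shuffled)
    original[perm] = shuffled
    return original

bits = np.random.default_rng(5).integers(0, 2, size=1024, dtype=np.uint8)
shuffled, perm = interleave(bits)
assert np.array_equal(deinterleave(shuffled, perm), bits)
```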

4 Related Works and Comparison

In this section, we introduce two previous approaches to representing sparse matrices. Table 2 compares the CSR format, the Viterbi-based index format, and our proposed format.

Compressed Sparse Row (CSR): Deep compression Han et al. [2016b] utilizes the Compressed Sparse Row (CSR) format to reduce memory footprint on devices. Unfortunately, CSR presents irregular data structures not readily supported by highly parallel computing systems such as CPUs and GPUs Lee et al. [2018a]. Due to uneven sparsity among rows, the computation time of CSR-based algorithms is limited by the least sparse row Zhou et al. [2018], as illustrated in Figure 2. Although Han et al. [2016a] suggested hardware support via a large buffer to improve load balancing, performance is still determined by the lowest pruning rate of a particular row. In contrast, our scheme provides a perfectly structured format of weights after compression such that high parallelism is maintained.
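
The load-imbalance issue can be seen with a few lines of NumPy: under unstructured pruning the number of nonzeros per row varies widely, so a row-parallel CSR kernel waits for the densest row (the matrix size and uniform-random mask below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
mask = rng.random((512, 512)) > 0.9             # ~10% of weights survive unstructured pruning
nnz_per_row = mask.sum(axis=1)                  # CSR work per row

print("mean nonzeros/row:", nnz_per_row.mean())
print("max nonzeros/row: ", nnz_per_row.max())  # the slowest row gates row-parallel decoding
```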

Viterbi Approaches: Viterbi-based compression Lee et al. [2018a], Ahn et al. [2019] attempts to compress pruning-index data and quantized weights with a fixed compression ratio by exploiting don't care bits, similar to our approach. Quantized weights can be compressed by using the Viterbi algorithm to obtain a sequence of inputs to be fed into Viterbi encoders (one bit per cycle). Because only one bit is accepted by each Viterbi encoder, only an integer (= the number of Viterbi encoder outputs) is permitted as a compression ratio, while our proposed scheme allows any rational number (= n/k).

Because only one bit per cycle is used as the input to each Viterbi encoder, Viterbi-based approaches require large hardware resources. For example, if a memory allows a bandwidth of 1024 bits per cycle, then 1024 Viterbi encoders are required, where each Viterbi encoder entails multiple flip-flops to support sequence detection. On the other hand, our proposed scheme is resource-efficient in supporting large memory bandwidth because flip-flops are unnecessary and increasing k is limited only by the time complexity of Algorithm 1.

 

                            | CSR Format                           | Viterbi-based Compr.      | Proposed
Encryption                  | No                                   | Yes                       | Yes
Load Balance                | Uneven                               | Even                      | Even
Decoding Rate               | Variable                             | Fixed                     | Fixed
Parallelism                 | Limited by Uneven Sparsity           | Number of Decoders        | Number of Decoders
Memory Access Pattern       | Irregular (Gather-Scatter)           | Regular                   | Regular
Compressed Memory Bandwidth | Depends on on-chip Buffer Structure  | 1 bit/decoder             | k bits/decoder
HW Resource for a Decoder   | Large Buffer to improve load balance | XOR gates and Flip-Flops  | XOR gates only

 

Table 2: Comparisons of CSR, Viterbi, and our proposed representation. For all of these representation schemes, compression ratio is upper-bounded by sparsity.

5 Conclusion

This paper proposes a compressed representation for Sparse Quantized Neural Networks based on an idea used for test-data compression. Through XOR gates and solving linear equations over GF(2), we can remove most don't care bits, and the quantized model is further compressed in proportion to its sparsity. Since our representation provides a regular compressed-weight format with a fixed and high compression ratio, SQNNs enable not only memory footprint reduction but also inference performance improvement due to inherently parallelizable computations.

References

  • Ahn et al. [2019] D. Ahn, D. Lee, T. Kim, and J.-J. Kim. Double Viterbi: Weight encoding for high compression ratio and fast on-chip reconstruction for deep neural network. In International Conference on Learning Representations (ICLR), 2019.
  • Anwar et al. [2017] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
  • Bayraktaroglu and Orailoglu [2001] I. Bayraktaroglu and A. Orailoglu. Test volume and application time reduction through scan chain concealment. In Proceedings of the 38th Annual Design Automation Conference, 2001.
  • Chen et al. [2018] P. Chen, S. Si, Y. Li, C. Chelba, and C.-J. Hsieh. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.
  • Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
  • Denil et al. [2013] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. Predicting parameters in deep learning. In Advances in neural information processing systems, pages 2148–2156, 2013.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Guo et al. [2016] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
  • Han et al. [2015] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • Han et al. [2016a] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254, 2016a.
  • Han et al. [2016b] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016b.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • He et al. [2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • Hubara et al. [2016] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: training neural networks with low precision weights and activations. arXiv:1609.07061, 2016.
  • Frankle and Carbin [2019] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • Kapoor et al. [2019] P. Kapoor, D. Lee, B. Kim, and S. Lee. Computation-efficient quantization method for deep neural networks, 2019. URL https://openreview.net/forum?id=SyxnvsAqFm.
  • Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • LeCun et al. [1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • Lee and Kim [2018] D. Lee and B. Kim. Retraining-based iterative weight quantization for deep neural networks. arXiv:1805.11233, 2018.
  • Lee et al. [2018a] D. Lee, D. Ahn, T. Kim, P. I. Chuang, and J.-J. Kim. Viterbi-based pruning for sparse matrix with fixed and high index compression ratio. In International Conference on Learning Representations (ICLR), 2018a.
  • Lee et al. [2018b] D. Lee, P. Kapoor, and B. Kim. DeepTwist: Learning model compression via occasional weight distortion. arXiv:1810.12823, 2018b.
  • Lee et al. [2019] D. Lee, S. J. Kwon, P. Kapoor, B. Kim, and G.-Y. Wei. Network pruning for low-rank binary indexing. arXiv:1905.05686, 2019.
  • Li and Liu [2016] F. Li and B. Liu. Ternary weight networks. arXiv:1605.04711, 2016.
  • Li et al. [2017] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
  • Mao et al. [2017] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
  • Marcus et al. [1994] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. ISBN 1-55860-357-3. doi: 10.3115/1075812.1075835. URL https://doi.org/10.3115/1075812.1075835.
  • Molchanov et al. [2017] D. Molchanov, A. Ashukha, and D. P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
  • Morelos-Zaragoza [2006] R. H. Morelos-Zaragoza. The art of error correcting coding. John Wiley & Sons, 2nd edition, 2006.
  • Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  • Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Touba [2006] N. A. Touba. Survey of test vector compression techniques. IEEE Design & Test of Computers, 23:294–303, 2006.
  • Wang et al. [2018] P. Wang, X. Xie, L. Deng, G. Li, D. Wang, and Y. Xie. HitNet: Hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems, 2018.
  • Xu et al. [2018] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Ye et al. [2018] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.
  • Yu et al. [2017] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 548–560, 2017.
  • Zhou et al. [2018] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. Cambricon-s: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 15–28. IEEE, 2018.
  • Zhu et al. [2017] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.
  • Zhu and Gupta [2017] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017.