1 Introduction
Deep neural networks (DNNs) are evolving to solve increasingly complex and varied tasks with dramatically growing data sizes Goodfellow et al. [2016]. As a result, the growing size of recent DNN models leads to slower response times and higher power consumption during inference Han et al. [2016a]. To mitigate such concerns, model compression techniques have been introduced to significantly reduce the model size of DNNs while maintaining reasonable model accuracy Goodfellow et al. [2016].
It is well known that DNNs are designed to be overparameterized in order to ease local minima exploration Denil et al. [2013], Jonathan Frankle [2019]. Thus, various model compression techniques have been proposed for high-performance and/or low-power inference. For example, pruning techniques set redundant weights to zero without compromising accuracy LeCun et al. [1990], in order to achieve memory and computation reduction on devices Han et al. [2015], Molchanov et al. [2017], Zhu and Gupta [2017], Lee et al. [2018b]. As another model compression technique, non-zero weights can be quantized to fewer bits with accuracy comparable to full-precision parameters, as discussed in Courbariaux et al. [2015], Rastegari et al. [2016], Hubara et al. [2016], Xu et al. [2018].
To achieve even higher compression ratios, pruning and quantization can be combined to form Sparse Quantized Neural Networks (SQNNs). Intuitively, quantization can leverage parameter pruning: pruning reduces the number of weights to be quantized, and quantization loss decreases accordingly Lee and Kim [2018]. Deep compression Han et al. [2016b], ternary weight networks (TWN) Li and Liu [2016], trained ternary quantization (TTQ) Zhu et al. [2017], and Viterbi-based compression Ahn et al. [2019], Lee et al. [2018a] represent recent efforts to synergistically combine pruning and quantization.
To benefit from sparsity, it is important to (1) represent pruned models in a format with a small memory footprint and (2) implement fast computations based on sparse matrices. Even if compact SQNNs can be generated with a high pruning rate, it is challenging to gain a performance enhancement without an inherently parallel sparse-matrix decoding process during inference. Structured and block-based pruning techniques Li et al. [2017], Anwar et al. [2017], Yu et al. [2017], He et al. [2017], Ye et al. [2018] have been proposed to accelerate the decoding of sparse matrices using a reduced indexing space, as Figure 1 shows. However, such coarse-grained pruning exhibits relatively lower pruning rates than unstructured pruning Mao et al. [2017], which masks weights at fine granularity in effectively random locations. In conventional sparse-matrix formats, due to the random locations of pruned weights, decoding times can vary widely when different blocks are decoded simultaneously, as shown in the conventional approach of Figure 2.
To enable inherently parallel computations on sparse matrices, this paper proposes a new sparse format. Our main objective is to remove all pruned weights such that the resulting compression ratio tracks the pruning rate, while maintaining a regular format. Interestingly, proposals for test-data compression in VLSI testing have been developed from similar observations: there are many don't care bits (corresponding to pruned weights in model compression), and the locations of those don't care bits appear random Touba [2006], just as the locations of unstructurally pruned weights do. We adopt XOR gates, previously used for test-data compression, to decode the compressed bits at a fixed rate during inference, as shown in Figure 2. XOR gates are small enough that we can embed multiple XOR gates to fully utilize memory bandwidth and decode many sparse blocks concurrently. Correspondingly, we propose an algorithm to find the encoded and compressed data to be fed into the XOR gates as inputs.
2 Compressed and Structural Representation
Test-data compression usually generates random numbers as outputs, using the input data as seed values. The outputs (test data containing don't care bits) can be compressed successfully if they can be generated by the random number generator from at least one particular input seed (which becomes the compressed test data). It is well known that the memory reduction can be as high as the fraction of don't care bits Bayraktaroglu and Orailoglu [2001], Touba [2006] if the randomness is good enough. Test-data compression and SQNNs with fine-grained pruning share the following properties: 1) parameter pruning induces don't care values in proportion to the pruning rate, and 2) if a weight is unpruned, each of its quantization bits is 0 or 1 with equal probability Ahn et al. [2019].

2.1 Compression and Decompression with XOR gates
We use an XOR-gate network as a random number generator due to its simple design and strong compression capability (such a generator is not desirable for test-data compression because it requires too many input bits). Suppose that a real-number weight matrix is quantized into q binary matrices B_1, ..., B_q, with q as the number of bits for quantization. As the first step of our compression algorithm, we reshape each binary matrix into a 1-D vector, which is then evenly divided into smaller vectors w of size n_o. Each of these vectors w, including don't care bits, is encoded as a small vector x (of size n_i) without any don't care bits. Through the XOR gates, each compressed vector x is decoded into an n_o-bit vector consisting of correct care bits and randomly filled don't care bits with respect to w. The structure of the XOR gates is fixed during the entire process and, as depicted in Figure 3, can be described as a binary matrix M over the Galois field with two elements, GF(2), using the connectivity information between the input vector x (compressed weights) and w. Note that M is predetermined and simply designed such that each element is randomly assigned 0 or 1 with equal probability.

We intend to generate a random output vector Mx from a seed vector x while matching as many care bits of w as possible. In order to increase the number of successfully matched care bits, the XOR gates should be able to generate various random outputs. In other words, when the sizes of a seed vector and an output vector are given as n_i and n_o respectively, all possible outputs need to be well distributed in the solution space.
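As a concrete sketch (not the paper's implementation; the block sizes and RNG seed below are arbitrary assumptions), the fixed XOR-gate decoder can be modeled as a binary matrix-vector product over GF(2):

```python
import numpy as np

rng = np.random.default_rng(0)
n_o, n_i = 16, 6  # illustrative output (block) size and seed size

# Fixed XOR connectivity: M[j, k] = 1 iff seed bit k feeds the j-th output XOR.
M = rng.integers(0, 2, size=(n_o, n_i), dtype=np.uint8)

def xor_decompress(M, x):
    """Decode a seed x into an n_o-bit block; each output bit is a mod-2 sum."""
    return (M @ x) % 2

x = rng.integers(0, 2, size=n_i, dtype=np.uint8)
w = xor_decompress(M, x)
print(w)  # n_o bits decoded at a fixed rate, regardless of sparsity
```

Because the connectivity M is fixed, every block decodes in the same number of steps, which is what makes many-block parallel decoding straightforward.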
Before discussing how to choose n_o and n_i, let us first study how to find a seed vector x, given w. As shown in Figure 3, the overall operation can be expressed by linear equations over GF(2). Note that linear equations associated with don't care bits in w can be ignored, because the XOR decompressor may produce any bits in the locations of don't care bits. By deleting unnecessary linear equations, the original system can be simplified (e.g., to only 4 care bits on the right side of Figure 3). Given the pruning rate S, w contains (1 - S)·n_o care bits on average. Assuming that the remaining equations are independent and non-trivial, the required number of seed inputs n_i can be as small as (1 - S)·n_o, wherein the compression ratio n_o/n_i becomes 1/(1 - S). As a result, higher pruning rates lead to higher compression ratios. However, note that the linear equations may not have a solution when there are too many 'local' care bits or there are conflicting equations for a given vector w.
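To make the seed-finding step concrete, here is a small Gaussian-elimination solver over GF(2) (a generic sketch; the sizes and care-bit positions below are assumed for illustration). Only the rows of the XOR matrix at care-bit positions are kept:

```python
import numpy as np

def solve_gf2(A, b):
    """Return x with A @ x = b (mod 2), or None if inconsistent.

    Free variables are set to 0. A: (m, n) 0/1 matrix, b: length-m 0/1 vector.
    """
    A, b = A.copy().astype(np.uint8), b.copy().astype(np.uint8)
    m, n = A.shape
    piv_cols, r = [], 0
    for c in range(n):
        pivots = [i for i in range(r, m) if A[i, c]]
        if not pivots:
            continue
        p = pivots[0]
        A[[r, p]], b[[r, p]] = A[[p, r]], b[[p, r]]  # swap pivot row up
        for i in range(m):
            if i != r and A[i, c]:
                A[i] ^= A[r]  # eliminate column c everywhere else
                b[i] ^= b[r]
        piv_cols.append(c)
        r += 1
    # Inconsistent if an all-zero row must equal 1 (conflicting equations).
    if any(b[i] and not A[i].any() for i in range(r, m)):
        return None
    x = np.zeros(n, dtype=np.uint8)
    for i, c in enumerate(piv_cols):
        x[c] = b[i]
    return x

# Keep only the equations for care bits, then solve for the seed.
rng = np.random.default_rng(1)
M = rng.integers(0, 2, size=(16, 8), dtype=np.uint8)
care_pos = [0, 3, 7, 12]                         # unpruned (care) positions
care_val = np.array([1, 0, 1, 1], dtype=np.uint8)
x = solve_gf2(M[care_pos], care_val)
if x is not None:
    assert np.array_equal((M[care_pos] @ x) % 2, care_val)
```

A `None` result corresponds to the conflicting-equation case discussed above, which is what the patching mechanism of the next subsection resolves.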
2.2 Extra Patches for Lossless Compression
In order to keep our proposed SQNN representation lossless, we add extra bits to correct unavoidable errors, i.e., patching. An unsolvable system implies that the XOR random number generator cannot produce one or more matched care bits of w. To resolve such a case, we can replace one or more care bits of w with don't care bits to remove conflicting linear equations, as depicted in Figure 4. We record the locations of the replacements as p, which can be used to recover the original care bits of w by flipping the corresponding bits during decompression. For every x, n_p indicates the number of replacements made for the corresponding w. Since n_p is always scanned prior to decompressing x, the same number of bits is reserved to represent n_p for all compressed vectors in order to maintain a regular compressed format. On the other hand, the size of p can differ for each x (overall parallelism is not disrupted by the differing sizes because flipping occurs infrequently).
At the expense of n_p and p, our compression technique can reproduce all care bits of w and, therefore, does not affect the accuracy of DNN models and obviates retraining. In sum, the compressed format includes 1) the vectors x compressed from a weight matrix through M, 2) the patch counts n_p, and 3) the patch locations p. Hence, for a weight matrix of N weights quantized to q bits, the resulting compression ratio is

C = (q·N) / [ (q·N/n_o)·(n_i + b_p) + ⌈log₂ n_o⌉·Σᵢ n_p⁽ⁱ⁾ ],   (1)

where n_p⁽ⁱ⁾ is the number of patches of the i-th compressed vector and b_p = ⌈log₂(maxᵢ n_p⁽ⁱ⁾ + 1)⌉ is the fixed number of bits reserved for each patch count. Improving C is enabled by increasing n_o/n_i and decreasing the amount of patches. We introduce a heuristic patch-searching algorithm to reduce the number of patches while also optimizing n_o and n_i.

2.3 Experiments Using Synthetic Data
An exhaustive search of patches requires exponential-time complexity, even though such a method minimizes the number of patches. Algorithm 1 is a heuristic patch-searching algorithm that uses the masking information corresponding to the don't care bits of w. The algorithm incrementally enlarges the system of equations by including one care bit at a time, only when the enlarged system is still solvable. Note that make_rref() in Algorithm 1 generates a reduced row-echelon form to quickly verify that the linear equations are solvable. If adding a certain care bit makes the equations unsolvable, then a don't care bit takes its place, and n_p and p are updated accordingly. Although Algorithm 1 yields more replacements of care bits than an exhaustive search (by up to 10% in our extensive experiments), it runs in polynomial time, which is much faster.

To investigate the compression capability of our proposed scheme, we evaluate a large random weight matrix of 10,000 elements, where each element becomes a don't care bit with probability 0.9 (= sparsity, or pruning rate). If an element is a care bit, then 0 or 1 is assigned with equal probability. Notice that randomness in the locations of don't care bits is a feature of fine-grained pruning methods, and assignment of 0 or 1 to weights with equal probability is obtainable using well-balanced quantization techniques Ahn et al. [2019], Lee et al. [2018a].
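The end-to-end idea — pick a seed, patch the care bits the XOR network cannot reproduce, and flip them back after decoding — can be sketched with a brute-force seed search. This is feasible only for tiny seed sizes and is not Algorithm 1 itself (which checks solvability via a reduced row-echelon form); all sizes below are assumed:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_o, n_i = 12, 5  # tiny illustrative sizes; practical n_i would be far larger
M = rng.integers(0, 2, size=(n_o, n_i), dtype=np.uint8)

def compress(block):
    """Pick the seed matching the most care bits; record mismatches as patches.

    block entries: 0 or 1 for care bits, -1 for don't care (pruned) bits.
    """
    care = block >= 0
    best = None
    for bits in itertools.product((0, 1), repeat=n_i):
        x = np.array(bits, dtype=np.uint8)
        mism = np.flatnonzero(care & ((M @ x) % 2 != block))
        if best is None or len(mism) < len(best[1]):
            best = (x, mism)
    return best  # (seed, patch locations)

def decompress(x, patch_locs):
    out = (M @ x) % 2
    out[patch_locs] ^= 1  # flip recorded locations to restore care bits
    return out

# 80% of positions pruned (don't care), rest random 0/1 care bits.
block = np.where(rng.random(n_o) < 0.8, -1, rng.integers(0, 2, n_o))
x, patches = compress(block)
care = block >= 0
assert np.array_equal(decompress(x, patches)[care], block[care])  # lossless
```

The final assertion is the lossless-compression property: every care bit survives the round trip, while don't care positions are filled with whatever the XOR network happens to produce.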
When n_i is fixed, the optimal n_o maximizes the memory reduction offered by Algorithm 1. Figure 4(a) plots the corresponding memory reduction on the right axis and the amounts of seed data and patch data on the left axis across a range of n_o values for a fixed n_i. From Figure 4(a), it is clear that there exists a trade-off between the size of x and the sizes of n_p and p. Increasing n_o rapidly reduces the seed overhead per weight, while n_p and p grow gradually. The highest memory reduction is achieved when n_o is almost 200, which agrees with the observation that maximum compression is constrained by the relative number of care bits Touba [2006]. Consequently, the resulting compression ratio approaches 1/(1 - S), where S is the sparsity.
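This trade-off can be sanity-checked numerically. In the sketch below (all sizes and patch counts are assumed, not values from the paper), each block stores its seed bits plus a fixed-width patch count, and each patch additionally costs one location index:

```python
import math

# Assumed sizes, for illustration only (not values from the paper).
N, q = 10_000, 1            # number of weights, quantization bits
n_o, n_i = 200, 24          # block (output) size and seed size
blocks = q * N // n_o
patches = [1 if i % 4 == 0 else 0 for i in range(blocks)]  # hypothetical counts

b_p = math.ceil(math.log2(max(patches) + 1))  # fixed bits per patch count
loc_bits = math.ceil(math.log2(n_o))          # bits per patch location

compressed_bits = blocks * (n_i + b_p) + loc_bits * sum(patches)
ratio = (q * N) / compressed_bits
print(f"compression ratio ~ {ratio:.1f}x")  # approaches n_o/n_i as patches vanish
```

With patches rare, the ratio is dominated by n_o/n_i, which is why the text ties the achievable compression to the sparsity 1/(1 - S).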
Given the relationship above, we can now optimize n_i. Figure 4(b) compares memory reduction for various n_i across different values of n_o. The resulting trend suggests that higher n_i yields more memory reduction. This is because increasing the number of bits used as seed values for the XOR-gate random number generator enables a larger solution space and, as a result, fewer patches are needed as n_i increases. The large solution space is especially useful when don't care bits are not evenly distributed throughout w. Lastly, n_i is constrained by the maximum computation time available to run Algorithm 1.
Figure 4(c) presents the relationship between pruning rate and memory reduction for a fixed n_i while sweeping n_o. Since high pruning rates translate to fewer care bits and relatively fewer patches, Figure 4(c) confirms that memory reduction approaches 1/(1 - S) as n_o increases. In other words, maximizing the pruning rate is key to compressing quantized weights with a high compression ratio. In comparison, ternary quantization usually induces a lower pruning rate Zhu et al. [2017], Li and Liu [2016]. Our proposed representation is therefore best applied by pruning first to maximize the pruning rate and then quantizing the weights.
3 Experiments on various SQNNs
In this section, we show experimental results of the proposed representation on four popular datasets: MNIST, ImageNet Russakovsky et al. [2015], CIFAR10 Krizhevsky [2009], and Penn Tree Bank Marcus et al. [1994]. Though the compression ratio ideally reaches 1/(1 - S), the actual results may fall short, because don't care bits can be less evenly distributed than in the synthetic data used in Section 2. Hence, we suggest several additional techniques in this section to handle uneven distributions.
3.1 Experimental Results
Weights are pruned by the mask layer generated by binary-index matrix factorization Lee et al. [2019] after pretraining, and the model is then retrained. Quantization is performed following the technique proposed in Lee et al. [2018b] and Kapoor et al. [2019], where quantization-aware optimization is performed based on the quantization method from Xu et al. [2018]. The number of bits per weight required by our method is compared with q-bit quantization plus an additional 1-bit pruning index (e.g., ternary quantization consists of 1-bit quantization and a 1-bit pruning indicator per weight).
[Table 1 — per-model results: Target Model | Pretrained | Pruning and Quantization]
We first tested our representation using the LeNet-5 model on MNIST. LeNet-5 consists of two convolutional layers and two fully-connected layers Han et al. [2015], Lee et al. [2018b]. Since the FC1 layer dominates the memory footprint (93%), we focus only on the FC1 layer, whose parameters can be pruned by 95%. With our proposed method, the FC1 layer is effectively represented by only 0.19 bits per weight, which is smaller than ternary quantization, as Figure 6 shows. We also tested our proposed compression techniques on a large-scale model and dataset, namely AlexNet Krizhevsky et al. [2012] on the ImageNet dataset. We focused on compressing the FC5 and FC6 fully-connected layers, which occupy 90% of the total model size of AlexNet. Both layers are pruned at a rate of 91% Han et al. [2015] using binary-index matrix factorization Lee et al. [2019] and compressed with 1-bit quantization. The high pruning rate lets us compress the quantized weights substantially. Overall, the FC5 and FC6 layers of AlexNet require only 0.28 bits per weight, which is substantially less than the 2 bits per weight required by ternary quantization.
We further verify our compression techniques using ResNet32 He et al. [2016] on the CIFAR10 dataset with a baseline accuracy of 92.5%. The model is pruned and quantized to 2 bits, reaching 91.6% accuracy after retraining. Further compression with our proposed SQNN format yields 1.22 bits per weight, while 3 bits would be required without our proposed compression techniques.
An RNN model with one LSTM layer of size 300 Xu et al. [2018] on the PTB dataset, with performance measured by perplexity per word, is compressed by our representation. Note that the embedding and softmax matrices usually take up a major portion of the memory footprint because of the large vocabulary size in a neural language model, and these two matrices have several properties that distinguish them from general weight matrices Chen et al. [2018]. In particular, skewed word frequencies in the vocabulary cause inconsistently distributed don't care bits inside the embedding and softmax matrices. In our experiment, we therefore compress the embedding and softmax matrices after shuffling the quantized weights with an interleaving algorithm, which restores the randomness our scheme relies on. The resulting compressed model, with pruning and 2-bit quantization, requires only 1.67 bits per weight.

For various types of layers, our proposed technique, supported by weight sparsity, provides additional compression over ternary quantization. Compression ratios can be further improved by using more advanced pruning and quantization methods (e.g., Wang et al. [2018], Guo et al. [2016]), since the principles of our compression methods do not rely on the specific pruning and quantization methods used.
3.2 Techniques to Reduce n_p
If n_o is large enough, the patching overhead ideally should not disrupt parallel decoding. However, even for large n_o, when pruning rates are non-uniform over a wide range within a matrix, especially with lower pruning rates, n_p may increase considerably. A large n_p then leads not only to a degraded compression ratio compared with the synthetic-data experiments, but also to deteriorated parallelism in the decoding process. The following techniques can be considered to reduce n_p.
Blocked Assignment: The compression ratio in Eq. (1) is affected by the maximum of n_p. Note that a particular vector w may have an exceptionally large number of care bits. In such a case, even if a quantized matrix consists mostly of don't care bits and few patches are needed overall, all of the compressed vectors must employ a large number of bits to track the number of patches. To mitigate this problem and enhance the compression ratio, we divide a binary matrix into several blocks, and the patch-count width b_p is then obtained in each block independently. A different b_p is assigned to each block to reduce the overall data size.
Minimizing n_p for Small n_i: One simple patch-minimizing algorithm is to list all possible inputs x (for a given n_i) and the corresponding outputs through M in memory, and find a particular x that minimizes the number of patches. At the cost of high space complexity and memory consumption, such an exhaustive optimization guarantees a minimized n_p. An n_i below 30 is a practical value.
Interleaver: We occasionally observe that a large group of weights is pruned or unpruned altogether because of the unique properties of embedding and softmax layers, which distinguish them from typical layers. Interleavers and de-interleavers, widely used in digital communications Morelos-Zaragoza [2006], are useful for such uneven distributions of don't care bits. An interleaver intermixes weights in a random manner, and a de-interleaver recovers the original locations of the weights. Designing interleavers and de-interleavers with low encoding/decoding complexity for model compression would be an interesting research topic.

4 Related Works and Comparison
In this section, we introduce two previous approaches to representing sparse matrices. Table 2 compares the CSR format, the Viterbi-based index format, and our proposed format.
Compressed Sparse Row (CSR): Deep compression Han et al. [2016b] utilizes the Compressed Sparse Row (CSR) format to reduce the memory footprint on devices. Unfortunately, CSR formats present irregular data structures not readily supported by highly parallel computing systems such as CPUs and GPUs Lee et al. [2018a]. Due to uneven sparsity among rows, the computation time of algorithms based on CSR is limited by the least sparse row Zhou et al. [2018], as illustrated in Figure 2. Although Han et al. [2016a] suggested hardware support via a large buffer to improve load balancing, performance is still determined by the lowest pruning rate of a particular row. In contrast, our scheme provides a perfectly structured format of weights after compression such that high parallelism is maintained.
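To illustrate the load-imbalance issue (a minimal sketch, not Deep Compression's implementation), a hand-rolled CSR encoding makes the uneven per-row work visible:

```python
import numpy as np

def to_csr(dense):
    """Minimal CSR encoding: nonzero values, column indices, row pointers."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        nz = np.flatnonzero(row)
        data.extend(row[nz])
        indices.extend(nz)
        indptr.append(len(indices))  # cumulative nonzero count per row
    return np.array(data), np.array(indices), np.array(indptr)

W = np.array([[0, 2, 0, 0],
              [1, 0, 3, 4],   # the least sparse row dominates decode time
              [0, 0, 0, 5]])
data, indices, indptr = to_csr(W)
row_nnz = np.diff(indptr)     # uneven per-row work: [1, 3, 1]
assert list(row_nnz) == [1, 3, 1]
```

If each row is assigned to one parallel worker, the worker holding the middle row takes three times the work of the others, which is exactly the imbalance a fixed-rate XOR decoder avoids.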
Viterbi Approaches: Viterbi-based compression Lee et al. [2018a], Ahn et al. [2019] attempts to compress pruning-index data and quantized weights at a fixed compression ratio using don't care bits, similar to our approach. Quantized weights can be compressed by using the Viterbi algorithm to obtain a sequence of inputs to be fed into Viterbi encoders (one bit per cycle). Because each Viterbi encoder accepts only one bit per cycle, only an integer (= the number of Viterbi encoder outputs) is permitted as a compression ratio, while our proposed scheme allows any rational number (= n_o/n_i).
Because only one bit is used as input to each Viterbi encoder, Viterbi-based approaches require large hardware resources. For example, if a memory provides 1024 bits per cycle of bandwidth, then 1024 Viterbi encoders are required, where each Viterbi encoder entails multiple flip-flops to support sequence detection. On the other hand, our proposed scheme is resource-efficient in supporting large memory bandwidth because flip-flops are unnecessary and increasing n_o is limited only by the time complexity of Algorithm 1.
Table 2: Comparison of sparse-matrix representations.

              | CSR Format | Viterbi-based Compr. | Proposed
Encryption    | No         | Yes                  | Yes
Load Balance  | Uneven     | Even                 | Even
Decoding Rate | Variable   | Fixed                | Fixed
5 Conclusion
This paper proposes a compressed representation for Sparse Quantized Neural Networks based on an idea from test-data compression. By decoding with XOR gates and solving linear equations over GF(2), we can remove most don't care bits, so a quantized model is further compressed in proportion to its sparsity. Since our representation provides a regular compressed-weight format with a fixed and high compression ratio, SQNNs enable not only memory-footprint reduction but also inference performance improvements due to inherently parallelizable computations.
References
 Ahn et al. [2019] D. Ahn, D. Lee, T. Kim, and J.J. Kim. Double Viterbi: Weight encoding for high compression ratio and fast onchip reconstruction for deep neural network. In International Conference on Learning Representations (ICLR), 2019.

 Anwar et al. [2017] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
 Bayraktaroglu and Orailoglu [2001] I. Bayraktaroglu and A. Orailoglu. Test volume and application time reduction through scan chain concealment. In Proceedings of the 38th Annual Design Automation Conference, 2001.
 Chen et al. [2018] P. Chen, S. Si, Y. Li, C. Chelba, and C.J. Hsieh. GroupReduce: Blockwise lowrank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, 2018.
 Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J.P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
 Denil et al. [2013] M. Denil, B. Shakibi, L. Dinh, N. De Freitas, et al. Predicting parameters in deep learning. In Advances in neural information processing systems, pages 2148–2156, 2013.
 Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Guo et al. [2016] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
 Han et al. [2015] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 Han et al. [2016a] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 243–254, 2016a.
 Han et al. [2016b] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016b.

 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 He et al. [2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
 Hubara et al. [2016] I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio. Quantized neural networks: training neural networks with low precision weights and activations. arXiv:1609.07061, 2016.
 Jonathan Frankle [2019] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
 Kapoor et al. [2019] P. Kapoor, D. Lee, B. Kim, and S. Lee. Computationefficient quantization method for deep neural networks, 2019. URL https://openreview.net/forum?id=SyxnvsAqFm.
 Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 LeCun et al. [1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
 Lee and Kim [2018] D. Lee and B. Kim. Retrainingbased iterative weight quantization for deep neural networks. arXiv:1805.11233, 2018.
 Lee et al. [2018a] D. Lee, D. Ahn, T. Kim, P. I. Chuang, and J.J. Kim. Viterbibased pruning for sparse matrix with fixed and high index compression ratio. In International Conference on Learning Representations (ICLR), 2018a.
 Lee et al. [2018b] D. Lee, P. Kapoor, and B. Kim. Deeptwist: Learning model compression via occasional weight distortion. arXiv:1810.12823, 2018b.
 Lee et al. [2019] D. Lee, S. J. Kwon, P. Kapoor, B. Kim, and G.Y. Wei. Network pruning for lowrank binary indexing. arXiv:1905.05686, 2019.
 Li and Liu [2016] F. Li and B. Liu. Ternary weight networks. arXiv:1605.04711, 2016.
 Li et al. [2017] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
 Mao et al. [2017] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
 Marcus et al. [1994] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. ISBN 1558603573. doi: 10.3115/1075812.1075835. URL https://doi.org/10.3115/1075812.1075835.

 Molchanov et al. [2017] D. Molchanov, A. Ashukha, and D. P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
 Morelos-Zaragoza [2006] R. H. Morelos-Zaragoza. The Art of Error Correcting Coding. John Wiley & Sons, 2nd edition, 2006.
 Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNORNet: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
 Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s112630150816y.
 Touba [2006] N. A. Touba. Survey of test vector compression techniques. IEEE Design & Test of Computers, 23:294–303, 2006.

 Wang et al. [2018] P. Wang, X. Xie, L. Deng, G. Li, D. Wang, and Y. Xie. HitNet: Hybrid ternary recurrent neural network. In Advances in Neural Information Processing Systems, 2018.
 Xu et al. [2018] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations (ICLR), 2018.
 Ye et al. [2018] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smallernormlessinformative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124, 2018.
 Yu et al. [2017] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 548–560, 2017.
 Zhou et al. [2018] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen. Cambricons: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 15–28. IEEE, 2018.
 Zhu et al. [2017] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.
 Zhu and Gupta [2017] M. Zhu and S. Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR, abs/1710.01878, 2017.