1 Introduction
The performance of convolutional networks, or ConvNets, on image classification has steadily improved since the introduction of AlexNet NIPS2012_4824 . This progress has been fueled by deeper and richer architectures such as the ResNets DBLP:journals/corr/HeZRS15 and their variants ResNeXts Xie_2017 and DenseNets Huang_2017 . Those models particularly benefit from the recent progress made with weak supervision DBLP:journals/corr/abs180500932 ; yalniz2019billionscale ; berthelot2019mixmatch , which leverages GPUs with more memory and larger training datasets. These improvements are likely to continue and also transfer well to other vision applications He_2017 . There is a growing need to compress the best ConvNets for applications on embedded devices such as robotics or virtual/augmented reality.
Compression of ConvNets has been an active research topic in recent years, leading to networks with a top-1 accuracy on ImageNet object classification that fit in MB wang2018haq . Most of this progress stems from compression algorithms tailored for mobile-efficient architectures like MobileNets DBLP:journals/corr/HowardZCKWWAA17 or ShuffleNets DBLP:journals/corr/ZhangZLS17 . A lot of work focuses on improving the performance of these architectures sandler2018mobilenetv2 , but they are still behind the best ConvNets by a large margin. For instance, a MobileNet-v2 achieves top-1 accuracy while the latest ResNet-based models are now at hu2018squeeze .
Improving the quality of image classifiers on embedded devices may require departing from these mobile-efficient models and focusing on more traditional ConvNets. A difficulty is that these models have completely different architectures. Compression methods designed for mobile-efficient ConvNets do not transfer convincingly to ResNet-like networks, and vice versa. For example, MobileNets separate convolutions into two distinct operations that minimize the correlation between their weights. This design choice significantly reduces their memory footprint but makes structure-based compression algorithms irrelevant. In contrast, standard ConvNets are based on standard convolutions where the correlations between the filters can be exploited.
In this work, we propose a compression method particularly adapted to ResNet-like architectures. Our approach takes advantage of the high correlation in the convolutions by the use of a structured quantization algorithm, Product Quantization (PQ) jegou11pq . We focus on byte-aligned indexes (8 bits) as they lead to efficient inference on standard hardware, as opposed to entropy decoders han2015deep .
One of the challenges with applying standard sequential compression methods to modern ConvNets is their depth: the errors resulting from the compression accumulate when going up the network, resulting in a drift in performance. We reduce this issue by guiding the sequential compression with the uncompressed network. This allows for both an efficient layer-by-layer compression procedure and a global finetuning of the codewords based on distillation HinVin15Distilling .
Finally, our approach departs from traditional scalar han2015deep and vector gong2014compressing quantizers by focusing on the reconstruction of the activations, not of the weights. As depicted by Figure 1, this allows a better in-domain reconstruction and does not require any supervision: we only need a set of unlabelled input images to build the internal representation. We show that applying our approach to the semi-supervised ResNet50 of Yalniz et al. yalniz2019billionscale leads to a 5 MB memory footprint and a 76.12% top-1 accuracy on ImageNet object classification (hence a 20x compression vs. the original model).
2 Related work
As there exists a large body of literature for network compression, we review the work closest to ours and refer the reader to the two recent surveys DBLP:journals/corr/abs180804752 ; DBLP:journals/corr/abs171009282 for a more comprehensive overview.
Low-precision training.
Since early works like those of Courbariaux et al. DBLP:journals/corr/CourbariauxB16 ; DBLP:journals/corr/CourbariauxBD15 , researchers have developed various approaches to train networks with low-precision weights. Those approaches include training with binary or ternary weights DBLP:journals/corr/abs171007739 ; DBLP:journals/corr/ZhuHMD16 ; DBLP:journals/corr/LiL16 ; rastegari2016xnor , learning a combination of binary bases DBLP:journals/corr/abs171111294 and quantizing the activations DBLP:journals/corr/ZhouNZWWZ16 ; DBLP:journals/corr/ZhouYGXC17 ; DBLP:journals/corr/abs170901134 . Some of these methods assume the availability of specialized hardware that speeds up inference and improves power efficiency by replacing most arithmetic operations with bit-wise operations. However, the back-propagation has to be adapted to the case where the weights are discrete, using accumulation or projection techniques. Despite some noticeable progress in the context of binary networks, this particular choice of discretization suffers a significant drop in accuracy.
Quantization.
Vector Quantization (VQ) and Product Quantization (PQ) have been extensively studied in the context of nearest-neighbor search Jegou:2011:PQN:1916487.1916695 ; Ge:2014:OPQ:2693342.2693356 ; norouzi2013cartesian . The idea is to decompose the original high-dimensional space into a cartesian product of subspaces that are quantized separately (sometimes with a joint codebook). To our knowledge, Gong et al. gong2014compressing were the first to introduce these stronger quantizers in the context of neural network quantization. As we will see in the remainder of this paper, employing this discretization off-the-shelf does not optimize the right objective function, and leads to a catastrophic drift for deep networks.
Pruning.
Network pruning amounts to removing connections according to an importance criterion (typically the magnitude of the weight associated with this connection) until the desired model size/accuracy trade-off is reached Cun90optimalbrain . A natural extension of this work is to prune structural components of the network, for instance by enforcing channel-level sparsity Liu_2017 . However, those methods alternate between pruning and retraining steps and thus typically require a long training time.
Architectures.
Architectures such as SqueezeNet squeezenet , NASNet DBLP:journals/corr/ZophVSL17 , ShuffleNet DBLP:journals/corr/ZhangZLS17 ; DBLP:journals/corr/abs180711164 or MobileNet DBLP:journals/corr/HowardZCKWWAA17 ; DBLP:journals/corr/abs180104381 typically rely on a combination of depthwise and pointwise convolutional filters, sometimes along with channel splitting or shuffling. The architecture is either designed by hand or using the framework of architecture search howard2019searching . Those architectures trade off a low memory footprint against a loss of accuracy. For instance, the respective model size and test top-1 accuracy of ImageNet of a Mobilenet are MB for , to be compared with a vanilla ResNet50 with size MB for a top-1 of . Moreover, larger models such as ResNets can benefit from large-scale weakly or semi-supervised learning to reach better performance DBLP:journals/corr/abs180500932 ; yalniz2019billionscale .
Combining some of the mentioned approaches yields high compression factors as demonstrated by Han et al. with Deep Compression (DC) han2015deep or more recently by Tung & Mori tung2018 where the authors successively prune and quantize various architectures. Moreover, from a practical point of view, the process of compressing networks depends on the type of hardware on which the networks will run. Recent work directly quantizes to optimize energy efficiency and latency on specific hardware DBLP:journals/corr/abs181108886 . Finally, the memory overhead of storing the full activations is negligible compared to the storage of the weights for two reasons. First, in realistic real-time inference setups, the batch size is almost always equal to one. Second, a forward pass only requires storing the activations of the current layer (which are often smaller than the size of the input) and not the whole activations of the network.
3 Our approach
In this section, we describe our strategy for network compression. We first introduce our approach in the simple case of a fully-connected layer and then apply it to a convolutional layer. Finally, we show how to extend our approach to quantize a modern ConvNet architecture.
3.1 Layer quantization
In this section, we describe our layer quantization method in the case of a fully-connected layer and a convolutional layer. The specificity of our approach is that it aims at a small reconstruction error for the outputs of the layer, not the layer weights themselves.
3.1.1 Fully-connected layer
In this section, we consider a fully-connected layer with weights $W \in \mathbb{R}^{C_{in} \times C_{out}}$ and, without loss of generality, we omit the bias since it does not impact the reconstruction error.
Product Quantization (PQ).
Applying the PQ algorithm to the columns of the weight matrix $W \in \mathbb{R}^{C_{in} \times C_{out}}$ consists in evenly splitting each column into $m$ contiguous subvectors and learning a codebook on the resulting $m C_{out}$ subvectors. Then, a column of $W$ is quantized by mapping each of its subvectors to its nearest codeword in the codebook. For simplicity, we assume that $C_{in}$ is a multiple of $m$, i.e., that all the subvectors have the same dimension $d = C_{in}/m$.
More formally, the codebook $\mathcal{C} = \{c_1, \dots, c_k\}$ contains $k$ codewords of dimension $d$. Any column $w_j$ of $W$ is mapped to its quantized version $q(w_j) = (c_{i_1}, \dots, c_{i_m})$, where $i_1$ denotes the index of the codeword assigned to the first subvector of $w_j$, and so forth. The codebook is then learned by minimizing the following objective function:
$$\|W - \widehat{W}\|_2^2 = \sum_j \|w_j - q(w_j)\|_2^2, \tag{1}$$
where $\widehat{W}$ denotes the quantized weights. This objective can be efficiently minimized with $k$-means. When $m$ is set to $1$, PQ is equivalent to vector quantization (VQ), and when $m$ is equal to $C_{in}$, it is the scalar $k$-means algorithm. The main benefit of PQ is its expressivity: each column $w_j$ is mapped to a vector in the product $\mathcal{C} \times \dots \times \mathcal{C}$ ($m$ times), thus PQ generates an implicit codebook of size $k^m$.
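The PQ procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`kmeans`, `nearest`, `product_quantize`), the toy layer size and the fixed number of iterations are all hypothetical, and empty clusters are simply left untouched.

```python
import numpy as np

def nearest(points, codebook):
    """Index of the closest codeword (squared Euclidean distance) for each row."""
    dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means on the rows of `points`. Returns (codebook, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialize with k distinct rows sampled uniformly.
    codebook = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        assign = nearest(points, codebook)
        for c in range(k):
            members = points[assign == c]
            if len(members) > 0:  # empty clusters are left as-is in this sketch
                codebook[c] = members.mean(axis=0)
    return codebook, nearest(points, codebook)

def product_quantize(W, d, k):
    """Quantize the columns of W (C_in x C_out) using blocks of size d and k codewords."""
    c_in, c_out = W.shape
    assert c_in % d == 0, "C_in must be a multiple of the block size d"
    m = c_in // d
    # Row j of W.T is column j of W; cut it into m contiguous subvectors of size d.
    blocks = W.T.reshape(c_out * m, d)
    codebook, assign = kmeans(blocks, k)
    W_hat = codebook[assign].reshape(c_out, c_in).T
    return W_hat, codebook, assign.reshape(c_out, m)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))  # toy layer: C_in = 8, C_out = 6
W_hat, codebook, idx = product_quantize(W, d=4, k=4)
print(W_hat.shape, codebook.shape, idx.shape)  # (8, 6) (4, 4) (6, 2)
print("PQ objective:", ((W - W_hat) ** 2).sum())
```

Only `codebook` and the integer indices `idx` need to be stored. With $k = 4$ codewords and $m = 2$ subvectors per column, each column can take one of $k^m = 16$ values.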
Our algorithm.
PQ quantizes the weight matrix of the fully-connected layer. However, in practice, we are interested in preserving the output of the layer, not its weights. This is illustrated in the case of a non-linear classifier in Figure 1: preserving the weights of a layer does not necessarily guarantee preserving its output. In other words, the Frobenius approximation of the weights of a layer is not guaranteed to be the best approximation of the output over some arbitrary domain (in particular for in-domain inputs). We thus propose an alternative to PQ that directly minimizes the reconstruction error on the output activations obtained by applying the layer to in-domain inputs. More precisely, given a batch of $B$ input activations $x \in \mathbb{R}^{B \times C_{in}}$, we are interested in learning a codebook $\mathcal{C}$ that minimizes the difference between the output activations and their reconstructions:
$$\|y - \widehat{y}\|_2^2 = \sum_j \|x\,(w_j - q(w_j))\|_2^2, \tag{2}$$
where $y = xW$ is the output and $\widehat{y} = x\widehat{W}$ its reconstruction from the quantized weights $\widehat{W}$, and $q(w_j)$ is the quantized version of column $w_j$ of $W$. Our objective is a reweighting of the objective in Equation (1). We can thus learn our codebook with a weighted $k$-means algorithm. First, we unroll $x$ of size $B \times C_{in}$ into $\widetilde{x}$ of size $(B \times m) \times d$, i.e., we split each row of $x$ into $m$ subvectors of size $d$ and stack these subvectors. Next, we adapt the EM algorithm as follows.

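E-step (cluster assignment). Recall that every column $w_j$ is divided into $m$ subvectors of dimension $d$. Each subvector $v$ is assigned to the codeword $c_j$ such that
$$c_j = \operatorname*{argmin}_{c \in \mathcal{C}} \|\widetilde{x}\,(c - v)\|_2^2. \tag{3}$$
This step is performed by exhaustive exploration. Our implementation relies on broadcasting to be computationally efficient.

M-step (codeword update). Let us consider a codeword $c \in \mathcal{C}$. We denote by $(v_p)_{p \in I_c}$ the subvectors that are currently assigned to $c$. Then, we update $c \leftarrow c^\star$, where
$$c^\star = \operatorname*{argmin}_{c \in \mathbb{R}^d} \sum_{p \in I_c} \|\widetilde{x}\,(c - v_p)\|_2^2. \tag{4}$$
This step is performed by explicitly computing the solution of the least-squares problem (footnote 2: denoting by $\widetilde{x}^{+}$ the Moore-Penrose pseudoinverse of $\widetilde{x}$, we obtain $c^\star = \frac{1}{|I_c|}\,\widetilde{x}^{+}\big(\sum_{p \in I_c} \widetilde{x}\, v_p\big)$). In our implementation, we perform the computation of the pseudoinverse of $\widetilde{x}$ before alternating between the Expectation and Minimization steps as it does not depend on the learned codebook $\mathcal{C}$.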
We initialize the codebook by uniformly sampling vectors among those we wish to quantize. After performing the E-step, some clusters may be empty. To resolve this issue, we iteratively perform the following additional steps for each empty cluster of index . (1) Find codeword corresponding to the most populated cluster ; (2) define new codewords and , where and (3) perform again the E-step. We proceed to the M-step after all the empty clusters are resolved. We set and we observe that it generally takes 1 or 2 EM iterations to resolve all the empty clusters. Note that the quality of the resulting compression is sensitive to the choice of and we discuss its influence in Section 4.1.
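One EM iteration of the activation-weighted $k$-means can be sketched as follows. This is a hedged illustration rather than the authors' code: `x_tilde` stands for the unrolled activations, `blocks` for the subvectors to quantize, the function names and toy sizes are hypothetical, and the empty-cluster handling described above is omitted.

```python
import numpy as np

def objective(x_tilde, blocks, codebook, assign):
    """Sum over subvectors v of ||x_tilde (c_assigned - v)||^2."""
    residual = (codebook[assign] - blocks) @ x_tilde.T  # (n_blocks, n_rows)
    return float((residual ** 2).sum())

def e_step(x_tilde, blocks, codebook):
    """Assign each subvector v to the codeword c minimizing ||x_tilde (c - v)||^2."""
    xv = blocks @ x_tilde.T    # (n_blocks, n_rows): x_tilde v for each subvector
    xc = codebook @ x_tilde.T  # (k, n_rows): x_tilde c for each codeword
    # Broadcasting yields all pairwise distances at once: shape (n_blocks, k).
    dists = ((xv[:, None, :] - xc[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def m_step(x_tilde, x_pinv, blocks, assign, codebook):
    """Update each codeword with the least-squares solution for its cluster."""
    new_codebook = codebook.copy()
    for c in range(len(codebook)):
        members = blocks[assign == c]
        if len(members) == 0:
            continue  # empty clusters are not handled in this sketch
        # (1/p) * pinv(x_tilde) @ sum_p (x_tilde @ v_p)
        target = x_tilde @ members.sum(axis=0)
        new_codebook[c] = x_pinv @ target / len(members)
    return new_codebook

rng = np.random.default_rng(0)
x_tilde = rng.normal(size=(50, 4))  # toy unrolled activations, d = 4
blocks = rng.normal(size=(30, 4))   # toy subvectors to quantize
codebook = blocks[rng.choice(30, size=5, replace=False)].copy()
x_pinv = np.linalg.pinv(x_tilde)    # computed once, reused at every M-step

assign = e_step(x_tilde, blocks, codebook)
before = objective(x_tilde, blocks, codebook, assign)
codebook = m_step(x_tilde, x_pinv, blocks, assign, codebook)
after = objective(x_tilde, blocks, codebook, assign)
print("objective before/after M-step:", before, after)
```

In this toy setup `x_tilde` has full column rank, so the M-step coincides with the plain cluster mean. The E-step still differs from standard $k$-means because distances are measured after multiplication by `x_tilde`.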
3.1.2 Convolutional layers
Despite being presented in the case of a fully-connected layer, our approach works on any set of vectors. As a consequence, our approach can be applied to a convolutional layer if we split the associated 4D weight matrix into a set of vectors. There are many ways to split a 4D matrix into a set of vectors, and we aim for one that maximizes the correlation between the vectors, since vector-quantization-based methods work best when the vectors are highly correlated.
Given a convolutional layer, we have $C_{out}$ filters of size $K \times K \times C_{in}$, leading to an overall 4D weight matrix $W \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$. The dimensions along the output and input coordinates have no particular reason to be correlated. On the other hand, the spatial dimensions related to the filter size are by nature very correlated: nearby patches or pixels likely share information. As depicted in Figure 2, we thus reshape the weight matrix in a way that leads to spatially coherent quantization. More precisely, we quantize $W$ spatially into subvectors of size $d = K \times K$ using the following procedure. We first reshape $W$ into a 2D matrix of size $(C_{in} \times K \times K) \times C_{out}$. Column $j$ of the reshaped matrix $W_r$ corresponds to the $j$-th filter of $W$ and is divided into $C_{in}$ subvectors of size $K \times K$. Similarly, we reshape the input activations $x$ accordingly to $x_r$ so that reshaping back the matrix $x_r W_r$ yields the same result as $x * W$. In other words, we adopt a dual approach to the one using bi-level Toeplitz matrices to represent the weights. Then, we apply our method exposed in Section 3.1 to quantize each column of $W_r$ into $m = C_{in}$ subvectors of size $d = K \times K$ with $k$ codewords, using $x_r$ as input activations in (2). As a natural extension, we also quantize with larger subvectors; see Section 4 for details.
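The reshaping can be checked numerically. The sketch below assumes a stride-1 convolution without padding and a single input image; `conv2d_direct`, `reshape_filters` and `unfold` are hypothetical helper names, not part of the authors' code. It verifies that multiplying the unfolded activations by the reshaped weights reproduces the convolution, and that each column of the reshaped matrix splits into one $K \times K$ subvector per input channel.

```python
import numpy as np

def conv2d_direct(x, W):
    """Reference convolution (stride 1, no padding).
    x: (C_in, H, W_img), W: (C_out, C_in, K, K)."""
    c_out, _, k, _ = W.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]
            y[:, i, j] = (W * patch[None]).sum(axis=(1, 2, 3))
    return y

def reshape_filters(W):
    """(C_out, C_in, K, K) -> (C_in*K*K, C_out); column j is the j-th filter."""
    return W.reshape(W.shape[0], -1).T

def unfold(x, k):
    """Stack every flattened (C_in, K, K) patch of x as a row."""
    _, h, w = x.shape
    rows = [x[:, i:i + k, j:j + k].reshape(-1)
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(rows)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6, 6))     # toy input: C_in = 3
W = rng.normal(size=(5, 3, 3, 3))  # toy layer: C_out = 5, K = 3
W_r = reshape_filters(W)           # shape (27, 5)
x_r = unfold(x, k=3)               # shape (16, 27)
y = (x_r @ W_r).T.reshape(5, 4, 4)
print(np.allclose(y, conv2d_direct(x, W)))

# Column j of W_r, cut into C_in subvectors of size K*K: one per 2D kernel slice.
subvectors = W_r[:, 2].reshape(3, 9)
print(np.allclose(subvectors[1], W[2, 1].reshape(-1)))
```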
3.2 Network quantization
In this section, we describe our approach for quantizing a neural network. First, we quantize the network sequentially starting from the lowest layer to the highest layer. Then, in a final step, we globally finetune the codebooks of all the layers to reduce any residual drifts and we update the running statistics of the BatchNorm layers.
3.2.1 Bottom-up quantization
We explain below the quantization procedure in which we guide the compression of the student network by the non-compressed teacher network.
Learning the codebook.
We recover the current input activations of the layer, i.e., the input activations obtained by forwarding a batch of images through the quantized lower layers, and we quantize the current layer using those activations. Experimentally, we observed a drift in both the reconstruction and classification errors when using the activations of the non-compressed network rather than the current activations.
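The sequential procedure can be sketched on a toy stack of linear layers with ReLU. This is an illustration under loud assumptions: `round_quantizer` is a hypothetical stand-in that ignores the activations and rounds weights to a grid, used only to keep the example short. In the method described here, the per-layer quantizer would be the activation-weighted PQ of Section 3.1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def round_quantizer(W, x, step=0.25):
    """Hypothetical stand-in quantizer: ignores x and rounds weights to a grid."""
    return np.round(W / step) * step

def quantize_bottom_up(weights, images, quantize_layer):
    """Quantize layers from lowest to highest, feeding each layer the
    activations produced by the already-quantized lower layers."""
    quantized, seen_inputs = [], []
    current = images
    for W in weights:
        seen_inputs.append(current)            # "current" input activations
        W_hat = quantize_layer(W, current)
        quantized.append(W_hat)
        current = relu(current @ W_hat)        # forward through the quantized layer
    return quantized, seen_inputs

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 6)), rng.normal(size=(6, 4))]
images = rng.normal(size=(16, 8))
quantized, seen_inputs = quantize_bottom_up(weights, images, round_quantizer)

# The second layer is quantized using outputs of the quantized first layer,
# not outputs of the original first layer.
print(np.allclose(seen_inputs[1], relu(images @ quantized[0])))
```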
Finetuning the codebook.
We finetune the codewords by distillation HinVin15Distilling using the non-compressed network as the teacher network and the compressed network (up to the current layer) as the student network. Denoting $y_t$ (resp. $y_s$) the output probabilities of the teacher (resp. student) network, the loss $\mathcal{L}$ we optimize is the Kullback-Leibler divergence between $y_s$ and $y_t$. Finetuning on codewords is done by averaging the gradients of each subvector assigned to a given codeword. More formally, after the quantization step, we fix the assignments once and for all. Then, denoting $(b_p)_{p \in I_c}$ the subvectors that are assigned to codeword $c$, we perform the SGD update with a learning rate $\eta$:
$$c \leftarrow c - \eta\,\frac{1}{|I_c|} \sum_{p \in I_c} \frac{\partial \mathcal{L}}{\partial b_p}. \tag{5}$$
Experimentally, we find this approach to perform better than finetuning on the target of the images, as demonstrated in Table 3. Moreover, this approach does not require any labelled data.
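The codeword update by gradient averaging can be sketched as below. The gradients are hand-written toy values rather than the result of back-propagation, and the function names are hypothetical. The direction of the divergence (teacher distribution as reference) is an assumption of this sketch, since the text does not fix the argument order.

```python
import numpy as np

def kl_teacher_student(p_teacher, p_student, eps=1e-12):
    """Batch mean of sum_i p_t[i] * log(p_t[i] / p_s[i]) (assumed direction)."""
    ratio = np.log(p_teacher + eps) - np.log(p_student + eps)
    return float((p_teacher * ratio).sum(axis=1).mean())

def update_codewords(codebook, assign, block_grads, lr):
    """c <- c - lr * mean of the gradients of the subvectors assigned to c.
    Assignments stay fixed; only the codewords move."""
    new_codebook = codebook.copy()
    for c in range(len(codebook)):
        mask = assign == c
        if mask.any():
            new_codebook[c] -= lr * block_grads[mask].mean(axis=0)
    return new_codebook

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
assign = np.array([0, 0, 1])  # subvectors 0 and 1 share codeword 0
block_grads = np.array([[1.0, 1.0], [3.0, 3.0], [2.0, 0.0]])  # toy dL/db_p
updated = update_codewords(codebook, assign, block_grads, lr=0.5)
print(updated)  # codeword 0 moves to [-1, -1], codeword 1 to [0, 1]

p_t = np.array([[0.5, 0.5]])
p_s = np.array([[0.9, 0.1]])
print(kl_teacher_student(p_t, p_t), kl_teacher_student(p_t, p_s))
```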
3.2.2 Global finetuning
We empirically find it beneficial to finetune all the centroids after the whole network is quantized. The finetuning procedure is exactly the same as described in Section 3.2.1, except that we additionally switch the BatchNorms to the training mode, meaning that the learnt coefficients are still fixed but that the batch statistics (running mean and variance) are still being updated with the standard moving average procedure.
We perform the global finetuning using the standard ImageNet training set for epochs with an initial learning rate of , a weight decay of and a momentum of . The learning rate is decayed by a factor every epochs. As demonstrated in the ablation study in Table 3, finetuning on the true labels performs worse than finetuning by distillation. A possible explanation is that the supervision signal coming from the teacher network is richer than the onehot vector used as a traditional learning signal in supervised learning HinVin15Distilling .
4 Experiments
4.1 Experimental setup
We quantize vanilla ResNet18 and ResNet50 architectures pretrained on the ImageNet dataset imagenet_cvpr09 . Unless explicitly mentioned otherwise, the pretrained models are taken from the PyTorch model zoo (footnote 3: https://pytorch.org/docs/stable/torchvision/models). We run our method on a 16 GB Volta V100 GPU. Quantizing a ResNet50 with our method (including all finetuning steps) takes about one day on 1 GPU. We detail our experimental setup below. Our code and the compressed models are open-sourced.
Compression regimes.
We explore a large-block (resp. small-block) compression regime by setting the subvector size of regular 3×3 convolutions to (resp. ) and the subvector size of pointwise convolutions to (resp. ). For ResNet18, the block size of pointwise convolutions is always equal to . The number of codewords or centroids is set to for each compression regime. Note that we clamp the number of centroids to for stability. For instance, the first layer of the first stage of the ResNet50 has size 64 64 1 , thus we always use centroids with a block size . For a given number of centroids , small blocks lead to a lower compression ratio than large blocks.
Sampling the input activations.
Before quantizing each layer, we randomly sample a batch of training images to obtain the input activations of the current layer and reshape it as described in Section 3.1.2. Then, before each iteration (E+M step) of our method, we randomly sample rows from those reshaped input activations.
Hyperparameters.
We quantize each layer while performing steps of our method (sufficient for convergence in practice). We finetune the centroids of each layer on the standard ImageNet training set during iterations with a batch size of (resp ) for the ResNet18 (resp. ResNet50) with a learning rate of , a weight decay of and a momentum of . For accuracy and memory reasons, the classifier is always quantized with a block size and (resp. ) centroids for the ResNet18 (resp. ResNet50). Moreover, the first convolutional layer of size is not quantized, as it represents less than (resp. ) of the weights of a ResNet18 (resp. ResNet50).
Metrics.
We focus on the tradeoff between two metrics, namely accuracy and memory. The accuracy is the top1 error on the standard validation set of ImageNet. The memory footprint is calculated as the indexing cost (number of bits per weight) plus the overhead of storing the centroids in float16. As an example, quantizing a layer of size with centroids (1 byte per subvector) and a block size of leads to an indexing cost of KB for blocks plus the cost of storing the centroids of KB.
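The memory accounting described above can be written out as a short calculation. The layer size, block size and number of centroids below are illustrative values chosen for this sketch, not figures quoted from the text, and the function name is hypothetical. Centroids are stored in float16 (2 bytes per value), as stated above.

```python
import math

def pq_layer_size_bytes(num_weights, block_size, num_centroids, bytes_per_float=2):
    """Return (indexing cost, centroid storage cost) in bytes for one layer."""
    num_blocks = num_weights // block_size
    bits_per_index = math.ceil(math.log2(num_centroids))  # 8 bits for 256 centroids
    indexing_bytes = num_blocks * bits_per_index / 8
    centroid_bytes = num_centroids * block_size * bytes_per_float
    return indexing_bytes, centroid_bytes

# Illustrative layer: 128 x 128 x 3 x 3 weights, blocks of size 9, 256 centroids.
indexing, centroids = pq_layer_size_bytes(128 * 128 * 3 * 3, block_size=9, num_centroids=256)
print(indexing / 1024, centroids / 1024)  # 16.0 4.5 (in KB)
```

With 256 centroids each index fits in one byte, which is what makes the resulting representation byte-aligned.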
4.2 Image classification results
We report below the results of our method applied to various ResNet models. First, we compare our method with the state of the art on the standard ResNet18 and ResNet50 architecture. Next, we show the potential of our approach on a competitive ResNet50. Finally, an ablation study validates the pertinence of our method.
Vanilla ResNet18 and ResNet50.
We evaluate our method on the ImageNet benchmark for ResNet18 and ResNet50 architectures and compare our results to the following methods: Trained Ternary Quantization (TTQ) DBLP:journals/corr/ZhuHMD16 , LRNet DBLP:journals/corr/abs171007739 , ABCNet DBLP:journals/corr/abs171111294 , Binary Weight Network (XNOR-Net or BWN) rastegari2016xnor , Deep Compression (DC) han2015deep and Hardware-Aware Automated Quantization (HAQ) DBLP:journals/corr/abs181108886 . We report the accuracies and compression factors in the original papers and/or in the two surveys DBLP:journals/corr/abs180804752 ; DBLP:journals/corr/abs171009282 for a given architecture when the result is available. We do not compare our method to DoReFa-Net DBLP:journals/corr/ZhouNZWWZ16 and WRPN DBLP:journals/corr/abs170901134 as those approaches also use low-precision activations and hence get lower accuracies, e.g., 51.2% top-1 accuracy for a XNOR-Net with ResNet18. The results are presented in Figure 1. For better readability, some results for our method are also displayed in Table 1. We report the average accuracy and standard deviation over 3 runs. Our method significantly outperforms state-of-the-art papers for various operating points. For instance, for a ResNet18, our method with large blocks and centroids reaches a larger accuracy than ABCNet () with a compression ratio that is 2x larger. Similarly, on the ResNet50, our compressed model with centroids in the large blocks setup yields a comparable accuracy to DC (2 bits) with a compression ratio that is 2x larger.
The work by Tung & Mori tung2018 is likely the only one that remains competitive with ours with a 6.8 MB network after compression, with a technique that prunes the network and therefore implicitly changes the architecture. The authors report the delta accuracy for which we have no direct comparable top-1 accuracy, but their method is arguably complementary to ours.
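Table 1: Selected results of our method on vanilla ResNet18 and ResNet50.

| Model (original top-1) | Compression | Size ratio | Model size | Top-1 (%) |
|---|---|---|---|---|
| ResNet18 (69.76%) | Small blocks | 29x | 1.54 MB | 65.81 |
| ResNet18 (69.76%) | Large blocks | 43x | 1.03 MB | 61.10 |
| ResNet50 (76.15%) | Small blocks | 19x | 5.09 MB | 73.79 |
| ResNet50 (76.15%) | Large blocks | 31x | 3.19 MB | 68.21 |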
Semi-supervised ResNet50.
Recent works DBLP:journals/corr/abs180500932 ; yalniz2019billionscale have demonstrated the possibility to leverage a large collection of unlabelled images to improve the performance of a given architecture. In particular, Yalniz et al. yalniz2019billionscale use the publicly available YFCC100M dataset journals/corr/ThomeeSFENPBL15 to train a ResNet50 that reaches top-1 accuracy on the standard validation set of ImageNet. In the following, we use this particular model and refer to it as semi-supervised ResNet50. In the low compression regime (block sizes of 4 and 9), with centroids (practical for implementation), our compressed semi-supervised ResNet50 reaches 76.12% top-1 accuracy. In other words, the compressed model attains the performance of a vanilla, non-compressed ResNet50 while having a size of 5 MB (vs. 97.5 MB for the non-compressed ResNet50).
Comparison for a given size budget.
To ensure a fair comparison, we compare our method for a given model size budget against the reference methods in Table 2. It should be noted that our method can further benefit from advances in semi-supervised learning to boost the performance of the non-compressed and hence of the compressed network.
Table 2: Top-1 accuracy for a given model size budget.

| Size budget | Best previous published method | Ours |
|---|---|---|
| 1 MB | 70.90% (HAQ DBLP:journals/corr/abs181108886 , MobileNet v2) | 64.01% (vanilla ResNet18) |
| 5 MB | 71.74% (HAQ DBLP:journals/corr/abs181108886 , MobileNet v1) | 76.12% (semi-sup. ResNet50) |
| 10 MB | 75.30% (HAQ DBLP:journals/corr/abs181108886 , ResNet50) | 77.85% (semi-sup. ResNet50) |
Ablation study.
We perform an ablation study on the vanilla ResNet18 to study the respective effects of quantizing using the activations and finetuning by distillation (here, finetuning refers both to the per-layer finetuning described in Section 3 and to the global finetuning after the quantization described in Section 4.1). We refer to our method as Act + Distill. First, we still finetune by distillation but change the quantization: instead of quantizing using our method (see Equation (2)), we quantize using the standard PQ algorithm and do not take the activations into account, see Equation (1). We refer to this method as No act + Distill. Second, we quantize using our method but perform a standard finetuning using the image labels (Act + Labels). The results are displayed in Table 3. Our approach yields the best accuracy at every operating point. As a side note, quantizing all the layers of a ResNet18 with the standard PQ algorithm and without any finetuning leads to top-1 accuracies below for all operating points, which illustrates the drift in accuracy occurring when compressing deep networks with standard methods (as opposed to our method).
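Table 3: Ablation study on the vanilla ResNet18 (top-1 accuracy, %).

| Compression | Centroids | No act + Distill | Act + Labels | Act + Distill (ours) |
|---|---|---|---|---|
| Small blocks | 256 | 64.76 | 65.55 | 65.81 |
| Small blocks | 512 | 66.31 | 66.82 | 67.15 |
| Small blocks | 1024 | 67.28 | 67.53 | 67.87 |
| Small blocks | 2048 | 67.88 | 67.99 | 68.26 |
| Large blocks | 256 | 60.46 | 61.01 | 61.18 |
| Large blocks | 512 | 63.21 | 63.67 | 63.99 |
| Large blocks | 1024 | 64.74 | 65.48 | 65.72 |
| Large blocks | 2048 | 65.94 | 66.21 | 66.50 |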
4.3 Image detection results
To demonstrate the generality of our method, we compress the Mask R-CNN architecture used for image detection in many real-life applications He_2017 . We compress the backbone (ResNet50 FPN) in the small blocks compression regime and refer the reader to the open-sourced compressed model for the block sizes used in the various heads of the network. We use centroids for every layer. We perform the finetuning (layer-wise and global) using distributed training on 8 V100 GPUs. Results are displayed in Table 4. We argue that this provides an interesting point of comparison for future work aiming at compressing such architectures for various applications.
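Table 4: Compression results for Mask R-CNN (ResNet50 FPN backbone).

| Model | Size | Box AP | Mask AP |
|---|---|---|---|
| Non-compressed | 170 MB | 37.9 | 34.6 |
| Compressed | 6.51 MB | 33.9 | 30.8 |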
5 Conclusion
We presented a quantization method based on Product Quantization that gives state-of-the-art results on ResNet architectures and that generalizes to other architectures such as Mask R-CNN. Our compression scheme does not require labeled data and the resulting models are byte-aligned, allowing for efficient inference on CPU.
Further research directions include testing our method on a wider variety of architectures. In particular, our method can be readily adapted to simultaneously compress and transfer ResNets trained on ImageNet to other domains. Finally, we plan to take the nonlinearity into account to improve our reconstruction error.
Acknowledgements
We thank Francisco Massa and Alexandre Défossez for their advice regarding code optimization.
References
 [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, 2015.
 [2] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billionscale semisupervised learning for image classification. arXiv eprints, 2019.

 [3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 [4] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition, 2017.
 [5] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. Conference on Computer Vision and Pattern Recognition, 2017.
 [6] Dhruv Mahajan, Ross B. Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. CoRR, 2018.
 [7] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semisupervised learning. arXiv preprint arXiv:1905.02249, 2019.
 [8] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask rcnn. International Conference on Computer Vision (ICCV), 2017.
 [9] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: hardwareaware automated quantization. arXiv preprint arXiv:1811.08886, 2018.
 [10] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, 2017.
 [11] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, 2017.
 [12] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [13] Jie Hu, Li Shen, and Gang Sun. Squeezeandexcitation networks. In Conference on Computer Vision and Pattern Recognition, 2018.
 [14] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [15] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations, 2016.

 [16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. NIPS Deep Learning Workshop, 2014.
 [17] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [18] Yunhui Guo. A survey on methods and theories of quantized neural networks. CoRR, 2018.
 [19] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, 2017.
 [20] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or 1. CoRR, 2016.
 [21] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR, 2015.
 [22] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. CoRR, 2017.
 [23] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. CoRR, 2016.
 [24] Fengfu Li and Bin Liu. Ternary weight networks. CoRR, 2016.
 [25] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, 2016.
 [26] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. CoRR, 2017.
 [27] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, 2016.
 [28] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. CoRR, 2017.
 [29] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. WRPN: wide reducedprecision networks. CoRR, 2017.
 [30] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [31] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 2014.

 [32] Mohammad Norouzi and David J Fleet. Cartesian k-means. In Conference on Computer Vision and Pattern Recognition, 2013.
 [33] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.
 [34] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. International Conference on Computer Vision, 2017.
 [35] Forrest Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR, 2016.
 [36] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, 2017.
 [37] Ningning Ma, Xiangyu Zhang, HaiTao Zheng, and Jian Sun. Shufflenet V2: practical guidelines for efficient CNN architecture design. CoRR, 2018.
 [38] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, 2018.
 [39] Andrew Howard, Mark Sandler, Grace Chu, LiangChieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. arXiv eprints, 2019.
 [40] Frederick Tung and Greg Mori. Deep neural network compression by inparallel pruningquantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [41] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: hardware-aware automated quantization. CoRR, 2018.
 [42] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In Conference on Computer Vision and Pattern Recognition, 2009.
 [43] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and LiJia Li. The new data and new challenges in multimedia research. CoRR, 2015.