Deep neural networks (DNNs) are widely used for solving various artificial intelligence (AI) tasks, such as image classification [Krizhevsky et al.2012], object detection [Girshick et al.2014], natural language processing (NLP) [Li2017] and deep reinforcement learning (RL) [Mnih et al.2013]. However, the training, inference and storage of a modern deep neural network typically require powerful GPUs, dedicated hardware accelerators and large storage resources, which hinders the wide application of DNNs to edge devices where memory and computational capacities are limited. Much research has focused on compressing deep learning models without significant performance degradation to save computation cost and memory storage, through techniques such as pruning [Han et al.2015b, Li et al.2016b, Liu et al.2019], quantization [Courbariaux et al.2016, Choi et al.2018, Leng et al.2018] and knowledge distillation [Hinton et al.2015].
For model quantization, many efforts have been made to reduce the model size and accelerate model inference on various hardware. It has been well demonstrated that directly quantizing a trained floating-point model to 16 bits or 8 bits does not significantly degrade the accuracy. To achieve much higher energy efficiency on resource-constrained edge devices, extremely low-bit quantization approaches have been proposed in the literature [Courbariaux et al.2015, Li et al.2016a, Choi et al.2018, Rastegari et al.2016], which use binary or ternary states to represent the weights and only a very limited number of bits to represent the activations, so that the multiplication operations can be eliminated. The fully binary XNOR network [Rastegari et al.2016] even reduces the computation to XNOR and pop-count operations.
However, most quantization approaches only investigate uniform-bitwidth quantization across all layers of a DNN, which is usually sub-optimal under a given compression constraint. The precision of each layer has a different influence on the final accuracy, as discussed in [Wang et al.2018], and this influence also varies with the network architecture. Exploring low-bit hybrid quantization across network layers is therefore vital for deep compression of DNNs without accuracy degradation. Conventional hybrid quantization relies on domain experts and empirical rules to find the best hybrid quantization policy, as in [Wang et al.2018]. Recently, an RL-based automatic hybrid quantization search approach was proposed in [Wang et al.2019]. The authors demonstrate that the searched hybrid quantization can exploit the compression ability of networks and outperform uniform quantization. However, they focus only on relatively high-precision quantization, where the quantized network can be fine-tuned from a trained floating-point network to significantly reduce the search time. In the extremely low-bit case, it is impossible to directly fine-tune from a trained floating-point network, and the tedious process of training from scratch takes considerable time and resources, which makes their RL-based approach infeasible.
Our method is mainly inspired by the recent work on MetaPruning [Liu et al.2019] and Hypernetworks [Ha et al.2016]. Hypernetworks demonstrate a way to use an external network, known as a hypernetwork, to generate the weights for another network. In [Liu et al.2019], a hypernetwork called PruningNet is built to generate weight parameters for various pruned target networks. The PruningNet is trained with random filter numbers as the encoding vector for each layer, and the optimal pruning structure of the target network is obtained by searching over the trained PruningNet. Similarly, we adopt the meta learning framework to realize automatic low-bit hybrid quantization. Instead of seeking the best pruning structure, we search for the best hybrid quantization policy of a quantized network. In our method, we use a MetaQuantNet as a hypernetwork to predict the weights of each layer in the target quantized network. The MetaQuantNet, together with a quantization function (Q), is trained to generate the quantized weights for the target network. Under given constraints, the best hybrid quantization policy can be obtained by searching over the well-trained MetaQuantNet. Note that in this work we quantize all layers of the target neural network and focus only on weight quantization.
The proposed method is illustrated in Figure 1. The whole process is divided into three stages: training, search and retraining. In the training stage, randomly sampled values between 1 and 8 that encode the quantization bitwidths are the inputs of the MetaQuantNet, and the same quantization encoding vector simultaneously controls the quantization function (Q). Each entry of the quantization encoding vector corresponds to the bitwidth of one layer of the target quantized network. In the second stage, we apply a genetic algorithm to search for the best hybrid quantization combination under given constraints. Only the top-N results that meet the constraints are preserved as parent genes to generate offspring. Finally, with the searched best hybrid quantization policy, we retrain or fine-tune the MetaQuantNet to further improve the performance of the quantized target network.
In the existing reinforcement learning based hybrid quantization search approach HAQ [Wang et al.2019], the reinforcement learning search has to be executed repeatedly when various compression constraints are given. Each exploration process corresponds to one set of compression constraints, and the best policy cannot be obtained until the exploration has finished. If the constraints change even slightly, the exploration process has to be repeated. In our method, however, the MetaQuantNet is trained with various hybrid quantization policies and thus acquires the meta-knowledge for these tasks. Once the MetaQuantNet is well trained, it can predict the weights of target networks for various hybrid quantization encoding vectors. Hence the MetaQuantNet needs to be trained only once, and the best hybrid quantization policies can be searched quickly for different compression constraints under the same workflow.
The primary contributions of this work include:
We propose a method for automatic low-bit hybrid quantization of neural networks through meta learning, which removes the human effort of designing hybrid bitwidths layer by layer. Besides, our approach can be easily combined with most existing AutoML techniques in an out-of-the-box fashion: after the optimal neural network structure is obtained, hybrid quantization can be used for further model compression.
Compared with the existing RL-based framework [Wang et al.2019], our method is more efficient and more practical. Once the MetaQuantNet is well trained, it can be applied under various compression requirements. Moreover, our method has an advantage in realizing extremely low-bit hybrid quantization: the accuracy of an extremely low-bit quantized network cannot be obtained by merely fine-tuning a floating-point network, which makes their RL approach infeasible in this regime.
We show that, in extremely low-bit quantization cases, the hybrid quantization strategy maintains higher accuracy than the traditional uniform quantization policy in our experiments. DNNs can be compressed further by adopting a hybrid quantization policy without significant accuracy degradation.
The searched best hybrid quantization policy varies under different constraints. However, we find that higher bitwidths are preferred for the first layer and the last classification layer, which confirms the common design heuristics of hybrid quantization. Moreover, our experiments show that some layers always need a much lower bitwidth representation across different tasks and constraints.
2 Related Work
Quantization Extensive research has been carried out on low-bit quantization for model compression. [Han et al.2015a] propose a clustering method to push weights to quantized values. BinaryConnect and BinaryWeight networks are proposed in [Courbariaux et al.2015]. Besides, [Rastegari et al.2016] propose the XNOR network to reduce the computation to XNOR and pop-count operations. These methods adopt binary representations of network weights and activations, which compress the networks the most, but the accuracy drops significantly. [Li et al.2016a] suggest using ternary instead of binary weights and adopt a floating-point scaling parameter to preserve performance. Training the quantized network is a main problem for low-bit quantization. Recently, [Choi et al.2018] propose a technique named PACT that clips activations when quantizing both weights and activations during training. They report high accuracy for low-bit quantization of both weights and activations. [Leng et al.2018] model low-bit network training as a discretely constrained optimization problem and use the ADMM method to decouple the continuous parameter updates from the discrete constraints. None of these works considers a hybrid quantization strategy for the target network. Our method for hybrid quantization policy exploration can be combined with their training optimization techniques.
AutoML and Meta Learning AutoML has been widely studied to search for neural network structures and hyper-parameters tailored to a specific task and dataset with minimal human effort. It has achieved considerable success in both vision and language tasks. Existing AutoML works usually rely on genetic algorithms [Real et al.2017], random search [Bergstra and Bengio2012], Bayesian optimization [Snoek et al.2012], reinforcement learning [Zoph and Le2017] or continuous differentiable methods [Liu et al.2018]. In our work, we adopt a genetic algorithm to explore a good hybrid quantization policy for the target network. Our work adopts the same meta learning structure as MetaPruning [Liu et al.2019]. Instead of searching for the pruning structure, we realize automatic low-bit hybrid quantization of target networks.
In this section, we formulate our meta learning method for automatic low-bit hybrid quantization of neural networks under certain compression constraints.
The problem of searching for a hybrid quantization policy of a neural network can be formulated as:

$$\min_{W,\,q_1,\dots,q_L}\ \mathcal{L}\big(Q(W;\,q_1,\dots,q_L);\,D\big) \quad \text{s.t.} \quad \mathrm{Cost}(q_1,\dots,q_L) \le C \qquad (1)$$

where $Q$ is a quantization function that pushes the neural network weights $W$ to nearby quantization levels and $D$ represents the input dataset. We only quantize weights, and our goal is to find the best hybrid quantization policy for the network layers $1$ to $L$ such that the loss $\mathcal{L}$ is minimized under the constraint $C$. $q_i$ stands for the quantization bitwidth of the $i$-th layer. The cost function $\mathrm{Cost}(\cdot)$ expresses the target compression goal, for example that the model size after quantization should be 10 times smaller than the original floating-point model size, or that the energy consumption should be reduced to a certain level on a dedicated hardware accelerator.
3.1 Quantization function
We adopt the commonly used equally distributed quantization function that can be easily adapted to edge computing hardware [Choi et al.2018]. The bitwidth quantization function is defined as:
For back-propagation, the gradient of the quantization function is approximated by the Straight-Through Estimator (STE) [Bengio et al.2013]:
where is manually defined to bound gradients for larger input values.
In our method, a scaling function : is used to normalize arbitrary weights values to at first. The scaling function is defined as:
Then weights are quantized like this:
The gradient of the loss function with respect to the weights is
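The ingredients of this subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a uniform quantizer on [0, 1] with equally spaced levels, a max-abs scaling function, and a clipped straight-through gradient. The names `uniform_quantize`, `quantize_weights`, `ste_grad` and the default `bound` are hypothetical.

```python
import math


def uniform_quantize(x, bits):
    """Round x in [0, 1] to the nearest of 2**bits equally spaced levels."""
    steps = 2 ** bits - 1
    x = min(max(x, 0.0), 1.0)
    # floor(v + 0.5) rounds halves up and avoids Python's banker's rounding.
    return math.floor(x * steps + 0.5) / steps


def quantize_weights(weights, bits):
    """Scale weights to [0, 1], quantize them, and map back to [-1, 1]."""
    # Assumed scaling function: divide by the largest magnitude (guarding 0).
    max_abs = max(abs(w) for w in weights) or 1.0
    scaled = [w / (2.0 * max_abs) + 0.5 for w in weights]
    return [2.0 * uniform_quantize(s, bits) - 1.0 for s in scaled]


def ste_grad(upstream, x, bound=1.0):
    """Straight-through estimator: pass the gradient only where |x| <= bound."""
    return upstream if abs(x) <= bound else 0.0


print(quantize_weights([-0.8, -0.1, 0.3, 0.8], bits=1))
print(quantize_weights([-0.8, -0.1, 0.3, 0.8], bits=2))
```

Under these assumptions, 1 bit reduces to binary weights in {-1, +1}, and 2 bits give four equally spaced levels between -1 and +1.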
3.2 Hybrid quantization procedure
In our method, we adopt the hypernetwork framework [Ha et al.2016] to generate the weights of the target network from a MetaQuantNet. The MetaQuantNet takes the quantization encoding vector (q) of the target network as input and outputs the weights of the target network. The floating-point weights of the MetaQuantNet are the parameters that need to be trained, and its outputs are the generated weights of the target network. To maintain the accuracy of a quantized network, the quantized weights are normally multiplied by a scaling parameter, which can be obtained by minimizing the error between the floating-point weights and the scaled quantized weights, or estimated directly from the weight values as in [Li et al.2016a, Rastegari et al.2016]. Instead, we embed a branch inside each MetaQuantNet block to predict the scaling value, which enables an end-to-end training procedure similar to [Leng et al.2018].
As shown in Figure 2, the MetaQuantNet block is a three-layer fully connected (FC) network with the commonly used ReLU activation function. The first layer takes the quantization bitwidth (q) as input. The outputs of the second hidden layer are divided into two parts. One part is connected to a third FC layer followed by the quantization function, which outputs the weights. The other part is connected to another third-layer branch that outputs a single value as the scaling parameter. Finally, we use the scaled quantized weights, after reshaping, as the quantized weights of the corresponding layer. This block structure is quite similar to the two-stream dueling network structure in [Wang et al.2015].
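The two-branch block described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the hidden width, the random initialization, the max-abs quantizer and the class name `MetaQuantBlock` are hypothetical, training is omitted, and the final reshape to the layer's tensor shape is left out.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, layer):
    """Apply a fully connected layer given as (weight rows, biases)."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng):
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
              for _ in range(n_out)]
    return weight, [0.0] * n_out


def quantize_weights(weights, bits):
    """Assumed quantizer: max-abs scaling, uniform levels, output in [-1, 1]."""
    steps = 2 ** bits - 1
    max_abs = max(abs(w) for w in weights) or 1.0
    out = []
    for w in weights:
        scaled = w / (2.0 * max_abs) + 0.5
        out.append(2.0 * math.floor(scaled * steps + 0.5) / steps - 1.0)
    return out


class MetaQuantBlock:
    """Three FC layers; the second layer's output feeds two branches."""

    def __init__(self, num_weights, hidden=8, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.fc1 = init_linear(1, hidden, rng)
        self.fc2 = init_linear(hidden, 2 * hidden, rng)
        self.fc_weight = init_linear(hidden, num_weights, rng)  # weight branch
        self.fc_scale = init_linear(hidden, 1, rng)             # scale branch

    def forward(self, bits):
        h = relu(linear([float(bits)], self.fc1))
        h = relu(linear(h, self.fc2))
        weight_part, scale_part = h[:self.hidden], h[self.hidden:]
        raw = linear(weight_part, self.fc_weight)
        alpha = linear(scale_part, self.fc_scale)[0]
        # Scaled quantized weights; reshaping to the layer shape is omitted.
        return [alpha * w for w in quantize_weights(raw, bits)], alpha


block = MetaQuantBlock(num_weights=6)
layer_weights, alpha = block.forward(bits=2)
```

One such block would be instantiated per target layer, each receiving its own entry of the quantization encoding vector.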
In the first training stage, the training data is fed to the target network, while randomly generated quantization vectors are fed to the MetaQuantNet. The objective function in Eq.(1) is the cross-entropy loss between the target network outputs and the ground-truth labels. The weights of the MetaQuantNet are updated by minibatch stochastic gradient descent (SGD) with weight decay as regularization.
In the second search stage, since the search space is huge, we adopt an evolutionary algorithm similar to [Liu et al.2019]. During the search, we choose the best results that meet the constraints as parent genes to generate offspring. The search algorithm can be easily adapted to our hard-constrained optimization problem: only the hybrid quantization policies that meet the constraints are retained. Other search algorithms, such as reinforcement learning, can also be used in this stage. The main difference between such an RL search and HAQ [Wang et al.2019] is that HAQ needs to fine-tune a pretrained floating-point network to obtain the accuracy of each quantized network, while we only need to generate weights with the MetaQuantNet and run inference on the target network. Hence, our method is much more efficient.
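The constrained evolutionary search of this stage can be sketched as below. This is a minimal illustration, not the authors' algorithm: the population size, mutation probability, crossover rule and the function name `evolutionary_search` are assumptions, and `evaluate` stands in for the validation accuracy of the target network with MetaQuantNet-generated weights. In the usage example a toy proxy (the sum of bitwidths) and hypothetical layer sizes are used instead.

```python
import random


def evolutionary_search(evaluate, feasible, num_layers, low=1, high=8,
                        population=20, top_n=5, generations=10,
                        mutate_prob=0.1, seed=0):
    """Search bitwidth vectors; only feasible candidates are ever kept."""
    rng = random.Random(seed)
    fallback = [low] * num_layers
    if not feasible(fallback):
        raise ValueError("constraint cannot be met even at the lowest bitwidth")

    def random_policy():
        return [rng.randint(low, high) for _ in range(num_layers)]

    def mutate():
        parent = rng.choice(parents)
        return [rng.randint(low, high) if rng.random() < mutate_prob else b
                for b in parent]

    def crossover():
        a, b = rng.sample(parents, 2)
        return [rng.choice(pair) for pair in zip(a, b)]

    def draw(make, tries=200):
        # Rejection sampling with a bounded number of attempts.
        for _ in range(tries):
            candidate = make()
            if feasible(candidate):
                return candidate
        return list(fallback)

    candidates = [draw(random_policy) for _ in range(population)]
    for _ in range(generations):
        # Keep the top-N feasible policies as parent genes.
        parents = sorted(candidates, key=evaluate, reverse=True)[:top_n]
        children = [draw(mutate if rng.random() < 0.5 else crossover)
                    for _ in range(population - top_n)]
        candidates = parents + children
    return max(candidates, key=evaluate)


layer_sizes = [100, 1000, 1000, 10]  # hypothetical weight counts per layer


def meets_10x(bits):
    """Hard constraint: quantized size at most 1/10 of the 32-bit size."""
    quantized = sum(n * b for n, b in zip(layer_sizes, bits))
    return 32 * sum(layer_sizes) >= 10 * quantized


best = evolutionary_search(evaluate=sum, feasible=meets_10x,
                           num_layers=len(layer_sizes))
print(best)
```

Because the constraint is passed in as a predicate, changing the compression target only requires a different `feasible` function; the trained hypernetwork behind `evaluate` would stay the same.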
Finally, in the third stage, with the searched best hybrid quantization policy as input, we can either retrain the MetaQuantNet from scratch or simply fine-tune it to further improve the performance of the target quantized network. Fine-tuning already gives good performance, but retraining from scratch can avoid poor local minima in practice.
4 Experimental Results
In this section, we conduct extensive experiments to verify the effectiveness of the proposed method on two popular image classification datasets: CIFAR-10 and CIFAR-100 [Krizhevsky et al.2014].
4.1 Implementation Details
for CIFAR-10, and CIFAR-100. We use VGG16 with batch normalization (VGG16bn) and modify the structure by using one FC layer instead of the original three FC layers. Hence the total numbers of layers are 20 and 14 for ResNet-20 and VGG16bn, respectively. The VGG16bn-small model has the same structure as VGG16bn but with 4x fewer filters in each layer.
For the training stage, we use stochastic gradient descent (SGD) with momentum 0.9 and weight decay. The learning rate starts at 0.1 and is halved every 30 epochs after the first 60 epochs. Training takes 200 epochs in total for each of the full-precision training, quantized training and retraining stages.
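The learning-rate schedule described above can be written as a small function. This sketch rests on one interpretation of the text, namely that the rate is held at 0.1 for epochs 0 to 59 and the first halving happens at epoch 60; the text could also be read as placing the first halving at epoch 90. The function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, hold_epochs=60, step=30, factor=0.5):
    """Constant for the first `hold_epochs`, then multiplied by `factor`
    at epoch `hold_epochs` and again every `step` epochs afterwards."""
    if epoch < hold_epochs:
        return base_lr
    num_decays = (epoch - hold_epochs) // step + 1
    return base_lr * factor ** num_decays


# Learning rate for each of the 200 training epochs.
schedule = [learning_rate(epoch) for epoch in range(200)]
```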
4.2 Model-Size Constrained Quantization
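| Bitwidth | Compression ratio | ResNet-20 size | ResNet-20 accuracy | VGG16bn size | VGG16bn accuracy | VGG16bn-small size | VGG16bn-small accuracy |
|---|---|---|---|---|---|---|---|
| 1bit | 32x | 0.034 | 87.73% | 1.840 | 90.58% | 0.115 | 80.90% |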
Since we focus on weight quantization of DNNs, we study the model-size constraint for compression in our experiments for simplicity. The compression ratio is defined as the ratio between the floating-point model size and the low-bit quantized model size. For N-bit uniform quantization, the model size is roughly compressed to N/32 of the floating-point size and the compression ratio is 32/N. Hence a 32x compression ratio is the upper bound, reached when using 1 bit for all the weights. We mainly focus on low-bit quantization, and the bitwidths are limited to [1,8] bits. In our experiments, we target four different compression ratios: 10x, 16x, 20x and 25x. For the 10x case, we can easily generate the hybrid quantization vectors for search and training in [1,8] bits. For higher compression ratios, we need to narrow the search space to [1,5] bits for the 16x compression ratio, and to [1,3] bits for even higher compression ratios.
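The model-size compression ratio of a hybrid policy can be computed with a short helper. The sketch assumes that only weights are counted, that the floating-point baseline uses 32 bits per weight, and that extra parameters such as scaling factors are ignored; the function name and the layer sizes in the example are hypothetical.

```python
def compression_ratio(bits, layer_sizes, float_bits=32):
    """Ratio of float model size to quantized size, counting weights only."""
    if len(bits) != len(layer_sizes):
        raise ValueError("one bitwidth per layer is required")
    float_size = float_bits * sum(layer_sizes)
    quantized_size = sum(n * b for n, b in zip(layer_sizes, bits))
    return float_size / quantized_size


layer_sizes = [1000, 3000]  # hypothetical weight counts per layer
print(compression_ratio([2, 2], layer_sizes))  # uniform 2-bit policy
print(compression_ratio([8, 2], layer_sizes))  # hybrid policy
```

With a uniform bitwidth the helper depends only on the bitwidth, while a hybrid policy weights each layer's bitwidth by its number of parameters, which is why hybrid policies can reach ratios between the uniform ones.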
Table 1 shows the experimental results on CIFAR-10 for ResNet-20, VGG16bn and VGG16bn-small. We compare uniform and hybrid quantization under the same MetaQuantNet framework. We also give the baseline results of the Binary [Courbariaux et al.2015] and Ternary [Li et al.2016a] quantization methods and of the floating-point case. We use consistent hyperparameters for the Binary and Ternary networks. For the Binary network, we also introduce a learnable scaling parameter to scale the weights, which is different from the original work [Courbariaux et al.2015].
For clear comparison, we illustrate the performance of uniform and hybrid quantization under different compression ratios in Figure 3. The solid lines represent uniform quantization and the dashed lines stand for hybrid quantization. The markers correspond to the results in Table 1. We can clearly observe that with the hybrid quantization policy, the accuracy of the quantized model drops much more slowly than with uniform quantization.
With increasing compression ratio, the top-1 accuracy gradually drops in both cases. However, the searched hybrid quantization has much higher accuracy than uniform quantization. Especially at higher compression ratios, hybrid quantization shows a strong capability to maintain the accuracy of compressed models. Moreover, the uniform quantization policy can only realize discrete compression ratios of 10x, 10.3x, 16x and 32x, whereas hybrid quantization can realize continuous compression ratios between 16x and 32x, a range in which much compression space still deserves exploration. Hence only hybrid quantization offers the ability to achieve deeper compression with higher accuracy in extremely low-bit quantization.
Besides, for the CIFAR-10 task we find that both ResNet-20 and VGG16bn have sufficient representation capacity to maintain high accuracy even in the 1-bit quantization case. Hence, we also adopt a VGG16bn-small model, which behaves poorly under low-bit quantization on CIFAR-10. VGG16bn-small has four times fewer filters in each layer than VGG16bn. In Table 1, its accuracy drops significantly when using 2-bit or 1-bit uniform quantization, but hybrid quantization still keeps better performance even under the 20x compression ratio. This trend is clearly demonstrated in Figure 3.
CIFAR-100 is a much harder task than CIFAR-10, and we use VGG16bn and Wide ResNet-20 (WRN-20) as target quantized networks. WRN-20 has the same structure as ResNet-20 but we widen the convolutional layers by adding two more feature planes, which means the widen factor in our experiment. Table 2 shows the results of the uniform and hybrid quantization strategies under various compression ratios. We also give the baseline results of the Binary [Courbariaux et al.2015] and Ternary [Li et al.2016a] quantization methods and of the floating-point case. From the results, we reach the same conclusion as on CIFAR-10: the hybrid quantization policy keeps better accuracy under higher compression ratios, and the top-1 accuracy drops much more slowly than with uniform quantization as the model compression increases.
Furthermore, we visualize the searched best hybrid quantization policies under various compression ratios for both the VGG16bn and ResNet-20 networks in Figure 4. Figure 4(a) and (c) show the searched best hybrid quantization policies of the VGG16bn network on CIFAR-10 and CIFAR-100. Figure 4(b) and (d) show those of the ResNet-20 network on CIFAR-10 and CIFAR-100. For fair comparison, we normalize the bitwidths of the quantization encoding vector to bitwidth ratios under the four different compression ratios. The markers represent the normalized bitwidth for each layer of the target network, and the solid lines are the corresponding averages. To our surprise, even though the best hybrid quantization policy varies across tasks and constraints, the distribution of the quantization encoding vector shows a common pattern. The average lines show the overall trend of bitwidth importance for each layer.
From Figure 4, we observe that the search prefers to keep a higher bitwidth, i.e. higher precision, in the first and last layers of both the VGG16bn and ResNet-20 networks, regardless of whether the task is CIFAR-10 or CIFAR-100. This result confirms the common rule-based heuristics of quantization, such as retaining more bits in the first layer, which extracts low-level features from raw inputs and is vital to the following layers, and assigning more bits to the last layer, which directly computes the final outputs. Besides, the last three layers, excluding the final one, seem less important than the other layers: they need only the lowest bitwidth representation in all compression scenarios. To our knowledge, such a property has not been reported before. Hence, an automatic way to realize hybrid quantization of target networks is vital for such problems.
In this work, we propose a meta learning method to realize low-bit hybrid quantization of neural networks automatically. The searched best hybrid quantization policy shows much better performance than uniform quantization. The MetaQuantNet training, search and retraining framework inherits the advantages of meta learning and is more efficient and flexible than the reinforcement learning based method. Moreover, even though the hybrid quantization policy varies under different constraints, the results show that higher bitwidths are preferred in the first layer and the last classification layer, which confirms the commonly used design heuristics of quantization.
- [Bengio et al.2013] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [Bergstra and Bengio2012] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.
- [Choi et al.2018] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
- [Courbariaux et al.2015] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
- [Courbariaux et al.2016] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
- [Girshick et al.2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In , pages 580–587, 2014.
- [Ha et al.2016] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- [Han et al.2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [Han et al.2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
- [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [Hinton et al.2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [Krizhevsky et al.2014] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset. online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.
- [Leng et al.2018] Cong Leng, Zesheng Dou, Hao Li, Shenghuo Zhu, and Rong Jin. Extremely low bit neural network: Squeeze the last bit out with admm. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [Li et al.2016a] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
- [Li et al.2016b] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- [Li2017] Hang Li. Deep learning for natural language processing: advantages and challenges. National Science Review, 2017.
- [Liu et al.2018] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
- [Liu et al.2019] Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. Metapruning: Meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258, 2019.
- [Mnih et al.2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [Rastegari et al.2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
- [Real et al.2017] E. Real, S. Moore, A. Selle, S. Saxena, Y.L. Suematsu, J. Tan, Q.V. Le, et al. Large-scale evolution of image classifiers. In ICML, 2017.
- [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [Snoek et al.2012] J. Snoek, H. Larochelle, and R.P. Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.
- [Wang et al.2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- [Wang et al.2018] Junsong Wang, Qiuwen Lou, Xiaofan Zhang, Chao Zhu, Yonghua Lin, and Deming Chen. Design flow of accelerating hybrid extremely low bit-width neural network in embedded fpga. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL), pages 163–1636. IEEE, 2018.
- [Wang et al.2019] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8612–8620, 2019.
- [Zoph and Le2017] B. Zoph and Q.V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.