1 Introduction
State-of-the-art neural networks have demonstrated promising performance in tasks such as image classification and object detection [1, 2, 3, 4]. These network models are designed for high accuracy with little consideration of computational cost and inference delay. Thus, deploying them on resource-constrained platforms such as mobile phones is usually inefficient or even infeasible. Even the recently proposed MobileNet [5], which is customized for mobile platforms, has a relatively large 4.24M parameters. Extensive studies have been carried out in developing network models for resource-constrained platforms. Quantization is one of the most popular approaches [6, 7]. For example, Bai et al. recently proposed proximal operators for training low-precision deep neural networks [8]. Courbariaux et al. proposed a radical binary representation of the inputs, weights, and activations [6]. Rastegari et al. [7] theoretically analyzed the binary network and introduced a scaling scheme for their XNOR-Net, a network based on [6] with higher accuracy. Despite the significant improvement in inference accuracy, none of the above networks achieves accuracy comparable to its full-precision counterpart.
We observe that different layers may have different accuracy sensitivity to quantization; thus a fine-grained quantization for each layer has the potential to preserve accuracy under the same compression rate (defined as the ratio of the original model size to the compressed model size) compared to traditional coarse-grained quantization, which uses the same bit-width for the entire model. The dotted line in Figure 1 shows the tradeoff between accuracy and compression rate under traditional quantization for VGG16. Our goal is to push the tradeoff between accuracy and compression rate into the shaded region of Figure 1 to achieve better compression efficiency, i.e., higher accuracy under the same compression rate. To achieve this, we propose a fine-grained quantization approach that relaxes the search space of quantization bit-widths from the discrete to the continuous domain and applies gradient descent optimization to generate the best quantization scheme for each layer, i.e., it applies a lower bit-width to layers that are less sensitive to quantization while preserving high bit precision for quantization-sensitive layers. Our experimental results show that the proposed approach outperforms the accuracy of traditional quantization methods under the same compression rate.
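To make the compression-rate definition above concrete, the following is a minimal sketch of how the rate could be computed for a layer-wise bit-width assignment; the layer sizes and bit-widths are hypothetical illustrative values, not taken from this work.

```python
# Sketch: compression rate of a mixed-precision scheme relative to a
# 32-bit full-precision model. Layer parameter counts and bit-widths
# below are hypothetical, for illustration only.

def compression_rate(layer_params, layer_bits, full_bits=32):
    """Ratio of original model size to compressed model size."""
    original = sum(n * full_bits for n in layer_params)
    compressed = sum(n * b for n, b in zip(layer_params, layer_bits))
    return original / compressed

# Example: three layers; the (less sensitive) middle layer is binarized,
# the others kept at 8-bit.
params = [100_000, 1_000_000, 100_000]
bits = [8, 1, 8]
print(round(compression_rate(params, bits), 2))  # prints 14.77
```

Note that binarizing the largest layer dominates the savings, which is why a fine-grained, per-layer assignment can reach compression rates close to fully binary quantization.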
2 Proposed Approach
In this section, we propose a methodology to judiciously determine the best quantization scheme for each layer based on each layer's accuracy sensitivity to quantization. For ease of description, we use only two-level quantization, binary and 8-bit, as an example to illustrate our approach and conduct a preliminary evaluation; it is straightforward to extend to more quantization levels. We relax the discrete variables to a continuous domain, whose finer granularity provides a more accurate indication during the quantization search. We adopt a gradient-descent-based search algorithm, as it is fast and can be easily deployed in different machine learning frameworks.
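As a concrete reference point, the two quantization levels used in this example can be sketched as follows. These are common formulations (sign binarization as in [6], symmetric uniform 8-bit); the exact quantizers used in our implementation may differ.

```python
import numpy as np

# Illustrative sketch of the two quantization levels (binary and 8-bit).
# These are standard formulations, assumed for illustration.

def quantize_binary(x):
    """Binarize to {-1, +1}."""
    return np.where(x >= 0, 1.0, -1.0)

def quantize_8bit(x):
    """Symmetric uniform 8-bit quantization (255 levels)."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127) * scale

x = np.array([-0.8, -0.05, 0.3, 1.2])
print(quantize_binary(x))  # [-1. -1.  1.  1.]
print(quantize_8bit(x))    # close to x, within half a quantization step
```

The binary quantizer discards all magnitude information (hence its higher accuracy loss on sensitive layers), while the 8-bit quantizer keeps the reconstruction error below half a quantization step.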
We use the Softmax function to relax the search space from discrete to continuous. We denote the output of layer $l$ with continuous relaxation as $\bar{o}_l$, and the outputs of binary and 8-bit quantization as $o_l^{(1)}$ and $o_l^{(8)}$, respectively. The Softmax of the architecture parameters $\alpha_l^{(1)}$ and $\alpha_l^{(8)}$ can be interpreted as the probabilities of binary and 8-bit quantization, respectively. Thus $\bar{o}_l$ can be computed as

$$\bar{o}_l = \mathrm{BN}\left(\frac{e^{\alpha_l^{(1)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, o_l^{(1)} + \frac{e^{\alpha_l^{(8)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, o_l^{(8)}\right), \quad (1)$$

where $\mathrm{BN}(\cdot)$ is the batch normalization operation. The output $\bar{o}_l$ is used as the input of the following layer. The search space for a network with $N$ layers is $\alpha = \{\alpha_l^{(1)}, \alpha_l^{(8)}\}_{l=1}^{N}$. To explore the tradeoff between different quantization schemes, we model the target objective function as

$$\min_{\alpha} \; S(\alpha), \quad (2)$$
$$\text{s.t.} \quad \mathcal{L}(w, \alpha) \le \bar{\mathcal{L}}, \quad (3)$$

where

$$S(\alpha) = \sum_{l=1}^{N} \left(\frac{e^{\alpha_l^{(1)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, s_l^{(1)} + \frac{e^{\alpha_l^{(8)}}}{e^{\alpha_l^{(1)}} + e^{\alpha_l^{(8)}}}\, s_l^{(8)}\right). \quad (4)$$

Here $S(\alpha)$ represents the model size, with $s_l^{(1)}$ and $s_l^{(8)}$ the sizes of layer $l$ under binary and 8-bit quantization; $\mathcal{L}$ is the cross-entropy loss; $\bar{\mathcal{L}}$ is the expected maximum loss; $w$ denotes the weights of the model; and $\alpha_l^{(\cdot)}$ represents the coefficient of either quantization method (binary or 8-bit) in a certain layer. In our model, (3) is the constraint for optimization problem (2). We can rewrite the above as a bilevel optimization problem:

$$\min_{\alpha} \max_{\lambda \ge 0} \; S(\alpha) + \lambda \left(\mathcal{L}(w^*(\alpha), \alpha) - \bar{\mathcal{L}}\right) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \mathcal{L}(w, \alpha). \quad (5)$$
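A minimal end-to-end sketch of this constrained search on a toy three-layer model is given below. The per-layer sizes, the sensitivity-based surrogate for the training loss, and all step sizes are illustrative assumptions, not the values used in our experiments; the point is only to show the Softmax relaxation of the model size and the alternating multiplier/parameter updates.

```python
import numpy as np

# Toy sketch of the layer-wise search: the multiplier `lam` is raised
# whenever the loss budget is exceeded (dual ascent), and the relaxed
# architecture parameters `alpha` descend the Lagrangian by numeric
# gradient descent. SIZES, SENS, and all hyperparameters are assumed
# illustrative values.

SIZES = np.array([[1.0, 8.0]] * 3)   # per-layer size under (binary, 8-bit)
SENS = np.array([0.1, 1.0, 0.1])     # accuracy sensitivity to binarization

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def model_size(alpha):               # Softmax-weighted model size
    return float((softmax(alpha) * SIZES).sum())

def surrogate_loss(alpha):           # stand-in for the cross-entropy loss
    return float((softmax(alpha)[:, 0] * SENS).sum())

def search(loss_budget, steps=2000, lr=0.5, lam_lr=0.1, eps=1e-4):
    alpha, lam = np.zeros((3, 2)), 0.0
    for _ in range(steps):
        # Dual ascent: lam grows while the loss budget is violated.
        lam = max(0.0, lam + lam_lr * (surrogate_loss(alpha) - loss_budget))
        lagrangian = lambda a: model_size(a) + lam * surrogate_loss(a)
        base = lagrangian(alpha)
        grad = np.zeros_like(alpha)  # forward-difference gradient
        for idx in np.ndindex(*alpha.shape):
            a2 = alpha.copy()
            a2[idx] += eps
            grad[idx] = (lagrangian(a2) - base) / eps
        alpha = np.clip(alpha - lr * grad, -2.5, 2.5)
    return alpha

probs = softmax(search(loss_budget=0.2))
# Insensitive layers lean toward binary; the sensitive layer toward 8-bit.
```

Under a tight loss budget, the two insensitive layers settle on binary quantization while the sensitive middle layer is kept at 8-bit, mirroring the intended fine-grained behavior.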
To solve this bilevel optimization problem, we adopt the approximate algorithm in [9]. First, we retrain the network to find the weights that result in the minimal loss on the training set. Then the Lagrange multiplier problem is solved with the weights fixed. As shown in Algorithm 1, solving the Lagrange multiplier problem starts with maximizing the target function w.r.t. $\lambda$: if $\mathcal{L} \le \bar{\mathcal{L}}$, $\lambda$ approaches 0; otherwise $\lambda$ approaches infinity. Here $\bar{\mathcal{L}}$ is a tunable hyperparameter representing the tolerance of accuracy drop. A smaller $\bar{\mathcal{L}}$ tolerates less accuracy drop but may also result in a smaller compression rate, while a larger $\bar{\mathcal{L}}$ can potentially achieve a higher compression rate at the cost of a larger accuracy drop. We set $\bar{\mathcal{L}}$ using the expected (target) loss of the full-precision model. Finally, we minimize the target function w.r.t. $\alpha$. Once we obtain the hyperparameter set with the best tradeoff, we retrain with the selected quantization scheme and fine-tune the quantized weights to generate the final network model.
3 Experimental Evaluation
We evaluate our proposed methodology on a pretrained 2-layer depthwise separable convolutional neural network using the MNIST dataset, as well as on the VGG16 model using the CIFAR-10 dataset. For each model, we compare our approach with the following baselines: the 32-bit floating-point (full-precision) model, the 8-bit fixed-precision model, and the binary fixed-precision model. As shown in Table 1, the results of the MNIST experiment suggest that our algorithm is capable of finding a quantization scheme that achieves a 28x compression rate while keeping the accuracy drop below 0.5%. In the CIFAR-10 experiment, we set $\bar{\mathcal{L}}$ to 0.6 for VGG16. Compared to fully binary quantization, our approach obtains a compression rate that is very close to binary quantization while gaining 1.5% more accuracy. Figure 2 shows the memory consumption of our model and the original 32-bit full-precision model. The memory usage is dramatically decreased, especially at the middle layers. It is worth mentioning that our method is orthogonal to weight pruning. Combined with state-of-the-art pruning methods [10, 11], which achieve approximately 30x compression rate, the overall compression rate can be up to approximately 900x.

Table 1: Compression rate (Comp.) and accuracy (Accu.) on MNIST and CIFAR-10.

Quant.    MNIST Comp.   MNIST Accu.(%)   CIFAR-10 Comp.   CIFAR-10 Accu.(%)
float32   1             98.66            1                84.80
8-bit     4             98.48            4                84.07
ours      28            98.20            30               83.06
binary    32            96.34            32               81.56
4 Conclusion and Ongoing Work
In this paper, we propose a differentiable mixed-precision search method for compressing deep neural networks efficiently. Unlike traditional quantization methods, our approach relaxes quantization bit-widths to a continuous domain and couples them with the loss function. With our methodology, deep neural networks can be quantized either from the start of the training phase or from a pretrained model. Moreover, our approach ensures that the quantized model maintains similar accuracy while being compressed by up to 30x.
The proposed methodology is not tied to any specific neural network topology, so it can potentially be extended to mixed-precision quantization of different neural network architectures, such as RNNs and LSTMs. We are currently working on providing more quantization options for each layer, so that each layer can be quantized to additional bit-widths beyond binary and 8-bit. These new quantization options drastically increase the search space. Therefore, we plan to design a predictor combined with an encoder-decoder architecture to expedite the search process of layer-wise quantization.
Acknowledgement
This work is supported in part by the following grants: National Science Foundation CCF-1756013, IIS-1838024, and 1717657, and Air Force Research Laboratory FA8750-18-2-0057.
References
 [1] C. Szegedy et al. Going deeper with convolutions. In CVPR, 2015.
 [2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

 [3] K. He et al. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [4] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
 [5] A. Howard et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [6] M. Courbariaux et al. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

 [7] M. Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
 [8] Y. Bai, Y.-X. Wang, and E. Liberty. ProxQuant: Quantized neural networks via proximal operators. arXiv preprint arXiv:1810.00861, 2018.
 [9] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [10] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang. ADAM-ADMM: A unified, systematic framework of structured weight pruning for DNNs. arXiv preprint arXiv:1807.11091, 2018.
 [11] S. Ye et al. Progressive weight pruning of deep neural networks using ADMM. arXiv preprint arXiv:1810.07378v1, 2018.