1. Introduction
Quantization is crucial for deep learning inference on mobile and IoT platforms, which have very limited power and memory budgets. Such platforms often rely on fixed-point computational hardware blocks, such as Digital Signal Processors (DSPs), to achieve higher power efficiency than floating-point processors such as GPUs. For existing DL models such as VGGNet (2), GoogleNet (3), and ResNet (4), quantization may not impact inference accuracy thanks to their over-parameterized design, but those models are difficult to deploy on mobile platforms due to their large computation latency. Many lightweight networks, in contrast, trade accuracy for efficiency by replacing conventional convolution with depthwise separable convolution, as shown in Figure 1(a)(b). For example, the MobileNets proposed by Google (1) drastically shrink parameter size and memory footprint, and are thus increasingly popular on mobile platforms. The downside is that the separable convolution core layer in MobileNetV1 causes a large quantization loss, resulting in significant feature-representation degradation in the 8-bit inference pipeline.
To demonstrate the quantization issue, we selected the TensorFlow implementations of MobileNetV1 (6) and InceptionV3 (7), and compared their accuracy on the float pipeline against the 8-bit quantized pipeline. The results are summarized in Table 1. The top-1 accuracy of InceptionV3 drops only slightly after applying 8-bit quantization, while the accuracy loss is significant for MobileNetV1.

Networks  Float Pipeline  8-bit Pipeline  Comments

InceptionV3  78.00%  76.92%  Only standard convolution
MobileNetV1  70.50%  1.80%  Mainly separable convolution
There are a few ways to potentially address this issue. The most straightforward approach is quantization with more bits; for example, increasing from 8-bit to 16-bit quantization could boost the accuracy (14), but this is largely limited by the capability of the target platform. Alternatively, we could retrain the network to generate a dedicated quantized model for fixed-point inference. Google proposed a quantized training framework (5) co-designed with the quantized inference to minimize the accuracy loss from quantization on inference models. The framework simulates quantization effects in the forward pass of training, whereas backpropagation still uses the float pipeline. This retraining framework reduces the quantization loss specifically for the fixed-point pipeline, at the cost of extra training; the system also needs to maintain multiple models for different platforms.
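The simulate-then-backpropagate idea can be sketched as follows. This is our own NumPy illustration of "fake quantization", not Google's actual implementation; the function name and the fixed min/max range are assumptions for the sketch.

```python
import numpy as np

def fake_quantize(x, x_min, x_max, bits=8):
    """Simulated quantization for the forward pass of quantized training:
    snap x onto the 8-bit grid, then immediately dequantize, so downstream
    layers see the rounding error. Backpropagation treats this op as
    identity (a straight-through estimator)."""
    levels = 2 ** bits - 1
    delta = (x_max - x_min) / levels              # quantization step size
    q = np.clip(np.round((x - x_min) / delta), 0, levels)
    return x_min + q * delta                      # back to float, error included
```

During training, the float weights are kept and updated as usual; only the forward pass routes activations and weights through such a fake-quantization op, so the network learns to tolerate the rounding error it will see at fixed-point inference time.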
In this paper, we focus on a new architecture design for the separable convolution layer to build lightweight quantization-friendly networks. The proposed architecture requires only a single training pass in the float pipeline, and the trained model can then be deployed to different platforms with float or fixed-point inference pipelines with minimal accuracy loss. To achieve this, we look deeply into the root causes of the accuracy degradation of MobileNetV1 in the 8-bit inference pipeline. Based on the findings, we propose a re-architected, quantization-friendly MobileNetV1 that maintains a competitive accuracy in the float pipeline, and a much higher inference accuracy in the quantized 8-bit pipeline. Our main contributions are:

We identified that batch normalization and ReLU6 are the major root causes of quantization loss in MobileNetV1.

We proposed a quantization-friendly separable convolution, and empirically proved its effectiveness on MobileNetV1 in both the float pipeline and the fixed-point pipeline.
2. Quantization Scheme and Loss Analysis
In this section, we explore the TensorFlow (TF) (8) 8-bit quantized MobileNetV1 model and find the root cause of the accuracy loss in the fixed-point pipeline. Figure 2 shows a typical 8-bit quantized pipeline. A TF 8-bit quantized model is generated directly from a pre-trained float model, with all weights quantized offline first. During inference, any float input is quantized to an 8-bit unsigned value before being passed to a fixed-point runtime operation, such as QuantizedConv2d, QuantizedAdd, or QuantizedMul. These operations produce a 32-bit accumulated result, which is converted down to an 8-bit output through an activation re-quantization step. Note that this output is the input to the next operation.
2.1. TensorFlow 8-bit Quantization Scheme
TensorFlow 8-bit quantization uses a uniform quantizer, in which all quantization steps are of equal size. Let x_float represent the float value of signal x; the TF 8-bit quantized value, denoted x_TF8, can be calculated as:

(1) x_TF8 = round(x_float / Δ_x) + δ_x

(2) Δ_x = (x_max − x_min) / (2^B − 1)

where Δ_x represents the quantization step size; B is the bit-width, i.e., B = 8; and δ_x is the offset value such that the float value 0 is exactly represented. x_min and x_max are the min and max values of x in the float domain, and round() represents the nearest rounding operation. In the TensorFlow implementation, it is defined as

(3) round(x) = sgn(x) · ⌊|x| + 0.5⌋

where sgn(x) is the sign of the signal x, and ⌊·⌋ represents the floor operation.
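Equations (1)-(3) can be sketched in NumPy as follows. This is a minimal illustration under our own naming; `tf8_quantize` and `tf8_dequantize` are not real TensorFlow functions.

```python
import numpy as np

def tf8_quantize(x_float, x_min, x_max, bits=8):
    """Uniform quantizer sketch following Eqs. (1)-(3)."""
    levels = 2 ** bits - 1
    delta = (x_max - x_min) / levels           # step size, Eq. (2)
    offset = np.round(-x_min / delta)          # offset so that float 0 is exact
    # nearest rounding as in Eq. (3): sgn(x) * floor(|x| + 0.5)
    q = np.sign(x_float) * np.floor(np.abs(x_float) / delta + 0.5) + offset
    return np.clip(q, 0, levels).astype(np.uint8), delta, offset

def tf8_dequantize(q, delta, offset):
    """Map 8-bit codes back to float: x_float ~= delta * (q - offset)."""
    return delta * (q.astype(np.float32) - offset)
```

A round trip through the quantizer bounds the error by half a quantization step, and the offset guarantees that float 0.0 maps to an exact grid point.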
Based on the definitions above, the accumulated result of a convolution between weights w and inputs x is computed by:

(4) A = Σ_i (w_TF8(i) − δ_w) · (x_TF8(i) − δ_x), so that the float output y_float ≈ Δ_w Δ_x · A

Finally, given known min and max values of the output y, by combining equations (1) and (4), the re-quantized output can be calculated by multiplying the accumulated result A by Δ_w Δ_x / Δ_y and then adding the output offset δ_y:

(5) y_TF8 = round((Δ_w Δ_x / Δ_y) · A) + δ_y
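Under the same conventions, the accumulate-and-requantize step of Eqs. (4)-(5) might look like the following NumPy sketch. The function name and argument layout are hypothetical, not the TensorFlow kernel.

```python
import numpy as np

def quantized_dot(w_q, x_q, w_off, x_off, w_delta, x_delta,
                  out_min, out_max, bits=8):
    """Fixed-point accumulate (Eq. 4) plus requantization (Eq. 5)."""
    # 32-bit accumulator over offset-corrected 8-bit values
    acc = np.sum((w_q.astype(np.int32) - int(w_off)) *
                 (x_q.astype(np.int32) - int(x_off)))
    levels = 2 ** bits - 1
    out_delta = (out_max - out_min) / levels
    out_off = np.round(-out_min / out_delta)
    # rescale the accumulator into the output grid, then add the output offset
    q = np.round(acc * (w_delta * x_delta) / out_delta) + out_off
    return np.clip(q, 0, levels).astype(np.uint8)
```

Note that the per-layer output min and max must be known ahead of time (collected offline), so that Δ_y and δ_y are constants at inference.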
2.2. Metric for Quantization Loss
As depicted in Figure 2, there are five types of loss in the fixed-point quantized pipeline: input quantization loss, weight quantization loss, runtime saturation loss, activation re-quantization loss, and possible clipping loss for certain nonlinear operations, such as ReLU6. To better understand the contribution of each type, we use the Signal-to-Quantization-Noise Ratio (SQNR), defined as the power of the unquantized signal divided by the power of the quantization error, as a metric to evaluate the quantization accuracy at each layer output:

(6) SQNR = 10 · log10( E[x_float²] / E[n²] ) dB
Since the average magnitude of the input signal is much larger than the quantization step size Δ_x, it is reasonable to assume that the quantization error n is zero mean with a uniform distribution whose probability density function (PDF) integrates to 1 (10). Therefore, for a B-bit linear quantizer, the noise power can be calculated by

(7) E[n²] = Δ_x² / 12

Substituting equations (2) and (7) into equation (6), we get

(8) SQNR = 10 · log10( 12 · (2^B − 1)² · E[x_float²] / (x_max − x_min)² ) dB
SQNR is tightly coupled with the signal distribution. From equation (8), it is obvious that SQNR is determined by two terms: the power of the signal x_float, and the quantization range (x_max − x_min). Therefore, increasing the signal power or decreasing the quantization range helps to increase the output SQNR.
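To make Eqs. (6)-(8) concrete, the following sketch (our own code, with assumed names) compares the empirical SQNR of a uniformly quantized signal against the Δ²/12 noise model:

```python
import numpy as np

def sqnr_db(x, x_q):
    """Empirical SQNR (Eq. 6): signal power over quantization-noise power, in dB."""
    noise = x - x_q
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

def predicted_sqnr_db(x, x_min, x_max, bits=8):
    """Model SQNR (Eq. 8): assumes noise power delta^2 / 12 for a uniform quantizer."""
    delta = (x_max - x_min) / (2 ** bits - 1)
    return 10 * np.log10(np.mean(x ** 2) / (delta ** 2 / 12))
```

For a signal that fills its range fairly uniformly, the two agree closely; a heavy-tailed signal (large range, low power) drives Eq. (8) down, which is exactly the failure mode analyzed next.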
2.3. Quantization Loss Analysis on MobileNetV1
2.3.1. BatchNorm in Depthwise Convolution Layer
As shown in Figure 1(b), a typical MobileNetV1 core layer consists of a depthwise convolution and a pointwise convolution, each followed by a Batch Normalization (9) and a nonlinear activation function. In the TensorFlow implementation, ReLU6 (11) is used as the nonlinear activation function. Consider a layer input x with d channels and m elements in each channel within a mini-batch. The Batch Normalization transform in the depthwise convolution layer is applied on each channel independently and can be expressed as

(9) y_k = γ_k · (x_k − μ_k) / √(σ_k² + ε) + β_k,  k = 1, …, d

where (x_k − μ_k) / √(σ_k² + ε) is the normalized value of x on channel k; μ_k and σ_k² are the mean and variance over the mini-batch; and γ_k and β_k are the scale and shift. Note that ε is a given small constant; in the TensorFlow implementation, ε = 0.001.

The Batch Normalization transform can be further folded in the fixed-point pipeline. Let

(10) α_k = γ_k / √(σ_k² + ε),  b_k = β_k − γ_k · μ_k / √(σ_k² + ε)

Then equation (9) can be reformulated as

(11) y_k = α_k · x_k + b_k

In the TensorFlow implementation, for each channel k, α_k can be combined with the weights and folded into the convolution operation to further reduce the computation cost.
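A minimal NumPy sketch of this folding (Eqs. (10)-(11)); the (kh, kw, channels) weight layout and the function name are our assumptions for the sketch:

```python
import numpy as np

def fold_batchnorm(weights, gamma, beta, mean, var, eps=1e-3):
    """Fold per-channel BN into conv weights: w' = alpha * w, bias = b (Eq. 11)."""
    alpha = gamma / np.sqrt(var + eps)      # Eq. (10)
    bias = beta - alpha * mean              # Eq. (10)
    # depthwise weights assumed shaped (kh, kw, channels): scale each channel
    folded = weights * alpha.reshape(1, 1, -1)
    return folded, bias
```

After folding, inference needs only a convolution with the scaled weights plus a per-channel bias; no separate BN op remains in the fixed-point graph.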
Depthwise convolution is applied on each channel independently. However, the min and max values used for weight quantization are taken collectively from all channels. An outlier in one channel can easily cause a huge quantization loss for the whole model due to the enlarged data range. Without correlation across channels, depthwise convolution is prone to producing all-zero values in one channel, leading to zero variance (σ_k² = 0) for that specific channel; this is commonly observed in MobileNetV1 models. Referring to equation (10), zero variance in channel k produces a very large value of α_k due to the small constant ε. Figure 3 shows the observed α values across channels extracted from the first depthwise convolution layer of the MobileNetV1 float model. Notice that the outliers of α caused by the zero-variance issue largely increase the quantization range. As a result, quantization bits are wasted on preserving those large values, even though they all correspond to all-zero channels, while the small α values corresponding to informative channels are not well preserved after quantization, which badly hurts the representation power of the model. From our experiments, without retraining, properly handling the zero-variance issue (by setting the variance of a channel with all-zero values to the mean of the variances of the other channels in that layer) dramatically improves the top-1 accuracy of the quantized MobileNetV1 on the ImageNet2012 validation dataset over its 1.80% baseline in the TF 8-bit inference pipeline.

A standard convolution both filters and combines inputs into a new set of outputs in one step. In MobileNetV1, the depthwise separable convolution splits this into two layers, a depthwise layer for filtering and a pointwise layer for combining (1), thus drastically reducing computation and model size while preserving feature representations. Based on this principle, we can remove the nonlinear operations, i.e., Batch Normalization and ReLU6, between the two layers, and let the network learn proper weights to handle the Batch Normalization transform directly. This procedure preserves all the feature representations while making the model quantization-friendly. To further understand the per-layer output accuracy of the network, we use SQNR, defined in equation (8), as a metric to observe the quantization loss in each layer.
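The variance repair described above can be sketched as a small post-processing helper (our own code; the near-zero tolerance is an assumed threshold):

```python
import numpy as np

def fix_zero_variance(variances, tol=1e-12):
    """Replace (near-)zero per-channel BN variances with the mean of the
    remaining channels' variances, preventing the alpha outliers of Eq. (10)."""
    v = np.asarray(variances, dtype=np.float64).copy()
    dead = v <= tol                       # channels with all-zero outputs
    if dead.any() and not dead.all():
        v[dead] = v[~dead].mean()         # borrow a typical variance
    return v
```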
Figure 4 compares the averaged per-layer output SQNR of the original MobileNetV1 with α folded into the convolution weights (black curve) against a version that simply removes Batch Normalization and ReLU6 from all depthwise convolution layers (blue curve); Batch Normalization and ReLU6 are kept in all pointwise convolution layers. 1,000 images, randomly selected from the ImageNet2012 validation dataset (one in each class), are used. From our experiment, introducing Batch Normalization and ReLU6 between the depthwise convolution and the pointwise convolution in fact largely degrades the per-layer output SQNR.
2.3.2. ReLU6 or ReLU
In this section, we again use SQNR as a metric to measure the effect of choosing different activation functions in all pointwise convolution layers. Note that for a linear quantizer, SQNR is higher when the signal distribution is more uniform, and lower otherwise. Figure 4 shows the averaged per-layer output SQNR of MobileNetV1 using ReLU and ReLU6 as the activation function in all pointwise convolution layers. A huge SQNR drop is observed in the first pointwise convolution layer when using ReLU6. Based on equation (8), although ReLU6 helps to reduce the quantization range, the signal power also gets reduced by the clipping operation; ideally, this should produce an SQNR similar to that of ReLU. However, clipping the signal at early layers can have the side effect of distorting the signal distribution to make it less quantization-friendly, as the network compensates for the clipping loss during training. As we observed, this leads to a large SQNR drop from one layer to the next. Experimental results on the accuracy improvement from replacing ReLU6 with ReLU are shown in Section 4.

2.3.3. L2 Regularization on Weights
Since SQNR is tightly coupled with the signal distribution, we further enable L2 regularization on the weights of all depthwise convolution layers during training. L2 regularization penalizes weights with large magnitudes. Large weights can increase the quantization range and make the weight distribution less uniform, leading to a large quantization loss. By enforcing a better weight distribution, a quantized model with a higher top-1 accuracy can be expected.
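In training-loss terms, this simply adds a weight-decay penalty over the depthwise kernels; a sketch (the `weight_decay` value here is an assumed hyperparameter, not the paper's setting):

```python
import numpy as np

def l2_penalty(depthwise_weights, weight_decay=4e-5):
    """L2 regularization term added to the training loss: the sum of squared
    depthwise-kernel entries, scaled by an assumed decay coefficient."""
    return weight_decay * sum(np.sum(w ** 2) for w in depthwise_weights)
```

The total loss then becomes cross-entropy plus this penalty, discouraging large-magnitude weights that would stretch the quantization range.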
3. Quantization-Friendly Separable Convolution for MobileNets
Based on the quantization loss analysis in the previous section, we propose a quantization-friendly separable convolution framework for MobileNets. The goal is to solve the large quantization loss problem so that the quantized model achieves accuracy similar to the float model, while requiring no retraining for the fixed-point pipeline.
3.1. Architecture of the Quantization-Friendly Separable Convolution
Figure 1(b) shows the separable convolution core layer in the current MobileNetV1 architecture, in which a Batch Normalization and a nonlinear activation operation are introduced between the depthwise convolution and the pointwise convolution. From our analysis, due to the nature of depthwise convolution, this architecture leads to a problematic quantized model. Therefore, as shown in Figure 1(c), we make three major changes to render the separable convolution core layer quantization-friendly.

Batch Normalization and ReLU6 are removed from all depthwise convolution layers. We believe that a separable convolution should consist of a depthwise convolution followed directly by a pointwise convolution, without any nonlinear operation between the two. This not only preserves feature representations well, but is also quantization-friendly.

All ReLU6 activations are replaced with ReLU in the remaining layers. In the TensorFlow implementation of MobileNetV1, ReLU6 is used as the nonlinear activation function; however, we think the clipping threshold of 6 is a rather arbitrary number. Although (11) indicates that ReLU6 can encourage a model to learn sparse features earlier, clipping the signal at early layers may lead to a quantization-unfriendly signal distribution, and thus largely decrease the SQNR of the layer output.

L2 regularization on the weights in all depthwise convolution layers is enabled during training.
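Putting the three changes together, the quantization-friendly core layer computes depthwise conv, then pointwise conv, then BN, then ReLU, with nothing between the two convolutions. A NumPy sketch for a 1x1-spatial case (the collapsed kernel shapes and the function name are our simplifying assumptions):

```python
import numpy as np

def qf_separable_block(x, dw_w, pw_w, gamma, beta, mean, var, eps=1e-3):
    """Quantization-friendly core layer, Figure 1(c) ordering."""
    # depthwise: per-channel scaling (kernel collapsed to 1x1 for the sketch)
    dw_out = x * dw_w                       # NO BN, NO ReLU6 here
    # pointwise: 1x1 conv mixes channels
    pw_out = dw_out @ pw_w                  # shape (..., out_channels)
    # batch norm on the pointwise output, then plain ReLU (not ReLU6)
    bn_out = gamma * (pw_out - mean) / np.sqrt(var + eps) + beta
    return np.maximum(bn_out, 0.0)
```

Compared to Figure 1(b), the only nonlinearity and normalization sit after the channel-combining step, which is what keeps the intermediate depthwise output quantization-friendly.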
3.2. A Quantization-Friendly MobileNetV1 Model
The layer structure of the proposed quantization-friendly MobileNetV1 model is shown in Table 2, which follows the overall layer structure defined in (1). The separable convolution core layer has been replaced with the quantization-friendly version described in the previous section. The model thus retains its efficiency in terms of computational cost and model size, while achieving high precision on fixed-point processors. In the table, DC denotes depthwise convolution, PC pointwise convolution, and BN Batch Normalization.
Input  Operator  Repeat  Stride 

224x224x3  Conv2d+ReLU  1  2 
112x112x32  DC+PC+BN+ReLU  1  1 
112x112x64  DC+PC+BN+ReLU  1  2 
56x56x128  DC+PC+BN+ReLU  1  1 
56x56x128  DC+PC+BN+ReLU  1  2 
28x28x256  DC+PC+BN+ReLU  1  1 
28x28x256  DC+PC+BN+ReLU  1  2 
14x14x512  DC+PC+BN+ReLU  5  1 
14x14x512  DC+PC+BN+ReLU  1  2 
7x7x1024  DC+PC+BN+ReLU  1  2 
7x7x1024  AvgPool  1  1 
1x1x1024  Conv2d+ReLU  1  1 
1x1x1000  Softmax  1  1 
4. Experimental Results
We trained the proposed quantization-friendly MobileNetV1 float models using the TensorFlow training framework, following the same training hyperparameters as MobileNetV1 except that we used a single Nvidia GeForce GTX TITAN X card and a batch size of 128. The ImageNet2012 dataset is used for training and validation. Note that training is only required for the float models.
The experimental results from applying each change to the original MobileNetV1 model, in both the float pipeline and the 8-bit quantized pipeline, are shown in Figure 5. In the float pipeline, our trained float model achieves a top-1 accuracy similar to that of the original MobileNetV1 TF model. In the 8-bit pipeline, removing Batch Normalization and ReLU6 from all depthwise convolution layers dramatically improves the top-1 accuracy of the quantized model. Simply replacing ReLU6 with ReLU then improves the top-1 accuracy of 8-bit quantized inference further, and enabling L2 regularization on the weights in all depthwise convolution layers during training improves the overall 8-bit accuracy by a further margin. From our experiments, the proposed quantization-friendly MobileNetV1 model attains an 8-bit quantized accuracy close to the accuracy that the same model maintains in the float pipeline (see Figure 5 for the per-change numbers).
5. Conclusion and Future Work
We proposed an effective quantization-friendly separable convolution architecture and integrated it into MobileNets for image classification. Without reducing accuracy in the float pipeline, the proposed architecture shows a significant accuracy boost in the 8-bit quantized pipeline. To generalize this architecture, we will apply it to more networks based on separable convolution, e.g., MobileNetV2 (12) and ShuffleNet (13), and verify their fixed-point inference accuracy. We will also apply the proposed architecture to object detection and instance segmentation applications, and measure the power and latency of the proposed quantization-friendly MobileNets on device.
References
 (1) A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. Apr. 17, 2017, https://arxiv.org/abs/1704.04861.
 (2) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Sep. 4, 2014, https://arxiv.org/abs/1409.1556.
 (3) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on CVPR, 2015.
 (4) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Dec. 10, 2015, https://arxiv.org/abs/1512.03385.
 (5) B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Dec. 15, 2017, https://arxiv.org/abs/1712.05877.
 (6) Google TensorFlow MobileNetV1 Model. https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_1.0_224_float_2017_11_08.zip
 (7) Google TensorFlow InceptionV3 Model. http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz
 (8) Google TensorFlow Framework. https://www.tensorflow.org/
 (9) S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Feb. 11, 2015, https://arxiv.org/abs/1502.
 (10) Udo Zölzer. Digital Audio Signal Processing, Chapter 2. John Wiley & Sons, Dec. 15, 1997.

 (11) A. Krizhevsky. Convolutional Deep Belief Networks on CIFAR-10. http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf.
 (12) M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. Jan. 13, 2018, https://arxiv.org/abs/1801.04381.
 (13) X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. Dec. 7, 2017, https://arxiv.org/abs/1707.01083.
 (14) J. Cheng, P. Wang, G. Li, Q. Hu, and H. Lu. Recent Advances in Efficient Computation of Deep Convolutional Neural Networks. Feb. 11, 2018, https://arxiv.org/abs/1802.00939.