1 Introduction
Deep neural networks have achieved great success in many applications, such as computer vision Krizhevsky et al. (2012), speech recognition Hinton et al. (2012), and natural language processing Bahdanau et al. (2014). In computer vision, the most widely used architecture, the Convolutional Neural Network (CNN), has demonstrated state-of-the-art results in many tasks. However, CNN-based recognition systems need large amounts of memory and computational power, and training may take up to weeks on a modern multi-GPU server for large datasets such as ImageNet Deng et al. (2009). Hence, they are often unsuitable for smaller devices such as embedded electronics, and there is a pressing demand for techniques that yield models with reduced size, faster inference and lower power consumption.
Accordingly, a large body of literature has focused on reducing model size through quantization Lin et al. (2015); Courbariaux et al. (2015); Lin et al. (2016), low-rank matrix factorization Denton et al. (2014); Jaderberg et al. (2014), architecture pruning Han et al. (2015a, b), etc. Quantization is one of the simpler ways to reduce the complexity of a model: it lowers the precision required for weights and activations and speeds up computation.
The methods for quantizing gradients, weights and activations Lin et al. (2015); Courbariaux et al. (2015); Hubara et al. (2017); Zhou et al. (2016); Rastegari et al. (2016); Banner et al. (2018) have achieved performance much closer to that of full-precision networks. However, even after such efficient quantized neural network (QNN) training, floating-point operators remain in model inference, in particular because of the float-type numbers in Batch Normalization (BN).
Since deep networks are hard to train, the BN operator Ioffe and Szegedy (2015) is usually used. Implementing conventional BN requires computing the mean and variance, which involves sum-of-squares, square-root and reciprocal operations. These operators require float-type numbers for high precision. Previous attempts at low-precision networks either do not use BN layers Wu et al. (2018) or keep them in full precision Zhou et al. (2016). In addition, Hubara et al. (2017) proposed shift-based BN, which replaces almost all multiplications with power-of-2 approximations and shift operations, and Banner et al. (2018) devised Range BN, which normalizes the variance according to the range of the input distribution. These modifications of BN for lower computational cost are heuristic and confirmed only by experiments. Besides, modern hardware has lower deployment cost with fixed-point arithmetic, and some hardware supports only fixed-point numbers Ignatov et al. (2018). Hence, previous QNN training methods with float-type BN layers cannot be deployed on entirely fixed-point hardware.
In this paper we develop a direct way to handle the floating numbers in BN after obtaining a benchmark QNN with previous training methods. We view the conventional BN operator, combined with the quantization operators used in the feedforward step, as an affine transformation with only two floating numbers. To eliminate the floating numbers in BN, we start by considering the exact substitution of these two floating numbers by as few fixed numbers as possible. We recommend quantizing the floating numbers with a shared quantized scale, and we show mathematically that all floating-point operations can be converted to fixed-point operators in an idealized case. We also establish lower and upper bounds on the least possible quantized scale in our scheme. Given the proposed quantized scale, a small search determines the remaining integers. Our method, which is used at model deployment, also preserves the precision of QNNs in experiments, and can be seen as a supplement to modern QNN training.
Our main contributions are summarized as follows:

We propose a new method for quantizing BN in model deployment by converting the affine transformation with two floating-point parameters into a fixed-point operation with a shared quantized scale.

We give a theoretical guarantee for our quantization method, including the existence of the quantized scale and the magnitude of the optimal quantized bit-width. In addition, we search numerically for the exact least possible quantized scale for each number of quantized bits to support our transformation.

We evaluate our method on the CIFAR and ImageNet datasets with benchmark quantized neural networks, showing little performance degradation, if any.
The remainder of our paper is organized as follows. In Section 2, we formulate the problem and present the motivation for our transformation. In Section 3, we give an equivalent form of the problem in Section 2 and show upper and lower bounds on the admissible solution through several properties. In Section 4, we give a simple search algorithm for finding the exact value of the solution and verifying the results of Section 3. In Section 5, we briefly discuss the precision loss in practical hardware implementation and validate our method numerically on common quantized neural networks. Finally, we conclude in Section 6.
2 Problem Formulation
Quantization mainly focuses on compressing parameters in neural networks, such as weights in convolutional and fully-connected layers, activation outputs of hidden layers, and parameters in BN. A common design for QNNs is uniform quantization Zhou et al. (2017); Krishnamoorthi (2018), which distributes quantized values evenly among the possible values, that is,
where (the set of positive integers) is the quantized scale, which measures the smallest gap between quantized values, and means the floor function. The input is restricted to a predefined interval due to limited computing power. In particular, for bit uniform quantization, choose for hardware-friendly support, and simply choose for a non-symmetric quantizer and for a symmetric quantizer.
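The quantizer formula itself is not reproduced in the text above, so the following is only a minimal sketch of a floor-based uniform quantizer. It assumes the form `floor(k * x) / k` on the non-symmetric interval `[0, 1]` and the scale `k = 2**n - 1` for `n` bits; the function names `uniform_quantize` and `scale_for_bits` are hypothetical.

```python
import math

def uniform_quantize(x, k, lo=0.0, hi=1.0):
    """Clip x into [lo, hi], then snap it down to the grid {i / k}.

    Assumed form (not confirmed by the text): Q_k(x) = floor(k * x) / k.
    """
    x = min(max(x, lo), hi)
    return math.floor(k * x) / k

def scale_for_bits(n_bits):
    # Assumed choice of quantized scale for n-bit quantization: k = 2^n - 1.
    return 2 ** n_bits - 1

k = scale_for_bits(2)
# Collect the distinct quantized values produced on a fine grid of inputs.
levels = sorted({uniform_quantize(i / 100, k) for i in range(101)})
```

With a 2-bit scale the quantizer can only emit the four values 0, 1/3, 2/3 and 1, and neighbouring values are 1/k apart.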
The weights which should be uniformly quantized may first be mapped into the required input value range through normalized operator Hubara et al. (2017)
or combined with hyperbolic tangent function Zhou et al. (2016)
Here the maximum is taken over all weights in the same layer.
Activations are harder to quantize than weights because the quantizer functions make the optimization non-differentiable Cai et al. (2017); Vanhoucke et al. (2011). Several works Lin et al. (2016); Rastegari et al. (2016); Zhou et al. (2016)
have addressed this problem using bit-width allocation across layers or a continuous approximation to the quantizer function in the back-propagation step.
A common quantization approach for activations Mishra et al. (2017); Hubara et al. (2017); Zhou et al. (2016) quantizes output of the previous layer after clipping inputs into a limited interval, that is,
where , and are usually integers, and
is some quantizer function defined earlier. For the commonly used ReLU function,
, or Howard et al. (2017); Sandler et al. (2018) (ReLU6) or some specific positive integers.
Since weights and activations are of the form with constant quantization bits in the denominator, the feedforward calculation can be executed entirely with fixed-point operations, without regard to the fixed denominator, which enables hardware acceleration.
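To illustrate why a fixed denominator allows integer-only feedforward computation, here is a small sketch. It assumes activations of the form `i / k_a` and weights of the form `j / k_w` with integer numerators; the names `k_a`, `k_w`, `integer_dot` and the sample values are hypothetical.

```python
from fractions import Fraction

def integer_dot(act_levels, weight_levels):
    """Accumulate integer numerators only; the denominator is handled separately."""
    return sum(i * j for i, j in zip(act_levels, weight_levels))

k_a, k_w = 15, 3                 # assumed quantized scales of activations and weights
act_levels = [0, 7, 15, 4]       # activations are act_levels[t] / k_a
weight_levels = [3, -1, 2, -3]   # weights are weight_levels[t] / k_w

acc = integer_dot(act_levels, weight_levels)

# Reference value computed with exact rational arithmetic.
exact = sum(Fraction(i, k_a) * Fraction(j, k_w)
            for i, j in zip(act_levels, weight_levels))
```

The integer accumulator `acc`, divided by the constant `k_a * k_w`, equals the exact dot product, so the division can be deferred and folded into later layers.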
Benchmark models employ BN for deep neural network training. The conventional BN operator is two affine transformations of four parameters:
where and
are the mean and standard error estimators of the output in the same layer, updated according to training batch statistics. Besides conventional BN, several BN variants use a similar transformation with a different calculation method, such as
BN Hoffer et al. (2018) with norm of feature as , Range BN Banner et al. (2018) with maximum gap in feature as , Group Normalization Wu and He (2018) and Instance Normalization Ulyanov et al. (2016) with different parts of the feature for calculating and . In any case, all parameters are fixed during inference and deployment.
We emphasize that floating points come from the division operation in BN and its variants, since is the running mean of the batch feature in the same layer, and are updated based on quantized gradients, but the reciprocal of fails.
Previous state-of-the-art networks such as ResNet He et al. (2016) and VGG Simonyan and Zisserman (2014)
employ BN after a convolutional or fully-connected layer, followed by an activation function. Let
denote the set of integer numbers. Suppose we use and to quantize activations and weights. Then, from the quantizer function , we can see that outputs of previous layers are of the type with and quantized weights with . Therefore, the outputs of a convolutional or fully-connected layer are of the type , where is the number related to the calculation of outputs in the next layer, and is the bias term, if it exists, which may also be quantized. We record this output as the type with . Here we do not identify the specific corresponding positions of parameters when there is no ambiguity.
After a quantized convolutional or fully-connected layer come BN and a quantized activation, so an element of the next layer's outputs is
(1)  
Here we exchange the order of and round function because and are integers in second equality. We also simplify the two-affine operator of BN to a one-affine operator and replace some terms as follows:
Generally, floating-point operations cause a loss of precision, so this modification may introduce some error due to limited computation precision. We briefly examine this experimentally in Section 5.
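The reduction of BN from two affine steps to one can be sketched as follows. This is the standard BN folding identity and not the paper's exact substitution, which additionally absorbs the quantization scales; the parameter names `gamma`, `beta`, `mu`, `sigma` are assumed, and any stabilizing epsilon is taken to be already included in `sigma`.

```python
def bn_two_affine(x, gamma, beta, mu, sigma):
    """Conventional BN at inference: normalize, then scale and shift."""
    normalized = (x - mu) / sigma
    return gamma * normalized + beta

def fold_bn(gamma, beta, mu, sigma):
    """Collapse the four BN parameters into a single affine map a * x + b."""
    a = gamma / sigma
    b = beta - gamma * mu / sigma
    return a, b

a, b = fold_bn(gamma=2.0, beta=1.0, mu=0.5, sigma=4.0)
x = 3.0
two_step = bn_two_affine(x, gamma=2.0, beta=1.0, mu=0.5, sigma=4.0)
one_step = a * x + b
```

In exact arithmetic both paths agree for every input; in floating point they can differ by rounding error, which is the precision concern raised above.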
Without loss of generality, we assume clipped from ReLU function. Other cases can be obtained by shifting the clip range by
Then and .
A natural attempt is to make all operators involve only fixed-point numbers, which means quantizing the floating points and to some integers. A simple way is to let the floating numbers and be integers respectively, which is unlikely to work because the input variable may vary in a wide range. A weaker requirement is to quantize and in the same way as the quantizer functions and do, namely to approximate and by rational numbers:
However, this would add two more numbers and gives no hardware support in advance if either or is decided. Another way is to consider whether the surrogates of and can share the same quantized scale, for example, the same pre-fixed denominator for all potential floating numbers and , which is consistent with the previous scheme, because
where we set , , . Then and share the same pre-fixed quantized scale .
Thus we want to replace the float numbers by integers for hardware-friendly execution, where should not change when and vary, so as to support a fixed hardware design.
Problem 1
Given , the problem is to find some (as small as possible), such that for all , there exist some , such that for all ,
(2) 
In particular, we are able to examine whether can be directly converted to integers if already satisfies.
After having obtained , we also need to decide the remaining quantized number based on the given , which is the second problem we need to tackle.
Problem 2
Given , and a suitable which is one solution in Problem 1, the problem is to find , such that for all ,
Particularly, in the conventional bit quantization setting, we would set for some small and give the least bits number of satisfied .
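Eq. (2) and the condition in Problem 2 are not reproduced in the text, so the sketch below only illustrates the kind of check involved. It assumes that the float path is `clip(floor(a * x + b), 0, N)`, that the fixed-point path is `clip(floor((A * x + B) / K), 0, N)`, and that `x` ranges over integers; all function names and the sample values are hypothetical.

```python
import math

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def float_path(a, b, x, n):
    """Assumed float-type output: affine map, floor, then clip to [0, n]."""
    return clip(math.floor(a * x + b), 0, n)

def fixed_path(A, B, K, x, n):
    """Assumed integer-only output; // on Python ints is floor division."""
    return clip((A * x + B) // K, 0, n)

def agrees(a, b, A, B, K, n, xs):
    """True if both paths give the same output for every integer input in xs."""
    return all(float_path(a, b, x, n) == fixed_path(A, B, K, x, n) for x in xs)

xs = range(-64, 65)
ok = agrees(0.5, 0.25, A=2, B=1, K=4, n=3, xs=xs)    # (2x + 1) / 4 equals 0.5x + 0.25
bad = agrees(0.5, 0.25, A=3, B=1, K=4, n=3, xs=xs)   # slope 3/4 is too steep
```

A finite input range is enough in this sketch because both outputs saturate at the clip bounds once `x` is far enough from zero.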
3 Theoretical Analysis
In this section, we analyze Problem 1 and Problem 2 through several propositions and leave the details to Appendix B.
3.1 Analysis of Problem 1
For Problem 1, a direct way to deal with the randomness of is to specify all clipped intervals when varies.
Proposition 1
Problem 1 is equivalent to finding such a that for all , we can obtain , s.t.
(3) 
Here means the ceiling function. We first analyze Eq. (3) without considering and leave the constraint to Appendix E, since it has little influence in practice.
Set . According to Proposition 1, we only need to analyze the sequence (here we replace by for notational simplicity). It seems hard to give an analytic formula for , though we only want a rough understanding of the least possible for future hardware support. Let the minimal which satisfies all sequences of length be . First, we need to roughly understand whether our transformation is reasonable. The following proposition guarantees the existence of and also gives the magnitude of w.r.t. .
Proposition 2
, exists. Moreover, if , then as long as , we can find satisfying the requirements. Hence, .
The key insight behind the existence of is that when is large enough, there are more choices to search for empirically, on account of the finite constraints in Eq. (3). The proof technique for the above proposition is simply to remove with conditions according to all possible sequences of and to apply the discreteness of integers. The proof is given in Appendix B.2.
As for a lower bound on , we can take some specific sequences to see how large the possible is. The idea behind choosing such sequences is to make the maximum gap between the items of each sequence small, in order to quickly check the existence of .
Proposition 3
If , we have . Furthermore, if , then .
Remark 1
Remark 2
Besides, some special input ranges would obviously decrease the magnitude of , as the case in Appendix G.2 shows.
3.2 Analysis of Problem 2
Intuitively, given and leaving out floor function and clip function, we have
Hence, we can get . Because of , we obtain (or ) and (or ). However, after a subsequent search over all possible , we find that it is not always the nearest integer (see Proposition 8 and Appendix B.8). The gap between a suitable and the intuitive approximation depends on the specific . Nevertheless, the intuitive way to obtain is correct most of the time.
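The intuitive approximation can be turned into a small neighbourhood search. The sketch below assumes that the float path is `clip(floor(a * x + b), 0, N)` and the fixed-point path is `clip(floor((A * x + B) / K), 0, N)`, with candidates taken near `floor(a * K)` and `floor(b * K)`; the function names and the `radius` parameter are hypothetical, and this is not the conservative procedure of Proposition 4.

```python
import math
from itertools import product

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def matches(a, b, A, B, K, n, xs):
    """Compare the assumed float path with the assumed integer-only path."""
    return all(
        clip(math.floor(a * x + b), 0, n) == clip((A * x + B) // K, 0, n)
        for x in xs
    )

def search_near_intuitive(a, b, K, n, xs, radius=1):
    """Try integer pairs (A, B) close to (a * K, b * K); return the first match."""
    A0, B0 = math.floor(a * K), math.floor(b * K)
    # Offsets run from floor - radius up to floor + radius + 1,
    # so the ceiling of a non-integer product is covered as well.
    offsets = range(-radius, radius + 2)
    for dA, dB in product(offsets, repeat=2):
        A, B = A0 + dA, B0 + dB
        if matches(a, b, A, B, K, n, xs):
            return A, B
    return None  # no candidate in the neighbourhood works for this K

xs = range(-64, 65)
pair = search_near_intuitive(0.5, 0.25, K=4, n=3, xs=xs)
```

If the function returns `None`, either the neighbourhood is too small or the chosen scale `K` is not admissible for these parameters, which is the situation the exact analysis is meant to rule out.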
To establish given that is suitable based on the previous understanding, we offer a conservative way to obtain all appropriate , as described in Proposition 4 and Algorithm 5 (see Appendix F).
Proposition 4
Given and proper , a pair of satisfies Problem 2 if and only if there exists such a that the following conditions are met:
(4)  
where .
Moreover, we show that the intuitive way to obtain is partially right in Proposition 8, and give a loose bound on the possible ranges of and in Proposition 9 (see Appendix A). In addition, for a general quantizer range instead of in Problem 1, we can apply a similar analysis; see Appendix G.1.
3 (2bit)  2  7 (3bit)  9  15 (4bit)  51  31 (5bit)  289 

63 (6bit)  1459  127 (7bit)  6499  255 (8bit)  28323 
4 Numerical Computation
In the previous section we have discussed theoretically. In this section, we present how to compute the accurate to evaluate the theoretical analysis.
For convenience, we are able to only consider and by Proposition 5. Then , , , for .
Proposition 5
Considering and is enough for finding .
To be safe, we naively save all possible sequences and then check, for every , whether the corresponding exist. We generate the sequences of recursively based on the following result and the scope of .
Proposition 6
Suppose is a satisfied sequence of length . Then is a satisfied sequence of length . In addition, if is a satisfied sequence of length , there must be a sequence of length and with the same previous terms.
Now we turn to finding accurately. On account of Proposition 3, we start the search from when is larger than . We use brute force and follow the proposition below to search for the which satisfies all possible sequences .
Proposition 7
Given a sequence , if
(5) 
then there exist that can generate the sequence above.
The whole process is shown below:

First, produce candidate sequences. Note that from Proposition 6, we can get all the recursively.

Second, record all satisfied sequences . Given a candidate sequence , we use Proposition 7 to confirm there exist which can reproduce this sequence.

Third, based on the obtained sequence , check whether the given is satisfied through Proposition 4.
Because the number of possible sequences is large, we suggest searching a small part of all sequences at a time and sliding this window successively until all sequences are satisfied. The main pseudo-algorithm and the auxiliary pseudo-algorithms for searching the exact are all shown in Appendix F.
4.1 Experimental Results
Table 1 gives the results for specific bit quantized activations; results for other are given in Appendix C. Because of numerical errors, we also re-examine a wide range of inputs in with large enough to cover every possible input.
We also plot the growth trend of vs in Figure 1. Since saving all sequences is time-consuming and space-consuming, we only search in and . The magnitude depicted in the figure is consistent with Propositions 2 and 3, showing the quadratic growth of .
In practice, it is not necessary to find , since we only need some proper to convert the float-type parameters in BN while providing a hardware-friendly ; the exact is therefore not necessarily the best choice.
5 Experiments
In this section we assess the effect of our BN quantization method on previously trained QNNs on ImageNet Russakovsky et al. (2015) (we also try CIFAR-10 Krizhevsky et al. (2009) in Appendix H). Our method is applied to model deployment at the layer level, so it suits various quantized neural networks using conventional BN or its variants. Besides, we aim to verify how accuracy varies under our hardware-friendly modification, rather than to obtain high accuracy as previous quantization training methods do.
There are two main concerns we should verify. First, the derivation in Eq. (1) absorbs the parameters of BN into the neighboring convolutional or fully-connected layer, which may cause precision degradation. We refer to as the mode that absorbs the BN parameters used in such a layer through Eq. (1), based on quantized trained models. Second, because the theoretical analysis is based on real numbers rather than floating numbers, we should check whether the theoretical results hold under floating-point representation. Although the numerical experiments in Section 4 have already examined this, we report the final accuracy to support our method again. We adopt , which uses our fixed-point layer-wise replacements in Problem 1 based on , in the following experimental tables.
We use DoReFa-Net Zhou et al. (2016), one of the efficient QNNs, to train a quantized model for verification, and denote as the accuracy with DoReFa-Net quantization methods.
DoReFa-Net has low bit-width weights and activations and uses low bit-width parameter gradients to accelerate both training and inference. To obtain more suitable baselines, we quantize only weights and activations while using full-precision gradients in training, and we do not quantize the final fully-connected layer for better training performance. We adopt to represent bit quantized weights and bit quantized activations. We examine 2W4A, 4W4A and 8W8A in subsequent experimental tables. We choose and for bit and bit quantized activations according to the discussion in Appendix D.
5.1 Quantization for ImageNet Classification
The next set of experiments studies our method using VGG16 Simonyan and Zisserman (2014) and ResNet18 He et al. (2016) on the ImageNet dataset. We adopt data augmentation including random cropping, horizontal flipping, lighting and color jittering, and use the Adam optimizer with a step-wise learning rate initialized from divided by on each epochs with total epochs. After we obtain the trained QNN, we convert the final model into the and modes mentioned above. The main results are shown in Table 2.
5.2 Accuracy and Complexity Analysis
From Table 2, the QNN training method achieves comparable performance even when quantized with and . In addition, once a QNN has been trained, our method absorbs the only floating points, which are in BN. From the experimental results, differs only slightly from the original quantized model , which means that the one-affine transformation in Eq. (1) introduces only a tiny disturbance. Moreover, we observe that our substitution in Problem 1 gives identical results for and . In principle, and could differ when encountering operators that exceed the computer precision. We prefer our case because of its entirely fixed-point implementation and, at most, slight performance degradation.
Additionally, practical applications of QNNs mainly use small bit-widths (up to 8 bits to our knowledge, with ) for quantization. From Proposition 4 (or Proposition 9 in Appendix A), a single conversion in the worst case is , but time can be saved if we search around the intuitive values and ; moreover, we only need to convert the model once, with few BN parameters.
Bits  2W4A  4W4A  8W8A  

Top1  Top5  Top1  Top5  Top1  Top5  
69.46  88.84  70.62  89.46  70.83  89.59  
69.43  88.88  70.52  89.45  70.81  89.58  
69.43  88.88  70.52  89.45  70.81  89.58  
VGG16  
65.94  86.54  68.15  88.09  68.67  88.18  
65.99  86.55  68.15  88.11  68.66  88.17  
65.99  86.55  68.15  88.11  68.66  88.17  
ResNet18 
6 Conclusion
In this paper we have investigated the problem of quantizing floating points in BN, combined with quantized convolutional or fully-connected weights and activations. Noting that conventional BN includes two affine transformations, we have collected all floating points into one affine operator with only two floating-point numbers. We have accordingly proposed to quantize each floating-point number in the converted one-affine operator with a shared quantized scale. We have shown in theory that the bit-width of a possible quantized scale is roughly twice the number of activation bits, and have given numerical computation schemes for the corresponding substitution. In our experiments with high-precision inference, our approach shows no additional error relative to the absorbed-BN model. The strategy we recommend enables efficient model deployment with entirely fixed-point operators.
It is worth emphasizing that our quantization scheme is suitable for other affine operators which are also common in deep NNs. Beyond BN as well as the NN context, we believe that our quantization scheme has potential applications in other problems that involve quantization.
Broader Impact
a) & b) If the proposed quantization method is verified by engineers to be useful in numerous real applications in the future, it will have a positive impact on model compression. Hence, previous styles of QNNs would pay attention to entirely fixed-point, effective QNN design with BN. c) The method we propose covers all possible outputs and shows convincing results in theory, though it may fail when the range of fixed-point numbers does not support the large converted range of , which we seldom see in practice. Besides, floating-point operations are likely to introduce precision errors, which the first conversion step of our method would encounter. d) Our method is data irrelevant.
References
Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Banner et al. (2018). Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pp. 5145–5153.
Cai et al. (2017). Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
Courbariaux et al. (2015). BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
Han et al. (2015a). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
Han et al. (2015b). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hinton et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29.
Hoffer et al. (2018). Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pp. 2160–2170.
Howard et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hubara et al. (2017). Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
Ignatov et al. (2018). AI benchmark: running deep neural networks on Android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV).
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Jaderberg et al. (2014). Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krishnamoorthi (2018). Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images.
Krizhevsky et al. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Lin et al. (2016). Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858.
Lin et al. (2015). Neural networks with few multiplications. arXiv preprint arXiv:1510.03009.
Mishra et al. (2017). WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134.
Rastegari et al. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Russakovsky et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
Sandler et al. (2018). MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Ulyanov et al. (2016). Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
Vanhoucke et al. (2011). Improving the speed of neural networks on CPUs.
Wu et al. (2018). Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.
Wu and He (2018). Group normalization. In The European Conference on Computer Vision (ECCV).
Zhou et al. (2017). Balanced quantization: an effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32 (4), pp. 667–682.
Zhou et al. (2016). DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
Appendix A Additional Propositions
Proposition 8
Proposition 9
Appendix B Proof of Propositions
The following proofs may use the properties of the ceiling function below.
Lemma 1 (The properties of ceiling function)
B.1 Proof of Proposition 1
Proof: Without loss of generality, we may assume , otherwise we can reverse . Then we can see as , and must have the same sign, so while we leave the case in Appendix E.
Then, according to the property of the floor function, the original problem is equivalent to the following situations:
(6) 
We can see ,
Hence .
B.2 Proof of Proposition 2
Proof: It is obvious that if , then .
When , we only need to consider the problem in Proposition 1, that is, ,
Set ,
(7) 
Then
Pay attention to , so that
which means
Analyzing the cases of and , we obtain
,
(8) 
It follows from Lemma 1 in Appendix B that , ,
(9) 
, ,
Hence,
Therefore .
We make
(10) 
Then always exists because the scope of is larger than one and includes at least one integer.
A naive idea is that when ,
When and , we have that
So if , we are able to find given any . Therefore, exists.
Before the remaining proof, we need the lemma below.
Lemma 2
Suppose and , then and
where is the fractional part of .
Now we turn to the remaining part of Proposition 2.
Using Proposition 5, we only need to consider . Let us take a more precise analysis from Eq. (10). Suppose
(11) 
When and ,
So we only need to check .

If , .

If , .
Set . Use Proposition 5, then .

If , then , or , then .
Hence
No solution!

If , then , or , then .
Hence
No solution!
Therefore,


If , .
Set , which leads to
(13)
So we can see
Next we will prove that satisfies the following cases.
Case1. If or , then
Case2. If , then
From Case1, we may assume . Meanwhile take Eq. (13) into the condition , we get , and
Since , therefore
(14) Hence
(15) (16) 
If or , by the previous fact that and Eq. (13), we can get the range of :
We set
then
The last inequality comes from .

When ,
If ,
Then the range of is which is less than , and the fraction of the left point of can’t be zero by Lemma 2. So , else , which means , to Case2.
Then . As , so . Due to , hence
Therefore , but this time
No solution!
Therefore , so , to Case2. 
Similarly we can get same results when with
From previous discussion, we obtain
Hence, exists from Eq. (10) when and .
B.3 Proof of Proposition 3
Proof: Consider through special choices of when .

, , with sequence ,

,