Log In Sign Up

Optimal Quantization for Batch Normalization in Neural Network Deployments and Beyond

Quantized Neural Networks (QNNs) use low bit-width fixed-point numbers for representing weight parameters and activations, and are often used in real-world applications due to their saving of computation resources and reproducibility of results. Batch Normalization (BN) poses a challenge for QNNs for requiring floating points in reciprocal operations, and previous QNNs either require computing BN at high precision or revise BN to some variants in heuristic ways. In this work, we propose a novel method to quantize BN by converting an affine transformation of two floating points to a fixed-point operation with shared quantized scale, which is friendly for hardware acceleration and model deployment. We confirm that our method maintains same outputs through rigorous theoretical analysis and numerical analysis. Accuracy and efficiency of our quantization method are verified by experiments at layer level on CIFAR and ImageNet datasets. We also believe that our method is potentially useful in other problems involving quantization.


page 1

page 2

page 3

page 4


Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

Deep convolutional neural network (CNN) inference requires significant a...

Scalable Methods for 8-bit Training of Neural Networks

Quantized Neural Networks (QNNs) are often used to improve network effic...

Batch Normalization in Quantized Networks

Implementation of quantized neural networks on computing hardware leads ...

Learning In Practice: Reasoning About Quantization

There is a mismatch between the standard theoretical analyses of statist...

Low-Precision Batch-Normalized Activations

Artificial neural networks can be trained with relatively low-precision ...

Quantized Memory-Augmented Neural Networks

Memory-augmented neural networks (MANNs) refer to a class of neural netw...

PAMS: Quantized Super-Resolution via Parameterized Max Scale

Deep convolutional neural networks (DCNNs) have shown dominant performan...

1 Introduction

Deep neural networks have achieved great success in many applications, such as computer vision

Krizhevsky et al. (2012), speech recognition Hinton et al. (2012)

and Natural Language Processing

Bahdanau et al. (2014)

. In computer vision, the most proposed architecture, Convolutional Neural Network (CNN), has demonstrated state-of-the-art results in many tasks. However, CNN-based recognition systems need large amounts of memory and computational power, which may take up to weeks on a modern multi-GPU server for large datasets such as ImageNet

Deng et al. (2009). Hence, they are often unsuitable for smaller devices like embedded electronics, and there is a pressing demand for techniques to optimize models with reduced model size, faster inference and lower power consumption.

Accordingly, a variety of literature has made attention on the reduction of model size through the use of quantization Lin et al. (2015); Courbariaux et al. (2015); Lin et al. (2016), low-rank matrix factorization Denton et al. (2014); Jaderberg et al. (2014), architecture pruning Han et al. (2015a, b), etc. Quantization is one of the simpler ways to reduce complexity of any model with less precision requirements for weights and activations as well as speeding up the computation.

The methods for quantizing gradients, weights and activations Lin et al. (2015); Courbariaux et al. (2015); Hubara et al. (2017); Zhou et al. (2016); Rastegari et al. (2016); Banner et al. (2018), have achieved much closer performance to full precision networks. Whereas, after the previous efficient quantized neural network (QNN) training, there still has floating point operators in model inference particularly due to the float-type numbers in Batch Normalization (BN).

Since (deep) networks are hard trained, the BN operator Ioffe and Szegedy (2015)

is usually used. Implementation of the conventional BN requires much computation of mean and variance, involving the sum of squares, square-root and reciprocal operations. These operators require float-type numbers for high precision. Previous attempts to use low precision networks do not use BN layers

Wu et al. (2018) or keep them in full precision Zhou et al. (2016). In addition, Hubara et al. (2017) proposed shift-based BN by replacing almost all multiplication with power-of-2 approximation and shift operations, and Banner et al. (2018) devised Range BN for the variance normalizing according to the range of input distribution. These modifications to BN for less computational costs are heuristic, confirmed only with experiments. Besides, modern hardware has less deployment cost with fixed-point style or perhaps some only support fixed-points Ignatov et al. (2018). Hence, previous QNN training methods with float-type BN layers fail to deploy on total fixed-number hardware.

In this paper we develop a direct way to handle the floating numbers in BN after obtaining a benchmark QNNs with previous training methods. We view conventional BN operator combined with several quantization operators used in feedforward step as an affine transformation with only two floating numbers. In order to eliminate the floating numbers in BN, we start by considering the exact substitution of these two floating numbers to as less as possible fixed numbers. We recommend to quantize floating numbers with shared quantized scale and mathematically show that all floating points are able to convert to fixed-point operators in an idealized case. In addition, we also demonstrate the lower bound and upper bound of the least possible quantized scale in our scheme. After given the proposed quantized scale, we give few search attempts to decide remaining integers. By the way, our methods used for model deployment also guarantee the precision in QNNs’ experiments, which can be seen as a supplement of modern QNN training.

Our main contribution is summarized as follows:

  • We propose a new method for quantizing BN in model deployment by converting the two floating points affine transformations to a fixed-point operation with shared quantized scale.

  • We give theoretical guarantee of our quantization method, including the existence of quantized scale and the magnitude of optimal quantized bit-width. In addition, we accurately search the least possible quantized scale according to quantized bits to support our transformation numerically.

  • We conduct our method on CIFAR and ImageNet datasets based on benchmark quantized neural networks, showing little performance degradation if exists.

  • Our scenario is restricted not only in BN, but other affine operators with floating numbers in similar quantization structure, such as several variants of BN: BN Hoffer et al. (2018), Instance Normalization Ulyanov et al. (2016) and Range BN Banner et al. (2018).

The remainder of our paper is organized as follows. In Section 2, we formulate the problem and present the motivation of our transformation. In Section 3, we give the equivalence of problem in Section 2 and show the upper bound and lower bound of satisfied solution through several properties. We then give a trivial search algorithm in Section 4, for finding accurate value of the solution and verifying previous results in Section 3. We also briefly discuss the precision loss in practical hardware implementation and validate numerically in common quantized neural networks in Section 5. Finally, we conclude our method in Section 6.

2 Problem Formulation

Quantization mainly focuses on compressing parameters in neural networks, such as weights in convolutional and fully-connected layers, activation outputs of hidden layers, and parameters in BN. A common design for QNN is uniform quantization Zhou et al. (2017); Krishnamoorthi (2018), which makes quantized values evenly distributed among possible values, that is,

where (the set of positive integers) is quantized scale, which measures the smallest gap between quantized values, and means the floor function. The input is restricted to predefined interval due to limited computing power. Particularly, for -bit uniform quantization, choose for friendly hardware support and simply choose for non-symmetric quantizer, for symmetric quantizer.

The weights which should be uniformly quantized may first be mapped into the required input value range through normalized operator Hubara et al. (2017)

or combined with hyperbolic tangent function Zhou et al. (2016)

Here the maximum is taken over all weights in the same layer.

Activations are quantized more difficultly than weights due to nondifferentiable optimization incurred by quantizer functions Cai et al. (2017); Vanhoucke et al. (2011). Several works Lin et al. (2016); Rastegari et al. (2016); Zhou et al. (2016)

have addressed this problem using bit-width allocation across layers or a continuous approximation to the quantizer function in the backpropagation step.

A common quantization approach for activations Mishra et al. (2017); Hubara et al. (2017); Zhou et al. (2016) quantizes output of the previous layer after clipping inputs into a limited interval, that is,

where , and are usually integers, and

is some quantizer function defined earlier. For common used ReLU function,

, or Howard et al. (2017); Sandler et al. (2018) (ReLU6) or some specific positive integers.

Since weights and activations are of the form with constant quantization bits in denominator, feed-forward calculation is able to execute with complete fixed-point operation without regard of the fixed denominator, which achieves hardware acceleration.

Benchmark models employ BN for deep neural network training. The conventional BN operator is two affine transformations of four parameters:

where and

are the mean and standard error estimators of output in the same layer and updated according to training batch statistics. Except conventional BN, several BN variants use similar transformation with different calculation method, such as

BN Hoffer et al. (2018) with norm of feature as , Range BN Banner et al. (2018) with maximum gap in feature as , Group Normalization Wu and He (2018) and Instance Normalization Ulyanov et al. (2016) with different part of feature for calculating and . Anyhow, all parameters are fixed during inference and deployment.

We emphasize that floating points come from the division operation in BN and its variants, since is the running mean of batch feature in the same layer, and are updated based on quantized gradients, but the reciprocal of fails.

Previous state-of-the-art networks such as ResNet He et al. (2016) and VGG Simonyan and Zisserman (2014)

employ BN after convolutional layer or fully-connected layer, followed by an activation function. Let

denote the set of integer numbers. Suppose we use and to quantize activations and weights. Then from quantizer function , we can see outputs of previous layers are of the type with and quantized weight with . Therefore, the outputs of convolutional or fully-connected layer are of the type , where is the number related to the calculation of outputs in the next layer, and is the bias term if exists and may also be quantized. We record this output of the type with . Here we do not identify specific corresponding position of parameters if no ambiguity.

After quantized convolutional or fully-connected layer is BN and quantized activation, then an element of the next layer outputs is


Here we exchange the order of and round function because and are integers in second equality. We also simplify two-affine operator of BN to one-affine operator and replace some terms as follows:

Generally, floating number operation will cause loss of precision, so this modification may lead to some error due to the limited computation precision. We will give experiments in subsequent section briefly.

Without loss of generality, we assume clipped from ReLU function. Other cases can be obtained by shifting the clip range by

Then and .

An empirical attempt is trying to make all operators only related to fixed-points, which means quantizing floating points and to some integers. A simple way is to let floating numbers and be integers respectively, which seems difficult to work because input variable may vary in a wide range. A weaker requirement is to quantize and similar to quantizer function and do, namely approximate and by rational numbers:

Since this consideration would add two more numbers and no hardware support in advance if either or is decided. Another way is to consider whether the surrogates of and can share same quantized scale. For example, same prefixed denominator for whole potential floating numbers and , which is consistent with previous scheme, because

where we set , , . Then and share same pre-fixed quantized scale .

Thus we want to replace the float numbers to integers for friendly hardware execution, where should not changed when and vary to support fixed hardware design.

Problem 1

Given , the problem is to find some (as small as possible), such that for all , there exist some , such that for all ,


In particular, we are able to examine whether can be directly converted to integers if already satisfies.

After having obtained , we also need to decide the remaining quantized number based on the given , which is the second problem we need to tackle.

Problem 2

Given , and a suitable which is one solution in Problem 1, the problem is to find , such that for all ,

Particularly, in the conventional -bit quantization setting, we would set for some small and give the least bits number of satisfied .

3 Theoretical Analysis

In this section, we analyze Problem 1 and Problem 2 with some propositions while leave detail in Appendix B.

Figure 1: Left: exact varies with with upper bound and lower bound showed in Propositions 2 & 3. Right: exact varies with in log scale, showing the quadratic growth of .

3.1 Analysis of Problem 1

For Problem 1, a direct way to deal with the randomness of is to specify all clipped intervals when varies.

We conclude a simple version of Problem 1 in Proposition 1 (see Appendix B.2 for proof).

Proposition 1

Problem 1 is equivalent to finding such a that for all , we can obtain , s.t.


Here means the ceiling function. We first analyze Eq. (3) without considering and leave the constraint in Appendix E, which have little influence in practice.

Set . According to Proposition 1, we only need to analyze the sequence (Here we replace by for notation simplicity). It seems hard to give analytic formula of , though we only want to have a rough understanding of the least possible for future hardware support. Let the minimal which satisfies all sequences of length be . First, we need roughly understand whether our transformation is reasonable. The following proposition guarantees the existence of and also gives the magnitude of w.r.t. .

Proposition 2

, exists. Moreover, if , then as long as , we can find satisfying the requirements. Hence, .

The key insight of the existence of is that while is large enough, then leave more choices to search for empirically on account of finite constraints in Eq. (3). The technique of proofs for the above proposition is to simply remove with conditions according to all possible sequences of and apply discreteness of integers. The proof is given in Appendix B.2.

As for lower bound of , we can take some specific sequences to see how large the possible is. The insight of choosing several sequences is making the maximum gap between the items of such sequence small, in order to quick check the existence of .

Proposition 3

If , we have . Furthermore, if , then .

Remark 1

Propositions 2 and 3 show that ; more precisely, except some small . As for practical employment, when quantizing activations with -bit, we need to choose at least -bit.

Remark 2

From Proposition 2, we confirm that all are the solution of Problem 1. However, we will see that such a is not always sequential when in Appendix D.

Besides, some special input range would obviously decrease the magnitude of , such as the case Appendix G.2 showed.

3.2 Analysis of Problem 2

Intuitively, given and leaving out floor function and clip function, we have

Hence, we can get . Because of , we obtain (or ) and (or ). However, after subsequent search for all possible , we find it is not always the nearest integer (see Proposition 8 and Appendix B.8). The gap of suitable and intuitive approximation depends on specific . By the way, the intuitive way to obtain is correct most of the time.

To establish given that is suitable based on the previous understanding, we offer a conservative way to obtain all appropriate as Proposition 4 and Algorithm 5 (see Appendix F) mentioned .

Proposition 4

Given and proper , a pair of satisfy Problem 2 if and only if there exists such a that the following conditions met:


where .

Moreover, we show that the intuitive way to obtain is partially right in Proposition 8, and give a loose bound of the possible ranges of and in Proposition 9 (see Appendix A). In addition, for general quantizer range instead of in Problem 1, we can apply similar analysis in Appendix G.1

3 (2-bit) 2 7 (3-bit) 9 15 (4-bit) 51 31 (5-bit) 289
63 (6-bit) 1459 127 (7-bit) 6499 255 (8-bit) 28323
Table 1: Searched with commonly used number for -bit quantization ().

4 Numerical Computation

In the previous section we have discussed theoretically. In this section, we present how to compute the accurate to evaluate the theoretical analysis.

For convenience, we are able to only consider and by Proposition 5. Then , , , for .

Proposition 5

Considering and is enough for finding .

For insurance purpose, we save all possible sequences as a naive idea, then to check every if the corresponding exist. We generate the sequences of recursively based on the following result and the scope of .

Proposition 6

Suppose is a satisfied sequence of length . Then is a satisfied sequence of length . In addition, if is a satisfied sequence of length , there must be a sequence of length and with the same previous terms.

Now we turn to find in an accurate way. On account of Proposition 3, we start search from when is larger than . We use brute force and follow proposition below to search which satisfies all possible sequences .

Proposition 7

Given a sequence , if


then there exist that can generate the sequence above.

The whole process is showed below:

  1. First, produce candidate sequences. Note that from Proposition 6, we can get all the recursively.

  2. Second, record all satisfied sequences . Given a candidate sequence , we use Proposition 7 to confirm there exist which can reproduce this sequence.

  3. Third, based on the obtained sequence , check whether the given is satisfied through Proposition 4.

Due to the large amount of the possible sequences , we suggest searching a small part of all sequences and running the window successively until all sequences are satisfied. The main pseudo-algorithm and the auxiliary pseudo-algorithms for search exact are all shown in Appendix F.

4.1 Experimental Results

We give a short Table 1 corresponding to specific bit quantized activations. More with other please see Appendix C. Due to numerical errors, we also reexamine a wide range of input in with large enough in order to ensure every possible input.

We also draw the growth trend of vs in Figures 1. Since saving all sequences is time-consuming and space-consuming, we only search in and . The magnitude depicted in the figures is consistent with Propositions 2 and 3, showing the quadratic growth of .

In practice, there is no necessary requirement of finding , since we only need some proper to convert the float-type parameters in BN as well as provide hardware friendly , so not exactly is the best.

5 Experiments

In this section we assess the effect of our BN quantization method through previous trained QNNs on ImageNet Russakovsky et al. (2015) (we also try on CIFAR-10 Krizhevsky et al. (2009) in Appendix H). Our method is applied to model deployment at layer level, which is fitted for various quantized neural networks using the conventional BN or its variants. Besides, we aim at verifying accuracy variation through our modification for friendly hardware support, rather than obtaining high accuracy as previous quantization training methods do.

There are two main concerns we should verify. First, the derivation in Eq. (1) absorbs the parameters of BN into the neighboring convolutional layer or fully-connected layer, which may have precision degradation. We refer to means absorbing BN parameters used in such layer though Eq. (1) based on quantized training models. Second, because the original theoretical analysis is based on real numbers rather than floating numbers, we should check whether the theoretical results match with the expression of floating numbers. Though numerical experiments have already examined in Section 4, we display the final accuracy to support our method again. We adopt which uses our fixed-points layerwise replacements in Problem 1 based on in the following experimental tables.

We use DoReFa-Net Zhou et al. (2016), one of efficient QNNs to train a quantized model for verification, and denote as the accuracy with DoReFa-Net quantization methods.

DoReFa-Net has low bitwidth weights and activations using low bitwidth parameter gradients to accelerate both in training and inference. In order to get more suitable baselines, we only choose quantizing weights and activations while using full-precision gradients in training, and we do not quantize the final fully-connected layer for better training performance. We adopt to represent -bit quantized weights and -bit quantized activations. We examine , and in subsequent experimental tables. We choose and for -bit and -bit quantized activations according to the discussion in Appendix D.

5.1 Quantization for ImageNet Classification

The next set of experiments study our method using VGG16 Simonyan and Zisserman (2014) and ResNet18 He et al. (2016) on ImageNet dataset. We adopt data augmentation including random cropping, horizontal flipping, lighting and color jittering with Adam optimizer using step-wise learning rate initialized from divided by on each epochs with total epochs. After we obtain the trained QNN, we convert final model into and mode we mentioned above. The main results are shown in Tables 2.

5.2 Accuracy and Complexity Analysis

From Table 2, using the QNN training method is able to get comparable performance when quantized even with and . In addition, once a QNN has trained, we absorb the only floating-points in BN by our attempts. From the experimental results, enjoys slight difference with the original quantized model , which means one affine transformation in Eq. (1) brings tiny disturbance. Moreover, we observe that our substitution in Problem 1 also introduces the same results between and . In principle, there should have diversity across and when encountering operators which excess the computer precision. We prefer to use our case due to entire fixed-point implementation and if have, slight performance degradation.

Additionally, practical application of QNN mainly uses small bits (up to 8-bits to our knowledge with ) for quantization. From Proposition 4 (or Proposition 9 in Appendix A), a single conversion in worse case is , but time is able to be saved if we search around intuitive way and , while we only need to convert model once with few BN parameters.

Bits 2W4A 4W4A 8W8A
Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
69.46 88.84 70.62 89.46 70.83 89.59
69.43 88.88 70.52 89.45 70.81 89.58
69.43 88.88 70.52 89.45 70.81 89.58
65.94 86.54 68.15 88.09 68.67 88.18
65.99 86.55 68.15 88.11 68.66 88.17
65.99 86.55 68.15 88.11 68.66 88.17
Table 2: Top-1 and Top-5 test accuracy of different quantized weight and activation bits with our transformation on ImageNet trained by VGG16 (top) and ResNet18 (bottom).

6 Conclusion

In this paper we have investigated the problem of quantizing floating points in BN, combined with quantized convolutional or fully-connected weights and activations. Noting that the conventional BN includes two affine transformations, we have made all floating points into one affine operator with only two floating-points. We have accordingly proposed to quantize each floating-point in the converted one-affine operator sharing a quantized scale. We have shown that possible quantized scale is roughly twice of activation bits in theory and given numerical computation schemes of the corresponding substitution. Our approach enjoys the errorless performance in inference using high precision. The strategy we recommended displays efficient model deployment with complete fixed-point operators.

It is worth emphasizing that our quantization scheme is suitable for other affine operators which are also common in deep NNs. Beyond BN as well as the NN context, we believe that our quantization scheme has potential applications in other problems that involve quantization.

Broader Impact

a) & b) If the proposed quantization method is verified to be useful in numerous real applications by the engineers in the future, it will produce good impacts on model compression. Hence, previous style of QNNs would pay attention to entire fixed-point effective QNN design with BN. c) The method we proposed leverages all possible outputs and shows convincing results in theory, though our method may fails when the scope of fixed-point doesn’t support the converted large range of that we seldom see this in practice. Besides, floating operation is likely to introduce precision error, which our method in first conversion step would encounter. d) Our method is data irrelevant.


  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • R. Banner, I. Hubara, E. Hoffer, and D. Soudry (2018) Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pp. 5145–5153. Cited by: 4th item, §1, §1, §2.
  • Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 5918–5926. Cited by: §2.
  • M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1, §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §1.
  • S. Han, H. Mao, and W. J. Dally (2015a) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015b) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §5.1.
  • G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine 29. Cited by: §1.
  • E. Hoffer, R. Banner, I. Golan, and D. Soudry (2018) Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pp. 2160–2170. Cited by: 4th item, §2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations.

    The Journal of Machine Learning Research

    18 (1), pp. 6869–6898.
    Cited by: §1, §1, §2, §2.
  • A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool (2018) Ai benchmark: running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix H.
  • R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §1, §2.
  • Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio (2015) Neural networks with few multiplications. arXiv preprint arXiv:1510.03009. Cited by: §1, §1.
  • A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr (2017) WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134. Cited by: §2.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1, §2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §5.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Appendix H, §2, §5.1.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: 4th item, §2.
  • V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. Cited by: §2.
  • S. Wu, G. Li, F. Chen, and L. Shi (2018) Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680. Cited by: §1.
  • Y. Wu and K. He (2018) Group normalization. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • S. Zhou, Y. Wang, H. Wen, Q. He, and Y. Zou (2017) Balanced quantization: an effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32 (4), pp. 667–682. Cited by: §2.
  • S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §1, §2, §2, §2, §5.

Appendix A Additional Propositions

Proposition 8

Given and proper , suppose which satisfy Eq. (2) exist. Then either or is candidate value, and either or satisfies the requirements. However, the combined from these candidate values maybe not satisfy Problem 2.

Proposition 9

Given and proper , suppose which satisfy Eq. (2) exist. Then the candidate values of and satisfy

Appendix B Proof of Propositions

The following proofs may need use property of ceiling function below.

Lemma 1 (The properties of ceiling function)

b.1 Proof of Proposition 1

Proof: Without loss of generality, we may assume , otherwise we can reverse . Then we can see as , and must have the same sign, so while we leave the case in Appendix E.

Then according to the property of floor function, the original problem is equivalent to the follow situations:


We can see ,

Hence .

b.2 Proof of Proposition 2

Proof: It is obvious that if , then .

When , we only need to consider the problem in Proposition 1, that is, ,

Set ,



Pay attention to , so that

which means

Analyzing the cases of and , we obtain


It follows from Lemma 1 in Appendix B that , ,


, ,


Therefore .
We make


Then always exists because the scope of is larger than one and includes at least one integer.

A naive idea is that when ,

When and , we have that

So if , we are able to find given any . Therefore, exists.

Before the remaining proof, we need lemma below.

Lemma 2

Suppose and , then and

where is the fractional part of .

Now we turn to the left part of Proposition 2.

Using Proposition 5, we only need consider . Let us take more precise analysis from Eq. (10). Suppose


When and ,

So we only need to check .

  1. If , .

  2. If , .

    Set . Use Proposition 5, then .

    • If , then ,   or , then .


      No solution!

    • If , then , or , then .


      No solution!


  3. If , .

    Use Lemma 1,



    Set , which lead to


    So we can see

    Next we will prove that satisfy the follow cases.

    Case1. If or , then

    Case2. If , then

    From Case1, we may assume . Meanwhile take Eq. (13) into the condition , we get , and

    Since , therefore



    • If , then . By Eq. (13),

      so , to Case1.

    • If or , by the previous fact that and Eq. (13), we can get the range of :

      We set


      The last inequality comes from .

      Reuse Eq. (15) and Eq. (16),

      • When , because from Eq. (15), , so , then


      • When ,

        If ,

        Then the range of is which less than , and the fraction of the left point of can’t be zero by Lemma 2. So , else , which means , to Case2.
        Then . As , so .

        Due to , hence

        Therefore , but this time

        No solution!
        Therefore , so , to Case2.

      • Similarly we can get same results when with

From previous discussion, we obtain

Hence, exists from Eq. (10) when and .

b.3 Proof of Proposition 3

Proof: Consider through special choices of when .

  • , , with sequence ,

  • ,