The quantized models are either directly derived from full-precision counterparts [35, 18] or obtained through regularized training [32, 4, 1]. Training-based methods better exploit domain information and normally achieve better performance. However, the quantized models still suffer from significant accuracy reduction. To alleviate this problem, instead of manually designing the number of bits in different layers, recent works resort to automated search methods [8, 42, 40, 37, 24]. Despite all the effort to improve quantized models, we notice that a fundamental problem of network quantization remains untouched, that is, whether accuracy reduction is due to the reduced capacity of quantized models, or due to an improper quantization procedure. If the performance is indeed affected by the quantization procedure, is there a way to correct it and boost the performance?
Figure 1: Comparison of approaches with ResNet-50 on the ImageNet dataset under different quantization levels. Left: quantization of both weights and activations. Right: weight-only quantization. The bit-width represents the equivalent computation cost for mixed-precision methods (AutoQB and HAQ).
In this paper, we first investigate the conditions for efficient training of a generic deep neural network by analyzing the variance of the gradient with respect to the effective weights involved in convolution or fully-connected operations. Here, effective weights denote the weights directly involved in convolution and other linear operations, which can be transformed weights in either quantized or unquantized training. With some semi-quantitative analysis (together with empirical verification) of the scale of logit values and gradient flow in the neural network during training, we discover that proper convergence of the network should follow two specific rules of efficient training. It is then demonstrated that deviation from these rules leads to improper training and accuracy reduction, regardless of the quantization level of weights or activations.
To deal with this problem, a simple yet effective technique named scale-adjusted training (SAT) is proposed, with which the rules of efficient training are maintained so that performance can be boosted. It is a general approach that can be integrated with most existing quantization-aware training algorithms. We showcase the effectiveness of the SAT approach on a recent quantization approach, PACT. On the other hand, we also discover a quantization error introduced in calculating the gradient with respect to the trainable clipping level in the PACT technique, which also degrades accuracy, especially for low-precision models. A scheme with calibrated gradients for PACT, named CG-PACT, is proposed to correct the inaccurate gradient. With the proposed SAT and CG-PACT, we achieve state-of-the-art performance of quantized neural networks on a wide range of models including MobileNet-V1/V2 and PreResNet-50 on the ImageNet dataset, with consistent improvement over previous methods [12, 4, 49, 55, 40, 24, 9] under the same efficiency constraints. As an example, Fig. 1 compares our method with some recent quantization techniques applied to ResNet-50 on the ImageNet classification task. We believe our analysis and approaches will shed light on some existing issues of network quantization, and facilitate more advanced algorithms and applications.
2 Related Work
Quantization of deep models has long been discussed since the early work on weight binarization [5, 6] and model compression. The majority of previous methods enforce the same precision for weights/activations in different layers during quantization. Early approaches focus on minimizing the difference in values or distributions [12, 49] between quantized weights/activations and full-precision ones. Recently, a learning-based quantization method has been proposed, where the quantizer is trained from data. A regularizer has also been proposed to implement binarized weights. Ensembles of multiple low-precision models have also been studied, demonstrating better performance than a single model under the same computation budget.
A recent method proposes a quantizer with a trainable step size, and improves training convergence by balancing the magnitude of step-size updates with weight updates, based on some heuristic analysis. However, this method focuses on training the step size and scaling the gradients, instead of analyzing the impact of the model weights themselves on the training dynamics. Previous work has paid little attention to the impact of the quantization procedure on training dynamics, and it remains unclear whether previous algorithms train efficiently during the quantization procedure.
Recent work attempts to use mixed precision in one model, where weights and activations in different layers are assigned different bit-widths, resulting in better trade-offs between efficiency and accuracy of neural networks. Towards this end, automated algorithms are adopted to determine the most appropriate bit-width for each layer. Reinforcement learning is adopted to search bit-width configurations with guidance from memory and computation cost [8, 24] or latency and energy produced by hardware simulators. Differentiable neural architecture search methods have also been applied to efficiently explore the search space. Although these methods result in more flexible quantization architectures, the training procedure of quantized models still follows the paradigms of uniform-precision quantization.
Efficient Training. The training dynamics of neural networks have been extensively studied, both on the impact of initialization schemes and on convergence conditions with mean field theory. Early work mainly focuses on networks with simple structure, such as those with only fully-connected layers. More advanced topics including residual networks and batch normalization are discussed with more powerful tools [45, 46], facilitating efficient training of CNNs with 10k layers through orthogonal initialization of kernels. From a different point of view, a properly rescaled initialization scheme has been proposed to replace normalization layers, which also enables training of residual networks with a huge number of layers. These approaches mainly focus on weight initialization methods to facilitate efficient training. We extend this analysis to the whole training process, and discover some critical rules for efficient training. These rules guide our design of an improved algorithm for network quantization.
3 Efficient Training of Network Quantization
3.1 Basic Quantization Strategy
Historically, quantization of neural networks follows different conventions and settings. Here we describe the convention adopted in this paper to avoid unnecessary ambiguity. We first train the full-precision model, which is used as a baseline for the final comparison. For quantized models, we use the pretrained full-precision model as initialization, and apply the same training hyperparameters and settings as for the full-precision model (including initial learning rate, learning rate scheduler, weight decay, number of epochs, optimizer, batch size, etc.) to finetune the quantized model. For the input image to the model, we use unsigned 8-bit integers (uint8), which means we do not apply standardization (neither demeaning nor normalization). Previous work sometimes avoids quantizing the first and last layers due to accuracy drop [52, 42]. We follow a more practical setting and quantize the weights in both layers with a minimum precision of 8 bits in our main results. To investigate the effect of the quantization level in these two layers, some additional results are shown in S6. The input to the last layer is quantized with the same precision as other layers. As a widely adopted convention [4, 40], the bias in the last fully-connected layer(s) and the batch normalization (BN) layers (including weight, bias and the running statistics) are not quantized, but as shown in S7, the batch normalization layers of most networks can be eliminated except for a few cases. Note that no bias term is used in convolution layers.
3.2 Rules of Efficient Training
In this section, we discuss a simplified network composed of repeated blocks, each consisting of a linear layer (either convolution or fully-connected) without bias, a BN layer, an activation function and a pooling layer. The last layer is a single fully-connected layer without bias, which produces the logit values for the loss function.
Suppose for the $l$-th block, the input is $a^{(l)}$, and the effective weight of the linear layer is $W_e^{(l)}$, which is called the effective weight because it is the de facto weight involved in forward and backward propagation of the network. In other words, we have
$$z^{(l)} = W_e^{(l)} a^{(l)}, \tag{1a}$$
$$\tilde{z}^{(l)} = \gamma\,\frac{z^{(l)} - \mu}{\sigma} + \beta, \tag{1b}$$
$$a^{(l+1)}_j = \frac{1}{k_p^2}\sum_{i \in \mathcal{P}_j} f\big(\tilde{z}^{(l)}_i\big), \tag{1c}$$
where $l = 1, \dots, L$, $L$ is the depth of the network, $\mu$ and $\sigma$ stand for the running statistics, and $\gamma$ and $\beta$ denote the trainable weight and bias in the BN layer, respectively. $f(\cdot)$ is the activation function, which is assumed to be quasi-linear, such as the ReLU. $k_p$ is the kernel size of the pooling layer, which can be set to $1$ for the usual case with no pooling, and $\mathcal{P}_j$ is the set of pooling indices. Note that for the last layer $l = L$, only Eq. (1a) applies, and $z^{(L)}$ are the logits for calculating the cross entropy loss.
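As an illustration, when a linear layer is followed by BN, its effective weight can be obtained by folding the BN scale into the weight matrix. Below is a minimal NumPy sketch of this folding; the function name and the per-output-channel layout are our own assumptions, not notation from the paper:

```python
import numpy as np

def effective_weight(w, gamma, running_var, eps=1e-5):
    # Fold the BN scale into the preceding linear layer's weight matrix:
    # the de facto weight seen in forward/backward propagation is
    # gamma_i / sqrt(var_i + eps) * W[i, :] for each output channel i.
    scale = gamma / np.sqrt(running_var + eps)
    return w * scale[:, None]
```

The BN bias only shifts the output and therefore does not enter the effective weight.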
Optimization of the cross entropy loss is highly related to the scale of the weights in the last fully-connected layer. Suppose the input to this layer is normalized with a BN layer before pooling. To avoid the saturation issue when calculating gradients, we have the following rule for efficient optimization of the cross entropy loss.
Efficient Training Rule I (ETR I) To prevent the logits from entering the saturation region of the cross entropy loss, the scale (measured by the variance) of the effective weight in the last fully-connected layer should be sufficiently small. Specifically, we require
$$\mathrm{Var}\big[W_e^{(L)}\big] \lesssim \frac{1}{n\,k_p^2}, \tag{2}$$
where $n$ is the length of the input features of the fully-connected layer, and $k_p$ is the kernel size of the preceding pooling layer ($k_p = 1$ if the preceding layer is not pooling). Empirically, we find a ratio well below one between the two sides is adequate.
A detailed derivation of this rule is given in S1. Besides optimization of the cross entropy loss, the training dynamics are also critical for efficient training of deep neural networks. To analyze the training dynamics, we derive the following phenomenological law for gradient flow during training.
The Gradient Flowing Law For the aforementioned network, when a BN layer follows the linear layer, the variances of the gradients of the loss with respect to the effective weights in two adjacent layers are related by
$$\mathrm{Var}\!\left[\frac{\partial \mathcal{L}}{\partial W_e^{(l)}}\right] \approx \alpha\,\frac{n_{\mathrm{out}}^{(l+1)}\,\mathrm{Var}\big[W_e^{(l+1)}\big]}{n_{\mathrm{in}}^{(l)}\,\mathrm{Var}\big[W_e^{(l)}\big]}\;\mathrm{Var}\!\left[\frac{\partial \mathcal{L}}{\partial W_e^{(l+1)}}\right]. \tag{3}$$
On the other hand, if there is no batch normalization layer directly following the linear layer, we have
$$\mathrm{Var}\!\left[\frac{\partial \mathcal{L}}{\partial W_e^{(l)}}\right] \approx \beta\;n_{\mathrm{out}}^{(l+1)}\,\mathrm{Var}\big[W_e^{(l+1)}\big]\;\mathrm{Var}\!\left[\frac{\partial \mathcal{L}}{\partial W_e^{(l+1)}}\right]. \tag{4}$$
Here, both $\alpha$ and $\beta$ are empirical parameters of order one with respect to the layer widths. $\mathcal{L}$ is the loss function, and $n_{\mathrm{in}}^{(l)}$ and $n_{\mathrm{out}}^{(l)}$ represent the number of input and output neurons of the $l$-th layer, respectively. $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ denote the numbers of input and output channels, and $k$ is the kernel size of this layer (which can be set to $1$ for a fully-connected layer), so that $n_{\mathrm{in}} = c_{\mathrm{in}} k^2$ and $n_{\mathrm{out}} = c_{\mathrm{out}} k^2$.
Note that similar results have been studied for network initialization and extensively studied with mean field theory [33, 45, 43]. Here, we generalize the randomness assumption and empirically verify that the conclusion holds along the whole training procedure (see S1). Note that viewing weights and inputs as random variables along the whole training procedure has also been adopted in previous literature on statistical mechanics approaches to analyzing the training dynamics of neural networks [34, 41, 25].
To avoid gradient exploding/vanishing problems, the magnitudes of gradients should be of the same order when propagating from one layer to another. Suppose there is no pooling operation (since the number of pooling layers is much smaller than the number of linear layers, their effect can be ignored for simplicity). If we have BN after the linear layer, then as long as the variances of the weights in different layers are kept of the same order, the scaling factor in Eq. (3) will be of order one. Note that we have implicitly assumed that $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ do not differ in order. On the other hand, if there is no BN layer directly following the linear layer, the scale of the weights should be on the order of the reciprocal of the number of neurons to keep the scaling factor in Eq. (4) of order one. Thus, we arrive at the basic rule for convergent training.
Efficient Training Rule II (ETR II) To keep the gradients of the weights at the same scale across the whole network, either BN layers should be used after linear layers such as convolution and fully-connected layers, or the variance of the effective weights should be on the order of the reciprocal of the number of neurons of the linear layer ($n_{\mathrm{in}}$ or $n_{\mathrm{out}}$).
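To make ETR II concrete, the following toy simulation (our own construction, not an experiment from the paper) propagates a random gradient backward through a stack of square linear layers without BN and tracks its variance: with weight variance $1/n$ the gradient scale stays of order one, while a constant factor above that grows exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 12

def grad_variance_after(depth, weight_var):
    g = rng.normal(size=n)                    # gradient entering the last layer
    for _ in range(depth):
        w = rng.normal(scale=np.sqrt(weight_var), size=(n, n))
        g = w.T @ g                           # backward pass through one linear layer
    return g.var()

stable = grad_variance_after(depth, 1.0 / n)      # ETR II respected: stays O(1)
exploding = grad_variance_after(depth, 4.0 / n)   # grows roughly like 4**depth
```

Each backward step multiplies the gradient variance by roughly $n\,\mathrm{Var}[W]$, which is exactly the scaling factor in Eq. (4).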
A detailed derivation of the gradient flowing law and ETR II is given in S1.
With these rules, we are now ready to examine whether the training of quantized models is efficient. Following previous work on PACT, we use the DoReFa scheme for weight quantization, and the PACT technique for activation quantization. These methods are popular and typical for model quantization, and it is noteworthy that they are adopted only to showcase the effectiveness of our approach, rather than to limit our analysis. Violations of the training rules also exist in other quantization approaches such as BinaryNet, XNOR-Net, Ternary Quantization and HWGQ, as well as in the differentiable quantization approach Darts Quant, as weight distributions are changed during the quantization/binarization procedure, which results in potential violations of the efficient training rules. Thus we believe our approach is general and can be used to boost the performance of such quantization algorithms as well.
3.3 Weight Quantization with SAT
The DoReFa scheme involves two steps, clamping and quantization. Clamping transforms the weights to values between 0 and 1, while quantization rounds the scaled weights to integers. We here analyze the impact of both steps on the training dynamics.
3.3.1 Impact of Clamping
Before quantization, the weights are first clamped to the interval between 0 and 1. For a weight matrix $W$, we first clamp it to
$$\tilde{W} = \frac{\tanh(W)}{2\max\big(|\tanh(W)|\big)} + \frac{1}{2}, \tag{5}$$
which is between 0 and 1. This transformation generally contracts the scale of large weights, and enlarges the differences among small-scale elements. Thus, the clamping operation makes the variables distributed more uniformly in the interval $[0, 1]$, which is beneficial for reducing quantization error. However, the variance of the clamped weights will differ from that of the original ones, potentially violating the efficient training rules. Specifically, clamping leads to violation of ETR I in the last linear layer, and violation of ETR II in linear layers not followed by BN layers. For example, clamping in the full pre-activation ResNet and VGGNet will violate ETR II, since both have linear layers not followed by BN. In cases where convolution layers are followed by BN layers, the clamped weights are commensurate with each other and ETR II is preserved.
To understand the effect of clamping on ETR I, we first analyze a model using clamped weights without quantization, following the DoReFa scheme:
$$W_e = 2\tilde{W} - 1. \tag{6}$$
We have the effective weight $W_e$ in this case. Fig. 1(a) gives the ratio between the variances of the clamped and the original weights with respect to the number of neurons. As a common practice, the original weights are sampled from a Gaussian distribution with zero mean and variance proportional to the reciprocal of the number of neurons. We find that for large neuron numbers, the variance of the weights can be enlarged to tens of times their original value, increasing the chance of violating ETR I. Since the number of output neurons of the last linear layer is determined by the number of classes in the data, we expect a large dataset such as ImageNet to be more vulnerable to this problem than a small dataset such as CIFAR10 (see S1 for more details).
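This variance inflation is easy to reproduce numerically. In the sketch below (our own check, using the DoReFa-style clamp), weights drawn with variance $1/n$ end up with an effective variance tens of times larger after clamping:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                       # number of weights in the layer
w = rng.normal(scale=np.sqrt(1.0 / n), size=n)

# DoReFa-style clamp to [0, 1], then map the effective weight back to [-1, 1]
w_clamped = np.tanh(w) / (2 * np.max(np.abs(np.tanh(w)))) + 0.5
w_eff = 2 * w_clamped - 1

ratio = w_eff.var() / w.var()                  # variance inflation factor
```

For small weights, $\tanh$ is nearly linear, so the inflation factor is roughly $n / (2\ln n)$, which grows with layer width.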
To verify our analysis, we train MobileNet-V2 on ImageNet with the traditional method as well as with clamped weights, and compare their learning curves. As shown in Fig. 1(b), clamping impairs the training procedure significantly, reducing the final accuracy by as much as 1%. Also, we notice that clamping makes the model more prone to over-fitting, which is consistent with previous literature claiming that increasing the weight variance in neural networks might worsen their generalization properties [34, 41, 25].
To deal with this problem, we propose a method named scale-adjusted training (SAT) to restore the variance of the effective weights. We directly multiply the normalized weight by the square root of the reciprocal of the number of neurons in the linear layer:
$$W^{*} = \frac{W}{\sqrt{n\,\widehat{\mathrm{Var}}[W]}}. \tag{7}$$
Here $\widehat{\mathrm{Var}}[W]$ is the sample variance of the elements in the weight matrix, calculated by averaging the squares of the elements of the weight matrix. In back-propagation, $\widehat{\mathrm{Var}}[W]$ is viewed as a constant and receives no gradient. This simple strategy is named constant rescaling and works well empirically across all of our experiments. Note that here we have ignored the difference between weight variances across channels and simply use the variance of the weights in the whole layer for simplicity.
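A minimal NumPy sketch of constant rescaling (our own rendering; in an actual training loop the variance term would be detached from the gradient):

```python
import numpy as np

def constant_rescale(w):
    # SAT constant rescaling: W* = W / sqrt(n * Var_hat[W]), where Var_hat is
    # the sample variance (mean of squared elements, zero-mean assumption).
    # In back-propagation, Var_hat is treated as a constant (no gradient).
    n = w.shape[1]                 # number of input neurons of the layer
    var_hat = np.mean(w ** 2)
    return w / np.sqrt(n * var_hat)
```

After rescaling, the variance of the effective weight is exactly $1/n$, matching the scale called for by the efficient training rules.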
Fig. 1(b) compares the learning curves of the vanilla method, and weight clamping with and without constant rescaling. It shows that SAT recovers efficient training and improves validation accuracy significantly after weight clamping. We also experiment with an alternative rescaling approach in S3 and observe similar performance. In the following experiments we always use constant rescaling. For MobileNet-V2, we only need to apply SAT to the last fully-connected layer. For other models where convolution is not directly followed by BN, such as the full pre-activation ResNet, we need to apply SAT to all such convolution layers (see S6 for more details). Note that the proposed method is not meant to be exclusive, and there are other ways to preserve the rules of efficient training. For example, different learning rates can be used for the layers with no BN layer following. Before further discussion, we want to emphasize that clamping is only a preprocessing step for quantization, and no quantization operation has been involved up to now.
3.3.2 Impact of Weight Quantization
With weights clamped to $[0, 1]$, the DoReFa scheme further quantizes the weights with the following function:
$$q_b(x) = \frac{1}{a}\,\big\lfloor a x \big\rceil, \qquad a = 2^b - 1. \tag{8}$$
Here, $\lfloor \cdot \rceil$ indicates rounding to the nearest integer, and $a$ equals $2^b - 1$, where $b$ is the number of quantization bits. The quantized weights are given by
$$W_q = 2\,q_b\big(\tilde{W}\big) - 1. \tag{9}$$
In this case, we have the effective weight given as $W_e = W_q$. To see the impact of quantization on the training dynamics, we compare the variance of the quantized weight $W_q$ with the variance of the full-precision clamped weight in S2. We find that for precisions higher than 3 bits, the quantized weights have nearly the same variance as the full-precision weights, indicating that quantization itself has little impact on training. However, the discrepancy increases significantly for low precisions such as 1 or 2 bits, so we should use the variance of the quantized weights for standardization, rather than that of the clamped weight $\tilde{W}$. For simplicity, we apply constant rescaling to the quantized weights of linear layers without BN by
$$W_q^{*} = \frac{W_q}{\sqrt{n\,\widehat{\mathrm{Var}}[W_q]}}. \tag{10}$$
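Putting the clamping and rounding steps together, a forward-only sketch of the DoReFa weight quantizer (our own NumPy rendering; actual training would additionally use the straight-through estimator for the rounding step):

```python
import numpy as np

def dorefa_weight_quant(w, bits):
    # Step 1: clamp to [0, 1] via tanh and rescaling by the max magnitude.
    w_tilde = np.tanh(w) / (2 * np.max(np.abs(np.tanh(w)))) + 0.5
    # Step 2: round to a = 2**bits - 1 uniform levels in [0, 1].
    a = 2 ** bits - 1
    w_q = np.round(a * w_tilde) / a
    # Map back to the symmetric range [-1, 1].
    return 2 * w_q - 1
```

A `bits`-bit quantizer produces at most $2^b$ distinct weight values in $[-1, 1]$.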
For typical models such as MobileNets and ResNets, only the last fully-connected layer needs to be rescaled, and such rescaling is only necessary during training for better convergence. For inference, the scaling factor (which is positive) can be discarded, with the bias term modified accordingly, introducing no additional operations. For models with several fully-connected layers, such as VGGNet, or with convolution layers not followed by BN layers, such as the full pre-activation ResNet, the scaling factors for these layers can be applied after the computation-intensive convolutions or matrix multiplications, adding marginal computation cost.
3.4 Activation Quantization with CG-PACT
For activation quantization, a method called parameterized clipping activation (PACT) has been proposed, where a trainable clipping level $\alpha$ is introduced. An activation value $x$ is first clipped to the interval $[0, \alpha]$ with the hard clipping function, then scaled, quantized and rescaled to produce the quantized value as
$$y_q = \frac{\alpha}{a}\,\Big\lfloor \mathrm{clip}(x, 0, \alpha)\,\frac{a}{\alpha} \Big\rceil, \qquad a = 2^b - 1. \tag{11}$$
The quantized value obtained this way is called a dynamic fixed-point number, because $\alpha$ is a floating-point number.
Since $\alpha$ is trainable, we need to calculate the gradient with respect to it besides the activation itself. For this purpose, the straight-through estimation (STE) method is usually adopted for backward propagation, which gives
$$\frac{\partial y_q}{\partial x} = \begin{cases} 1, & 0 \le x < \alpha, \\ 0, & \text{otherwise}. \end{cases} \tag{12}$$
As the STE treats the rounding operation as an identity map, the domain of definition for $y_q$ is also restricted to $[0, \alpha]$. The chain rule gives (see S4 for more details)
$$\frac{\partial y_q}{\partial \alpha} = \begin{cases} \dfrac{y_q - x}{\alpha}, & 0 \le x < \alpha, \\[4pt] 1, & x \ge \alpha, \\[2pt] 0, & x < 0. \end{cases} \tag{13}$$
Note that we give a different result for the case of $0 \le x < \alpha$ in Eq. (13) compared to the original PACT, where the quantization error is ignored and the derivative of $y_q$ is approximated by that of $\mathrm{clip}(x, 0, \alpha)$, i.e., $\partial y_q / \partial \alpha = 0$ for $0 \le x < \alpha$. Such quantization error does not introduce a significant difference for high precisions, but the error might be harmful for low precisions such as 4 bits or lower. Notably, some concurrent works present similar results. However, they neither present a detailed derivation, nor illustrate the impact of such quantization error on the learning curves.
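The sketch below (our own NumPy rendering of this reading of Eq. (13); the function names are ours) contrasts the calibrated gradient with the original PACT approximation:

```python
import numpy as np

def pact_quantize(x, alpha, bits):
    # PACT forward pass: clip to [0, alpha], scale, round, rescale.
    a = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, alpha) * a / alpha) * alpha / a

def grad_alpha(x, alpha, bits, calibrated=True):
    # d(y_q)/d(alpha): 1 in the clipped region (x >= alpha).  CG-PACT keeps
    # the quantization-error term (y_q - x)/alpha on [0, alpha), which the
    # original PACT approximates as 0.
    y_q = pact_quantize(x, alpha, bits)
    inner = (y_q - np.clip(x, 0.0, alpha)) / alpha if calibrated else 0.0
    return np.where(x >= alpha, 1.0, inner)
```

At high precision the two gradients nearly coincide, since the quantization error vanishes; at 2–4 bits the correction term is no longer negligible.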
To see the effect of this quantization error, we train MobileNet-V2 on ImageNet with low precisions of 2 and 4 bits, with the gradient either calibrated as in Eq. (13) or left as in the original PACT. The learning curves are illustrated in Fig. 3, which shows that calibrating the PACT gradient facilitates efficient training of low-precision models. This gradient calibration technique is named CG-PACT for brevity.
Both Quantization:

| Quant. Method | Bit-widths | Acc.-1 | Acc.-5 |
| --- | --- | --- | --- |
| SAT (Ours) | 2 bits | 74.1 | 91.7 |
| SAT (Ours) | 3 bits | 76.6 | 93.1 |
| SAT (Ours) | 4 bits | 76.9 | 93.3 |
| SAT (Ours) | FP | 75.9 | 92.5 |

Weight-Only Quantization:

| Quant. Method | Weights | Acc.-1 | Acc.-5 |
| --- | --- | --- | --- |
| SAT (Ours) | 2 bits | 75.3 | 92.4 |
| SAT (Ours) | 3 bits | 76.3 | 93.0 |
| SAT (Ours) | 4 bits | 76.4 | 93.0 |
| SAT (Ours) | FP | 75.9 | 92.5 |
PACT and SAT use the full pre-activation ResNet, LSQ and HAQ use the vanilla ResNet, and LQNet uses the vanilla ResNet without a convolution operation in the shortcut (type-A shortcut).
PACT and LQNet use full precision for the first and last layers, LSQ and SAT use 8 bits for both layers, and HAQ uses 8 bits for the first layer.
4 Experiments

Based on the previous analysis, we apply the SAT and CG-PACT techniques to popular models including MobileNet-V1, MobileNet-V2 and PreResNet-50 on ImageNet. For all experiments, we use a cosine learning rate scheduler without restart. The learning rate is initially set to 0.05 and updated every iteration, for 150 epochs in total. We use the SGD optimizer with Nesterov momentum (momentum weight 0.9, no damping) and weight decay. The batch size is set to 2048 for MobileNet-V1/V2 and 1024 for PreResNet-50, and we adopt the warmup strategy of linearly increasing the learning rate every iteration to a larger value for the first five epochs before using the cosine annealing scheduler. The input image is randomly cropped and randomly flipped horizontally, and is kept as 8-bit unsigned integers with no standardization applied. Note that we use full-precision models with clamped weights as initial points to finetune the quantized models.
We compare our method with techniques from the recent literature, including uniform-precision quantization algorithms such as Deep Compression, PACT, LQNet and LSQ, and mixed-precision approaches such as HAQ and AutoQB. Validation accuracy with respect to quantization level for ResNet-50 is plotted in Fig. 1. Our method gives significant and consistent improvements over previous methods under the same resource constraints. More thorough comparisons for quantization of MobileNets with and without quantized activations are given in Tables 1 and 2, respectively. Table 3 compares different quantization techniques on ResNet-50. Surprisingly, the models quantized with our approach not only outperform all previous methods, including mixed-precision algorithms, but even outperform the full-precision ones when the quantization is moderate (at least 5 bits for both-quantization and 4 bits for weight-only quantization on MobileNet-V1/V2, and 3 bits for either setting on ResNet-50).
Our results reveal that previous quantization techniques indeed suffer from inefficient training. With proper adjustments to abide by the efficient training rules, the quantized models achieve comparable or even better performance than their full-precision counterparts. In this light, we have to rethink the doctrine in the model quantization literature that quantization itself hampers the capacity of the model. It seems that with mild quantization, the resulting models do not sacrifice capacity, but benefit from the quantization procedure. The clamping and rescaling techniques do not contribute to the gain of the quantized models, since they are already used in full-precision training. One potential reason is that quantization acts as a favorable regularization during training and helps the model generalize better. Note that quantizing both weights and activations gives better results than weight-only quantization in some cases (3 and 4 bits on ResNet-50), which also indicates that quantization of the activations acts as a proper regularization and improves the generalization capability of the network. The underlying mechanism is not clear yet, and we leave in-depth exploration to future work.
Ablation Study To see the different impacts of the two methods we propose (SAT and CG-PACT), we experiment on MobileNet-V1 under two different quantization levels, 4 bits and 8 bits, with combinations of the two approaches. As shown in Table 4, both techniques contribute to the performance, and SAT is more critical than CG-PACT in both settings. Under mild quantization such as 8 bits, SAT by itself is sufficient to achieve high performance, which matches our previous analysis.
5 Conclusion

This paper studies efficient training of quantized neural networks. By analyzing the optimization of the cross entropy loss and the training dynamics of a neural network, we present a law describing the gradient flow during training, supported by semi-quantitative analysis and empirical verification. Based on this, we suggest two rules of efficient training, and propose a scale-adjusted training technique to facilitate efficient training of quantized neural networks. Our method yields state-of-the-art performance for quantized neural networks. The efficient training rules not only apply to the quantization scenario, but can also be used as an inspection tool in other deep learning tasks.
-  (2018) Proxquant: quantized neural networks via proximal operators. arXiv preprint arXiv:1810.00861. Cited by: §1, §2.
-  (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.4.
Deep learning with low precision by half-wave gaussian quantization.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926. Cited by: §3.2.
-  (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §1, §1, §3.1, §3.2, §3.4, §3.4, §3.4, §4.
-  (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1, §2, §3.2.
-  (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830. Cited by: §1, §2.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.3.1.
-  (2018) ReLeQ: an automatic reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704. Cited by: §1, §2.
-  (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. Cited by: §1, §2, §3.4, §4.
-  (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. arXiv preprint arXiv:1908.05033. Cited by: §1.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §1, §2, §4.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §S1.2, §S1, §2, §3.2, §3.2, §3.3.1.
-  (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.3.1, §3.3.1.
-  (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1.
-  (2019) Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware. arXiv preprint arXiv:1903.08066. Cited by: §3.4.
-  (2017) Computer vision for autonomous vehicles: problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519. Cited by: §1.
-  (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1, §3.1.
Extremely low bit neural network: squeeze the last bit out with admm.
Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
-  (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §1.
-  (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
-  (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1.
Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.
-  (2019) AutoQB: automl for network quantization and binarization on mobile devices. arXiv preprint arXiv:1902.05690. Cited by: §1, §1, §2, §4.
-  (2017) Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553. Cited by: §3.2, §3.3.1.
-  (2017) Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462. Cited by: §1, §3.4.
-  (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852. Cited by: §1.
-  (2017) WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134. Cited by: §1.
-  (2019) Speech recognition using deep neural networks: a systematic review. IEEE Access 7, pp. 19143–19165. Cited by: §1.
-  (2017) Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5456–5464. Cited by: §1.
-  (2016) Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pp. 3360–3368. Cited by: §2.
-  (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1, §1, §2, §3.2.
-  (2016) Deep information propagation. arXiv preprint arXiv:1611.01232. Cited by: §S1, §2, §3.2, §3.2.
-  (1992) Statistical mechanics of learning from examples. Physical review A 45 (8), pp. 6056. Cited by: §S1.3, §3.2, §3.3.1.
A quantization-friendly separable convolution for mobilenets.
2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pp. 14–18. Cited by: §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.1, §3.3.2.
-  (2019) Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452. Cited by: §1, §2, §3.2.
-  (1992) Stochastic processes in physics and chemistry. Elsevier. Cited by: §S1.3, §S1.3.
-  (2018) Deep learning for computer vision: a brief review. Computational intelligence and neuroscience. Cited by: §1.
-  (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §1, §1, §2, §3.1, §4.
-  (1993) The statistical mechanics of learning a rule. Reviews of Modern Physics 65 (2), pp. 499. Cited by: §3.2, §3.3.1.
-  (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090. Cited by: §1, §2, §3.1.
-  (2018) Dynamical isometry and a mean field theory of cnns: how to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393. Cited by: §S1, §2, §3.2.
-  (2018) Alternating multi-bit quantization for recurrent neural networks. arXiv preprint arXiv:1802.00150. Cited by: §1.
-  (2017) Mean field residual networks: on the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114. Cited by: §S1, §2, §3.2.
-  (2019) A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129. Cited by: §2.
-  (2018-08) Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine 13, pp. 55–75. Cited by: §1.
-  (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: §1.
-  (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §1, §1, §2, §4.
-  (2019) Fixup initialization: residual learning without normalization. arXiv preprint arXiv:1901.09321. Cited by: §2.
-  (2017) Balanced quantization: an effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32 (4), pp. 667–682. Cited by: §1.
-  (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §3.1, §3.2, §3.3.2, §3.3.
-  (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §1, §3.2.
-  (2019) Binary ensemble neural network: more bits per network or more networks per bit?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4923–4932. Cited by: §2.
-  (2019) Effective training of convolutional neural networks with low-bitwidth weights and activations. arXiv preprint arXiv:1908.04680. Cited by: §1, §1.
S1 Semi-Quantitative Analysis and Empirical Verification of Logit Scale and Gradient Flow
In this section, we first make some semi-quantitative analysis of the scale of logit values and the gradient flow to derive the laws and rules introduced in the paper. After that, we present experimental results to verify the laws, together with some comments on our basic assumption. We want to emphasize that our analysis is only semi-quantitative, intended to give insight for practice, and is not guaranteed to be rigorous; a complete and detailed analysis is far beyond the scope of this paper. Accordingly, we focus on the order of magnitude of quantities rather than their accurate values, so we assume all variables are random unless specifically mentioned, and that variables in the same tensor (weight matrix or activation matrix in a layer) are identically distributed. We also widely adopt the assumption of independence between different random variables (with different names, layer indices or component indices), as claimed explicitly in Axiom 3.2 of . This assumption is also adopted in other literature, such as [13, 33]. It is not rigorously correct, but it leads to useful conclusions in practice. Nevertheless, we give some further comments on this assumption at the end of this section.
s1.1 Semi-Quantitative Analysis
As mentioned in the paper, we discuss a simplified network composed of repeated blocks, each consisting of a linear (either convolution or fully-connected) layer without bias, a BN layer, the activation function and a pooling layer. The final stage is a single fully-connected layer without bias, which produces the logit values for the loss function. The pooling operation can be either average pooling or max pooling, but we only analyze the case of average pooling, and leave the analysis of max pooling as future work. We will first analyze the scale of logit values produced by the last fully-connected layer, and then discuss the training dynamics of the network.
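The simplified block structure described above can be sketched in a few lines of numpy. This is purely illustrative: the shapes, the ReLU choice of activation, and the pooling width are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

# Minimal numpy sketch of the simplified block analyzed in this section:
# bias-free linear layer -> batch normalization -> activation -> average
# pooling, followed by a final bias-free fully-connected layer producing
# the logits. All sizes here are illustrative assumptions.
rng = np.random.default_rng(6)

def block(x, w, gamma, beta, eps=1e-5):
    y = x @ w                                        # linear layer, no bias
    y = (y - y.mean(0)) / np.sqrt(y.var(0) + eps)    # batch normalization
    y = gamma * y + beta
    y = np.maximum(y, 0.0)                           # ReLU activation (quasi-linear)
    return y.reshape(y.shape[0], -1, 2).mean(-1)     # average pooling, width 2

x = rng.normal(size=(32, 64))                        # batch of 32, 64 features
w1 = rng.normal(0.0, np.sqrt(2.0 / 64), size=(64, 64))
h = block(x, w1, gamma=1.0, beta=0.0)                # -> shape (32, 32)

w_fc = rng.normal(0.0, np.sqrt(1.0 / 32), size=(32, 10))
logits = h @ w_fc                                    # final bias-free FC layer
print(h.shape, logits.shape)
```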
s1.1.1 Analysis of the Logit Scale
The logit values produced by the final fully-connected layer are
where is the depth of the network, is the effective weight of this layer, and is its input given by the previous block. The BN, nonlinear activation and pooling operations can be represented by
From this, we have
where is the length of the input features of this last fully-connected layer. For a wide range of models, including ResNets and MobileNets, it equals the number of output channels of the last convolution layer. Note that we have extensively used the assumption of independence, and relied on the fact that the activation function is quasi-linear.
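The independence-based variance relation underlying this step can be checked numerically. The sketch below verifies that for a bias-free linear layer with i.i.d. zero-mean weights and inputs, the logit variance is the product of the feature length and the two individual variances; the feature length and scales are illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of the logit-variance relation: for z = sum_j w_j * x_j
# with i.i.d. zero-mean w and x, Var(z) ~ n * Var(w) * Var(x).
rng = np.random.default_rng(0)
n = 512                        # assumed input feature length of the last FC layer
sigma_w, sigma_x = 0.05, 1.0   # assumed scales
trials = 2000

W = rng.normal(0.0, sigma_w, size=(trials, n))
x = rng.normal(0.0, sigma_x, size=(trials, n))
z = np.sum(W * x, axis=1)      # one logit per trial

empirical = z.var()
predicted = n * sigma_w**2 * sigma_x**2
print(empirical, predicted)    # the two agree within sampling error
```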
Typically, the weight of the batch normalization layer is on the order of , and thus we have
To avoid output saturation, the softmax function used for calculating the cross-entropy loss has a preferred range of logit values for efficient optimization. To get a more intuitive understanding of this preference, we plot the sigmoid function and its derivative in Fig. S1 as a simplification of the softmax function. It is clear that for efficient optimization, the scale of the input should be well below , from which we can derive that
which is the Efficient Training Rule I.
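The saturation behavior motivating ETR I is easy to see numerically. Using the sigmoid as the simplified stand-in for softmax, its derivative collapses once the input grows past order-one scale, so saturated logits pass essentially no gradient:

```python
import numpy as np

# Sigmoid and its derivative: the gradient signal vanishes for large |x|,
# which is why the logit scale must stay well below the saturation regime.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # maximal sensitivity at the origin
print(sigmoid_grad(10.0))  # effectively no gradient once saturated
```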
For a more intuitive understanding of ETR I, we analyze here the impact of the clamping operation for some typical models, based on the variance ratio given in Fig. 2a. The results are illustrated in Table 1, where the variance of the original weight is given by
and we need to determine if the value of defined as
is sufficiently small in comparison with . Note that for vanilla models and for models with clamping. From the results we can see that models with clamping have a much larger , which results in a violation of ETR I. This explains the degradation of the learning curve for clamping in Fig. 2b.
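How strongly clamping reshapes the weight variance can be illustrated with a quick Monte Carlo estimate. The fan-in and clamp threshold below are illustrative assumptions, not the values from the paper's Table 1:

```python
import numpy as np

# Hedged numeric check of the variance shift caused by clamping.
# Kaiming-initialized weights have std sqrt(2 / n_in); clamping them to a
# fixed interval [-t, t] before quantization changes the variance that the
# downstream layers actually see.
rng = np.random.default_rng(1)
n_in = 64
w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=1_000_000)

t = 0.05                        # a clamp threshold tighter than the weight std
w_clamped = np.clip(w, -t, t)

ratio = w_clamped.var() / w.var()
print(w.std(), w_clamped.std(), ratio)  # ratio far from 1 for a tight threshold
```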
s1.1.2 Training Dynamics
Now we analyze the training dynamics of the network. Suppose for the -th block, the input is , and the effective weight of the linear layer is . In other words, we have
Since the parameters and are constants (although trainable), and the activation function is quasi-linear, we have
where is the number of input neurons in the -th layer, i.e. , with and denoting the number of input channels and the kernel size of the convolution layer, respectively. For a fully-connected layer, we can set . Here, we have omitted the indices of the trainable parameters in the batch normalization layer by simply assuming that they are of similar order. The second equality holds because the activation function is quasi-linear, and in the third equality we have used the assumption of independence. Meanwhile, in the last three equalities, we have used the definition of and also the assumption that and are all independent of each other.
During training, both and are functions of , and we have
Here, is the batch size.
The variances of the gradients are thus given by
where and is the number of output channels. We use the assumption of independence for different variables here again.
From this we can get
Note that the subscripts of variables inside can be eliminated as variables in the same tensor are distributed identically by assumption.
Here, is some empirical parameter of order with respect to .
On the other hand, if there is no batch normalization layers, and will be removed from Eq. (S1.1.2) and can be formulated as
In the typical case, is another empirical parameter of order with respect to .
s1.2 Empirical Verification
We are now ready to verify Eq. (S5) and Eq. (S1.1.2) with experiments. For this purpose, we train a vanilla ResNet18 with Kaiming initialization and batch normalization to guarantee convergent training. In this case, we have . We calculate the two parameters and given by
Here is calculated using statistics from the last fully-connected layer, and is calculated for all convolution layers along the whole training procedure. We plot the values of and in Fig. S2. As we can see, during the whole training process, as expected. in different layers are all of order during the training process, except for the first layer, which involves max pooling and whose analysis is beyond our scope (we set in the computation for this layer). Nevertheless, we notice that the discrepancy is not large.
s1.3 Comments on the Independence Assumption
Here we present some further discussion of the independence assumption, which is used extensively in our analysis. The discussion is meant to facilitate a more intuitive understanding of this assumption, not to provide a rigorous proof.
The first point we want to emphasize is related to the central limit theorem (CLT). Independence is one important condition for the CLT, but sufficiently weak dependence does not invalidate it. Typical examples where independence does not hold include the momenta of molecules in an ideal gas described by the micro-canonical ensemble, or a random walk with persistence. Even with the independence assumption violated, the sum of random variables can still converge to a Gaussian distribution as the number of random variables increases indefinitely. Thus, with a sufficient number of neurons, a linear operation can produce normally distributed outputs, which further leads to normally distributed gradients. Since the weights are initialized with a normal distribution, updating them with normally distributed gradients yields new Gaussian random variables, so we can focus on the variance to characterize their statistical properties. Experiments show that the trained parameters still resemble a Gaussian distribution (sometimes with small spurs), especially for the last several layers.
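The "weak dependence still yields a Gaussian" point can be demonstrated with a toy experiment: summing terms of an AR(1) chain, whose consecutive terms are correlated and whose innovations are deliberately non-Gaussian. The correlation coefficient and sizes below are arbitrary illustrative choices:

```python
import numpy as np

# Sum of weakly dependent, non-Gaussian random variables: an AR(1) chain
# x_t = rho * x_{t-1} + u_t with uniform innovations. Despite the
# dependence between terms, the standardized sum is close to Gaussian.
rng = np.random.default_rng(2)

def ar1_sum(n_terms, n_samples, rho=0.3):
    x = rng.uniform(-1.0, 1.0, size=n_samples)   # non-Gaussian innovations
    total = x.copy()
    for _ in range(n_terms - 1):
        x = rho * x + rng.uniform(-1.0, 1.0, size=n_samples)
        total += x
    return total

s = ar1_sum(200, 50_000)
s = (s - s.mean()) / s.std()
# Excess kurtosis of a Gaussian is 0; a near-zero value indicates normality.
excess_kurtosis = np.mean(s**4) - 3.0
print(excess_kurtosis)
```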
Second, we investigate a simple example to show how the independence property can be derived with a zeroth-order approximation. For simplicity, we analyze the product of two random variables and , both of which are functions of another random variable , that is
We can see that and are generally not independent.
The random variables can be viewed as quantities with some true value given by the mean, perturbed by some small random noise, that is
and we have
For zeroth-order approximation, we can ignore the second term, which is a second-order quantity, and get
For variance of , if , we have
which is an approximated independence property widely used in our derivation. Note that such zeroth-order approximation is broadly adopted in statistical mechanics and related areas for analyzing macroscopic laws of noisy physical systems where nonlinear mechanisms exist. More accurate results could be derived by expansion of the master equation, which is beyond the scope of this paper. Such zeroth-order approximation is also similar to the annealed approximation method in condensed matter physics.
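The accuracy of this linearization can be checked numerically. Below, two variables share a common noise source (so they are not independent), yet the variance of their product computed from the first-order expansion, dropping the second-order term, matches the exact Monte Carlo value. The means and noise scale are illustrative assumptions:

```python
import numpy as np

# Zeroth-order (linearization) check: with small fluctuations da, db around
# the means, ab ~ abar*bbar + abar*db + bbar*da, so
#   Var(ab) ~ abar^2*Var(b) + bbar^2*Var(a) + 2*abar*bbar*Cov(a, b),
# dropping the second-order term da*db.
rng = np.random.default_rng(3)
c = rng.normal(0.0, 0.05, size=1_000_000)   # small shared noise source

a = 2.0 + c          # a = f(c): mean 2, perturbed by c
b = 3.0 - 0.5 * c    # b = g(c): mean 3, perturbed by the same c

exact = (a * b).var()
approx = (a.mean()**2 * b.var() + b.mean()**2 * a.var()
          + 2 * a.mean() * b.mean() * np.cov(a, b)[0, 1])
print(exact, approx)  # agree up to the dropped second-order terms
```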
S2 Impact of Quantization on Weight Variance
In this section, we examine the impact of quantization on weight variance. Fig. S3 shows the ratio between the standard deviations of the quantized and full-precision weights with respect to the number of bits for different channel numbers, which determine the variances of the original non-clamped weights. We find that for precisions higher than 3 bits, quantized weights have nearly the same variance as the full-precision weights, indicating that quantization itself has little impact on training. However, the discrepancy increases significantly for low precisions such as 1 or 2 bits. Different channel numbers give similar results.
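The trend in Fig. S3 can be reproduced with a small simulation. The quantizer below is a generic symmetric uniform one, assumed for illustration; the paper's exact quantizer may differ in details, and the weight scale is an arbitrary choice:

```python
import numpy as np

# Std ratio of quantized vs. full-precision weights at different bit-widths:
# nearly 1 above 3 bits, but a large discrepancy at 1-2 bits.
rng = np.random.default_rng(4)
w = np.clip(rng.normal(0.0, 0.3, size=200_000), -1.0, 1.0)  # clamped weights

def quantize(w, bits):
    levels = 2 ** bits - 1
    return np.round((w + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

ratios = {bits: quantize(w, bits).std() / w.std() for bits in (1, 2, 3, 4, 8)}
print(ratios)
```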
S3 Comparison of Rescaling Method
In this section, we compare the constant rescaling method proposed in the paper with another method, named rescaling with standard deviation. The latter standardizes the effective weights and then rescales them with the standard deviation of the original weights, as in Eq. (S23), where is the sample variance of elements in the weight matrix, calculated by averaging the squares of its elements.
We train MobileNet-V2 on ImageNet with weight clamping and rescale the last fully-connected layer using the two methods, and plot their learning curves as shown in Fig. S4. We find that the two methods give similar results.
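The two rescaling options can be sketched as follows. The clamp threshold, fan-in, and the use of the Kaiming-scale variance as the constant-rescaling target are assumptions for illustration, not the paper's exact constants:

```python
import numpy as np

# Two ways to rescale clamped effective weights so that their variance
# matches the original full-precision weights.
rng = np.random.default_rng(5)
n_in = 256
w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=100_000)  # original weights
w_eff = np.clip(w, -0.05, 0.05)                         # clamped effective weights

# Method 1: standardize, then rescale with the original weights' std.
w_std_rescaled = (w_eff - w_eff.mean()) / w_eff.std() * w.std()

# Method 2: constant rescaling toward the target (Kaiming-scale) std.
target_std = np.sqrt(2.0 / n_in)
w_const_rescaled = w_eff * (target_std / w_eff.std())

print(w_std_rescaled.std(), w_const_rescaled.std(), target_std)
```

Both variants restore the target scale, consistent with the observation that the two methods give similar learning curves.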
S4 Gradient Calculation in PACT
Here we derive the result in Eq. (13) in the paper. The activation value is clamped into the interval and then quantized as follows
Based on the STE rule, we have
Taking derivative with respect to in Eq. (S24), we have
which is the result in Eq. (13) in the paper.
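The PACT forward pass and the STE treatment of the clipping parameter can be sketched as below. This follows the standard PACT rule (gradient flows to alpha only from saturated activations); the exact form of Eq. (13) in the paper is not reproduced here:

```python
import numpy as np

# Minimal PACT sketch: clamp activations into [0, alpha], quantize uniformly,
# and, under the STE, take d y_q / d alpha = 1 where x >= alpha, 0 elsewhere.
def pact_forward(x, alpha, bits):
    y = np.clip(x, 0.0, alpha)             # clamp into [0, alpha]
    scale = (2 ** bits - 1) / alpha
    return np.round(y * scale) / scale     # uniform quantization

def pact_grad_alpha(x, alpha):
    # STE: only saturated activations contribute gradient to alpha.
    return (x >= alpha).astype(float)

x = np.array([-0.5, 0.2, 0.9, 1.5])
alpha, bits = 1.0, 4
print(pact_forward(x, alpha, bits))
print(pact_grad_alpha(x, alpha))
```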
S5 Bitwidths of the First and Last Layers
Here we study the impact of the quantization levels of weights in the first and last layers. Using MobileNet-V1, we compare two settings: quantizing these two layers to a fixed 8 bits, or to the same bit-width as the other layers. As shown in Table 2, we find that the accuracy reduction is negligible for quantization levels higher than 4 bits. Note that, as mentioned in the paper, the input image is always encoded as unsigned 8-bit integers (uint8), and the input to the last layer is quantized with the same precision as the inputs to other layers.
| Bitwidths of Internal Layers | 8 bits for Both Layers | Uniform Quantization |
S6 Rescaling Convolution Layers Not Followed by Batch Normalization
In this section, we study the impact of rescaling on convolution layers that are not followed by BN layers. In Fig. S5, we plot the learning curves of Pre-ResNet50 with 2-bit weight-only quantization on ImageNet, with constant rescaling applied either to only the fully-connected layer or to all layers. We can see that applying rescaling to only the last fully-connected layer is not sufficient for efficient training, while rescaling all convolution layers successfully achieves it.
S7 Elimination of Batch Normalization Layers
As mentioned in the paper, batch normalization layers are not quantized in our experiments, and the full-precision multiplication involved in BN accounts for a considerable amount of computation during inference. Here, we discuss the possibility of eliminating such layers, especially for light-weight models such as MobileNet-V1/V2.
As shown in Eq. (8) in the paper, quantization of the activation function already involves a full-precision multiplication, so we examine whether it is possible to absorb the batch normalization operations into this clipping function. The quantized input to a linear layer can be written as