While deep neural networks have become state-of-the-art techniques for a wide range of machine learning applications, such as image recognition, object detection, and machine translation [32, 8], their computation costs keep increasing, which greatly hampers their development and deployment. For example, 10,000 GPU hours are needed to perform neural architecture search on ImageNet. Quantization is a promising technique to reduce the computation cost of neural network training: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16). Recently, both the software community [6, 12, 16, 19, 27, 35] and the hardware community [11, 24, 23, 31] have carried out extensive research on quantization of deep neural networks for inference tasks.
Though various investigations have demonstrated that deep learning inference can be accurately performed with low-bit-width fixed-point numbers through quantization, quantized training remains an open challenge. Some existing approaches quantize the backward pass to low bit-widths (e.g., int8) but incur a significant accuracy drop, for example, a 3–7% loss on AlexNet [38, 36]. Another approach uses int16 for both the forward pass and the backward pass to ensure accuracy. However, there is no guarantee that a unified int16 precision works for all tasks and networks.
Most previous investigations on quantized training use a unified precision (i.e., bit-width) for all network layers. Intuitively, using mixed precisions for different layers should improve network performance. However, it is hard to find the most appropriate precision for so many layers across so many training iterations. Consider the widely used ResNet50 model: with 4 candidate quantization bit-widths (e.g., 8, 16, 24, 32 for weights, activations, and activation gradients), the quantization precision combination search space over 450,000 training iterations is prohibitively large.
To avoid a prohibitively long search over quantization bit-width combinations, we propose an efficient and adaptive technique that determines the bit-width layer by layer, based on our observation of the relationship between layer-wise bit-width and training convergence. Taking AlexNet as an example, Figure 1(a-c) depicts the distributions of activation gradients in AlexNet's last layer when quantized with different bit-widths. Compared with the original float32, int8 introduces a significant change in the data distribution, int12 introduces a slight change of the data mean, and int16 shows almost the same distribution as float32. Figure 1(d) depicts the corresponding training loss: int8 quantization does not converge at the beginning, int12 converges more slowly than float32, and int16 behaves similarly to float32. These experimental results suggest that if a quantization resolution does not change the data distribution of a layer (e.g., int16 for the last layer of AlexNet), quantized training with that resolution for the corresponding layer will almost preserve the training accuracy.
Based on the above observation, one can train large-scale deep neural networks with fixed-point numbers, with no change of hyper-parameters and no accuracy degradation. For each layer in training, our approach automatically finds the best quantization resolution (i.e., the smallest bit-width that does not significantly change the data mean) for weights, activations, and activation gradients respectively. Concretely, we first calculate the mean of the data before quantization. Then, we quantize the data with int8 and calculate the quantization error. If the quantization error ratio exceeds a threshold (e.g., 3%), the quantization bit-width is increased by 8. This process loops until the quantization error ratio is below the threshold.
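A minimal sketch of this loop, assuming the error ratio is measured on the mean of absolute values (the statistic, function names, and symmetric max-scaled quantizer here are our own assumptions, not the authors' code):

```python
import numpy as np

def choose_bitwidth(x, threshold=0.03, start_bits=8, step=8, max_bits=32):
    """Pick the smallest bit-width whose quantization keeps the data mean
    within `threshold` relative error (hypothetical sketch of the loop above)."""
    ref = np.abs(x).mean()                # mean before quantization
    bits = start_bits
    while bits < max_bits:
        scale = np.abs(x).max() / (2 ** (bits - 1))   # quantization resolution
        q = np.round(x / scale) * scale               # simulated fixed-point quantization
        err = abs(np.abs(q).mean() - ref) / (ref + 1e-12)
        if err <= threshold:              # error ratio below threshold: stop
            break
        bits += step                      # otherwise grow bit-width by 8
    return bits
```

On well-behaved Gaussian data the loop stops at int8, while a long-tailed input with a large outlier forces a larger bit-width, mirroring the observation on fully connected layers.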
We evaluate our approach on a wide variety of network architectures (e.g., convolutional and recurrent networks) and applications (e.g., image classification, object detection, segmentation, and machine translation). Our approach quantizes all weights and activations to int8. On average, 12.56%, 87.43%, and 0.07% of activation gradients are quantized to int8, int16, and int24 respectively. Experimental results show that the proposed adaptive precision training achieves accuracy comparable to float32 when training from scratch; the accuracy loss is only 0.02% on average (ranging from -1.40% to 1.3%). Results on an Intel Xeon Gold 6154 show that the proposed approach achieves a 2.52x speedup over float32 training for AlexNet.
We highlight three major contributions of the proposed adaptive precision training:
Flexibility: The quantization precisions of different layers in different networks are automatically adapted to guarantee the network accuracy.
Efficiency: We quantize both the backward pass and the forward pass with fixed-point numbers in training, which can accelerate training on real hardware. After training, the int8 weights can be directly deployed, so no further quantization is needed.
Generalization: Evaluations on various networks and applications demonstrate that the proposed adaptive precision fixed-point training is effective and practical.
2 Related Works
Using reduced precision for deep learning has been an active research topic. Prior efforts explore low-precision floating-point formats (e.g., 8-bit and 16-bit) for training [34, 22] and maintain accuracy on a spectrum of deep learning models and datasets. However, as floating-point arithmetic is more resource-intensive than fixed-point, deployments usually rely on quantization techniques.
A branch of work explores fixed-point arithmetic for forward propagation (FPROP) [16, 17, 6, 33, 35, 37]. The weights and activations are quantized to 1–8 bits. However, the backward pass, including gradient propagation (BPROP) and weight gradient computation (WTGRAD), still requires float32.
There are recent attempts to quantize the weights and activations of different layers with different bit-widths. For the inference of a trained network, some techniques heuristically search the space of quantization bit-width combinations [35, 33, 37]. However, these inference techniques only need to consider a single iteration, whose search space is much smaller than that of training; hence, they are unsuitable for training. For training, some differentiable quantization methods [4, 30, 37] learn the quantization parameters (e.g., step size, dynamic range, and bit-width) with gradient descent. However, the quantization parameters of backward propagation are hard to learn with differentiable methods. One prior work quantizes the backward propagation; different from their method, which assigns layer-wise bit-widths before training, our approach dynamically changes the bit-widths during training, and we evaluate on widely used networks.
Researchers have shown that 16-bit precision is sufficient for back propagation in most vision training tasks. However, further quantization to 8-bit results in severe degradation [38, 36, 7, 1]. WAGE claims that the first and last layers require higher precision. TBP shows that weight gradient computation (WTGRAD) needs more bits than gradient back propagation (BPROP).
Our approach differs from others in three aspects. First, fixed-point arithmetic is used in both the forward and backward passes of training. Second, the quantization parameters of different layers are dynamically adapted to guarantee accuracy. Lastly, we train a variety of vision and natural language processing applications on large-scale datasets.
3 Observations
The key to fixed-point training is finding proper quantization parameters that ensure the training accuracy. Therefore, we study the relationship between the ever-changing data distributions of different layers and the training convergence.
Observation 1. Data distribution varies greatly between layers. Figure 1(a) depicts the distributions of activation gradients of different layers of AlexNet. The majority of activation gradients concentrate in areas close to zero and have long-tailed distributions. Compared to the convolution layers, the fully connected layers have larger variances. Figure 1(b) shows the base-2 logarithm of the maximum absolute value of activation gradients on AlexNet; the max value in bottom layers (e.g., conv0, conv1, conv2) is smaller than in upper layers (e.g., fc0, fc1, fc2). Intuitively, layers whose data range is wide and whose distribution is centralized demand higher quantization resolutions.
Observation 2. The data range of each layer changes during training. Figure 1(b) shows the evolution of the maximum absolute value of activation gradients during training. At the early stage of training (fewer than 10,000 iterations, shown on the left side of the red line), the data range changes rapidly; after one or two epochs, the data range tends to be stable. This phenomenon suggests that when training from scratch, the quantization range should be updated frequently within the initial epochs.
Observation 3. Data with large variance requires large bit-width. Figure 1(c) shows the convergence curves when using different bit-widths for different layers. Float32 is the convergence curve when using float32 for all the convolution and fully connected layers. After 5,000,000 iterations, the network's top-1 accuracy on ImageNet is 58.00%. Then, we quantize the activation gradients of conv1 to int8 and keep the other layers float32. The training curve of conv1-int8 matches float32 and the final top-1 accuracy is 58.01%. However, when we quantize the activation gradients of fc2 to int8 and keep the other layers float32, training converges significantly more slowly than float32, and within the first 5,000 iterations it does not converge at all. The final top-1 accuracy of fc2-int8 is only 48.27%. When quantizing the activation gradients of fc2 to int12, training converges faster than with int8 but still more slowly than float32; the final top-1 accuracy of fc2-int12 is only 50.30%. Using int16 for the activation gradients of fc2, the training curve finally matches float32, with 58.28% top-1 accuracy. In conclusion, int8 is enough to quantize the activation gradients of conv1, whereas fc2 requires int16 to maintain the training accuracy. Together with Observation 1, we find that data with large variance requires large bit-width, so the quantization parameters should be dynamically determined by the data distribution.
In commonly used initialization schemes, all network parameters are initialized from Gaussian distributions whose variance relates to the hyper-parameters of the layers. Similar initialization principles and similar learning algorithms suggest that our observations should be applicable to various network architectures.
4 Adaptive Precision Training
In this section, we introduce the adaptive precision training approach shown in Figure 3. In each training iteration, the three main computing units are the forward pass (FPROP), the backward pass for gradient propagation (BPROP), and the backward pass for weight gradient computation (WTGRAD). The inputs of these three units are the weights, the activations, and the top layers' activation gradients of each linear layer. In adaptive precision training, we quantize these three inputs to fixed-point numbers (the quantization method is described in Appendix B). The quantization parameters, such as bit-width and quantization resolution, are automatically determined by the proposed Quantization Error Measurement (QEM) and Quantization Parameter Adjustment (QPA).
In the remainder of this section, we introduce the two main components of our training approach, QEM and QPA. Algorithm 1 describes the entire adaptive precision training algorithm. The output of QEM serves as an explicit indicator of insufficient quantization resolution with respect to the data distribution. QPA updates the quantization parameters and determines the update frequency according to the output of QEM.
4.1 Quantization Error Measurement
Based on Observations 1 and 3, we propose to adjust the quantization parameters according to the data distribution. The difference of the data mean before and after quantization is a good quantization error measurement: it indicates a change of data distribution and suggests the need to adjust the quantization resolution.
Intuitively, as shown in Figure 4, the orange and blue lines represent two different data distributions quantized with the same quantization resolution. Under a given quantization resolution, the distribution difference is reflected by the difference of the shadowed areas: S1 is approximately equal to S2, but S3 is much larger than S4. Therefore, for the blue distribution, the mean after quantization is much smaller than the original mean. The difference of mean before and after quantization thus reflects the connection between quantization resolution and data distribution.
Mathematically, assume the data follows a zero-mean Gaussian distribution and is quantized with a fixed quantization resolution. Considering the positive half of the distribution, we compare its mean before quantization with its mean after quantization. Approximating the density locally within each quantization interval, we obtain an expression for the difference of mean.
As demonstrated in Appendix A, the difference of mean is reduced when the quantization resolution decreases or the data variance increases. Therefore, the difference of mean serves as an explicit indicator for adjusting the quantization resolution according to the data distribution.
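This relationship can be checked numerically. The following sketch (our own illustration, not the paper's derivation) measures the relative change of the mean of |x| when Gaussian data is quantized with a given resolution:

```python
import numpy as np

def mean_diff(sigma, resolution, n=200_000, seed=0):
    """Relative change of the mean of |x| caused by quantizing
    Gaussian data N(0, sigma^2) to a grid of the given resolution."""
    x = np.random.default_rng(seed).normal(0.0, sigma, n)
    q = np.round(x / resolution) * resolution   # simulated fixed-point quantization
    return abs(np.abs(q).mean() - np.abs(x).mean()) / np.abs(x).mean()
```

On such data, a coarser resolution, or a smaller standard deviation at the same resolution, yields a larger mean difference, which matches the indicator's intent.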
Equation 2 is used to determine the quantization parameters during training. A larger difference of mean indicates that the distribution has a higher variance relative to the current quantization resolution, so the quantization resolution needs to be decreased (i.e., made finer).
4.2 Quantization Parameter Adjustment
According to Observation 2, we propose to automatically determine the quantization parameters based on the evolution of the data. Under a fixed-point representation, the quantization variables are the data range, the quantization resolution, and the bit-width. These three variables are inter-dependent, as range = resolution × 2^(bit-width − 1), so we use only two of them (the quantization resolution and the bit-width) as quantization parameters. The parameter adjustment process is triggered by insufficient quantization resolution and by dramatic changes of the data range.
For insufficient quantization resolution, we use the QEM output as the indicator. When it exceeds a threshold, the quantization resolution is reduced by increasing the bit-width by a fixed growth step. We can either reset to the initial minimum bit-width and recursively increase it until a proper bit-width is found (denoted Mode1), or start from the previous iteration's proper bit-width (denoted Mode2). The quantization resolution is then adjusted according to the new bit-width as resolution = max|x| / 2^(bit-width − 1), where max|x| is the maximum absolute value of the data to be quantized.
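A sketch of this adjustment under our assumptions (the function name, signature, and default threshold are hypothetical; `err_ratio_fn(x, bits)` stands in for the QEM error ratio):

```python
import numpy as np

def adjust_bitwidth(x, err_ratio_fn, threshold=0.03, step=8,
                    mode=2, prev_bits=8, max_bits=32):
    """Grow the bit-width until the QEM error ratio falls below the threshold.
    Mode1 restarts the search from int8 each time (bit-width may shrink again);
    Mode2 resumes from the previous proper bit-width (it never decreases)."""
    bits = 8 if mode == 1 else prev_bits
    while bits < max_bits and err_ratio_fn(x, bits) > threshold:
        bits += step                                  # bit-width growth step
    resolution = np.abs(x).max() / (2 ** (bits - 1))  # new quantization resolution
    return bits, resolution
```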
For the change of data range, we propose a second per-iteration indicator: the relative deviation of the current data range from its moving average over recent iterations.
The quantization parameter adjustment interval is automatically determined by both indicators. In the initialization phase (one-tenth of the first epoch), the interval is set to 1, i.e., the parameters are adjusted every iteration. After the initialization phase, the interval is increased whenever neither indicator exceeds its threshold. As shown in the experiments, the interval increases during training. Within an interval, the quantization parameters are kept unchanged, so there is no need to calculate the indicators or the maximum absolute value of the data.
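The two triggers and the interval schedule might be sketched as follows; the exact form of the range indicator, its threshold, and the growth factor are our assumptions:

```python
def range_changed(cur_max, moving_avg, tol=0.1):
    """Hypothetical range indicator: relative deviation of the current
    max|x| from its moving average exceeds an assumed tolerance."""
    return abs(cur_max - moving_avg) / moving_avg > tol

def next_interval(qem_triggered, range_triggered, interval, growth=2):
    """Assumed schedule: reset the adjustment interval when either indicator
    fires, otherwise grow it so checks become rarer as training stabilizes."""
    return 1 if (qem_triggered or range_triggered) else interval * growth
```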
5 Experiments

We first evaluate the proposed quantization error measurement and show the computational complexity introduced by adaptive precision. Then, we report accuracy results of adaptive precision training on a wide variety of deep learning tasks, including image classification, object detection, segmentation, and machine translation. Finally, we show the training acceleration on existing hardware.
5.1 Evaluation of Error Measurement
We use the Pearson correlation coefficient in Equation 4 to measure the correlation between network accuracy and each quantization error metric.
M4 is the Kullback-Leibler divergence between the discrete probability distributions of the original data and the data after quantization. Specifically, we quantize each single layer of MobileNet-v2 and ResNet50 and run forward propagation to obtain the corresponding network accuracy. The quantization is done with different bit-widths (i.e., 6 and 8), so various degrees of quantization error and the corresponding network accuracies are generated.
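The correlation itself can be computed with a standard routine; the per-layer accuracy and metric values would come from the experiments above (any numbers used below are illustrative, not the paper's measurements):

```python
import numpy as np

def pearson(accuracies, metric_values):
    """Pearson correlation between per-layer network accuracies and a
    layer-wise quantization-error metric (Equation 4)."""
    return np.corrcoef(accuracies, metric_values)[0, 1]
```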
The results show the linear correlation between network accuracy and several error metrics. Our proposed quantization error measurement M1 has the highest correlation score (0.84 for MobileNet and 0.85 for ResNet50) with the network-level accuracy, which means the proposed error measurement can serve as a reasonable layer-wise accuracy indicator. MobileNet, as a light-weight network, is hard to quantize, as shown in Table 1, so it exhibits the most noticeable differences between the evaluation metrics M1, M2, M3, and M4.
5.2 Computational Complexity
We evaluate the extra computation introduced by adaptive precision quantization, which includes QEM, QPA, and the data quantization itself. Specifically, we calculate the operation percentages of forward and backward propagation in the original training, and the extra operations introduced by forward and backward quantization. Figure 7 shows the operation percentages for different networks (https://github.com/tensorflow/models/tree/master/research/slim; details of the operation quantities are shown in Appendix D). For the light-weight network MobileNet, quantization consumes relatively more computation; for the other networks, the extra quantization computation is within 1%.
We evaluate the quantization parameter adjustment frequency during training. As shown in Figure 7(a), in the initial epochs the adjustment is triggered almost every iteration, so the adjustment frequency is near 100%. As training progresses, the adjustment frequency decreases dramatically, and at the end of training only 0.1% of iterations need to adjust the quantization parameters.
Figure 7(b) shows the percentage of activation gradients quantized to int8 during training of VGG16. Mode1 allows the bit-width to decrease during training, so a larger percentage of layers is kept at int8 (final top-1 accuracy: 70.2%). In Mode2, the bit-width never decreases, so at the end of training 18.75% of layers are kept at int8 (final top-1 accuracy: 70.6%).
5.3 Accuracy Results
Our proposed adaptive precision training approach uses hyper-parameters (e.g., learning rate, max training iterations) identical to the original float32 training settings. For all tasks, we fix the bit-width of weights and activations (i.e., int8) and quantize the activation gradients with adaptive bit-width. For all tasks, the thresholds and the bit-width growth step are fixed, and Mode2 is used in QPA.
5.3.1 Computer Vision
As shown in Table 1, adaptive precision training produces results similar to the float32 baseline. The accuracy drop on MobileNet-v2 is consistent with the quantization results in Google's work (Acc: 70.8). However, with our adaptive precision training, the int8 weights can be directly deployed and no further quantized fine-tuning is needed. The proposed QEM and QPA automatically change the bit-widths used in different layers. The percentages of different bit-widths used to quantize activation gradients over the whole training are shown in Table 1 (results of Mode2, which generates slightly better results than Mode1, as shown in Figure 8(b)). For most layers of most networks, 16-bit is enough; for some layers of AlexNet and SSD, 8-bit is enough.
5.3.2 Machine Translation
We train two widely used machine translation models from scratch with the Adam optimizer. The first, Sockeye, is a sequence-to-sequence RNN model implemented with MXNet (https://github.com/awslabs/sockeye), trained on the WMT'17 news translation dataset (50k sentence pairs). The word vocabularies contain 50K entries each for English and German. The second is Transformer (https://github.com/jadore801120/attention-is-all-you-need-pytorch), which utilizes the self-attention mechanism. This network is trained on the WMT'16 Multi30k dataset (3.9k sentence pairs). Word-level accuracy and perplexity (PPL) are used as evaluation metrics.
The training curve of Sockeye is shown in Figure 8(a). Adaptive precision training is compared with the float32 baseline and with an int16 method that quantizes all layers of activation gradients to int16 without bit-width adaptation. At the end of adaptive precision training, 0.8% of the activation-gradient layers are quantized to int24, 10% to int8, and the rest to int16. As shown in Figure 8(a), the int16 method gradually incurs a 2% loss of accuracy, while our adaptive precision training reaches the same accuracy (62.05%) as the float32 baseline (61.97%). This comparison shows that the proposed bit-width adaptation is necessary to guarantee training accuracy while reducing the total bit-width in computation.
The training convergence curve of Transformer is shown in Figure 8(b). We report accuracy and PPL on the validation set. Adaptive precision (ACC: 55.54%) is slightly better than float32 (ACC: 54.13%). On average, 2.28% of iterations trigger quantization parameter adjustment.
5.3.3 Comparison to Others
Table 2 shows the comparison to other quantization methods. As the float32 baseline accuracies differ across works, we cite the relative accuracy degradation each work reports against its own float32 baseline. Most works do not quantize the backward pass and are tested only on convolutional neural networks. Among these, the closest method uses int16 for both forward and backward propagation and reports results on convolutional networks. In contrast, we use int8 for the entire forward pass, and we demonstrate that for recurrent neural networks a fixed bit-width (e.g., int16) cannot meet the precision requirements of all tasks. Therefore, the bit-width requirement needs to be measured dynamically for different networks and tasks.
6 Training Acceleration
The Intel Xeon Gold 6154 supports vector int8/int16 operations with the AVX2 instruction set, and the Nvidia T4 supports vector int8 operations. Table 3 shows the speedup of our method over float32 training. Specifically, we use the average acceleration ratio over 100 iterations of each layer in the forward and backward passes of AlexNet with batch size 256 (as the T4 does not support int16, we only report the forward pass using int8 operations; the Xeon Gold 6154 only supports multiplication between equal-bit-width fixed-point numbers, so int16 × int8 is implemented as int16 × int16). Our approach achieves a 2.52× speedup over float32 training on the CPU and a 2.89× speedup on the GPU. Figure 10 shows the detailed running times of convolutions of different scales with different operation counts. With fixed-point arithmetic, the computation time is significantly shorter than float32, and the extra time introduced by QEM and QPA is relatively small.
7 Conclusion and Future Work
We observe that the data distribution reflects the precision required to maintain training accuracy. Therefore, we propose an adaptive precision quantization approach that automatically determines bit-widths layer-wise. Quantizing back propagation in neural networks can further accelerate training on hardware supporting flexible bit-width arithmetic operations. In the future, the proposed quantization error measurement could also be extended to low-bit inference (e.g., binary or ternary) and to gradient compression.
-  (2018) Scalable methods for 8-bit training of neural networks. In NeurIPS, pp. 5145–5153. Cited by: §2, Table 2.
-  (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §1.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §5.3.1.
-  (2019) Deep neural network quantization via layer-wise optimization using limited training data. In AAAI, Cited by: §2.
-  (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Cited by: §5.3.2.
-  (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv:1805.06085. Cited by: §1, §2, Table 2.
-  (2018) Mixed precision training of convolutional neural networks using integer operations. In ICLR, Cited by: §1, §2, §5.3.3, Table 2.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1.
-  (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §5.3.1.
-  (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256. Cited by: §3.
-  (2015) Deep learning with limited numerical precision. In ICML, pp. 1737–1746. Cited by: §1.
-  (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR. Cited by: §1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, pp. 1026–1034. Cited by: §3.
-  (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §5.3.1.
-  (2017) Sockeye: a toolkit for neural machine translation. arXiv:1712.05690. Cited by: §5.3.2.
-  (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713. Cited by: §1, §2, §5.3.1, Table 2.
-  (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, pp. 4350–4359. Cited by: §2, Table 2.
-  (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §5.3.1.
-  (2016) Fixed point quantization of deep convolutional networks. In ICML, pp. 2849–2858. Cited by: §1.
-  (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §5.3.1.
-  (2016) SSD: single shot multibox detector. In ICCV, pp. 21–37. Cited by: §1, §5.3.1.
-  (2018) Mixed precision training. ICLR. Cited by: §2, Table 2.
-  (2017) NVIDIA tesla v100 gpu architecture. Cited by: §1.
-  (2018) Lower numerical precision deep learning inference and training. Intel White Paper. Cited by: §1.
-  (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §5.3.1.
-  (2019) Per-tensor fixed-point quantization of the back-propagation algorithm. In ICLR, Cited by: §2.
-  (2018) A quantization-friendly separable convolution for mobilenets. In EMC2, pp. 14–18. Cited by: §1, §5.1, §5.3.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §5.3.1.
-  (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826. Cited by: §5.3.1.
-  (2019) Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452. Cited by: §2.
-  (2018) Bismo: a scalable bit-serial matrix multiplication overlay for reconfigurable computing. In FPL, pp. 307–3077. Cited by: §1.
-  (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §5.3.2.
-  (2019) HAQ: hardware-aware automated quantization with mixed precision. In CVPR, pp. 8612–8620. Cited by: §2, §2.
-  (2018) Training deep neural networks with 8-bit floating point numbers. In NeurIPS, pp. 7675–7684. Cited by: §2, Table 2.
-  (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv:1812.00090. Cited by: §1, §2, §2, Table 2.
-  (2018) Training and inference with integers in deep neural networks. In ICLR, Cited by: §1, §2, Table 2.
-  (2019-06) Quantization networks. In CVPR, Cited by: §2, §2, Table 2.
-  (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160. Cited by: §1, §2, Table 2.
-  (2018) Adaptive quantization for deep neural network. In AAAI, Cited by: §5.1, Table 2.
Appendix A
Appendix B. Quantization Method
Table 4: quantization function, quantization scale, and fixed-point range.
A fixed-point number consists of a sign bit, an integer part, and a global quantization resolution related to the fixed-point position. The representable data range, the bit-width, and the quantization resolution are inter-dependent: range = resolution × 2^(bit-width − 1). The quantization resolution is calculated from the maximum absolute value of the data before quantization, as shown in Table 4, column 2. Since each floating-point value is approximated by its fixed-point integer multiplied by the quantization resolution, the multiplication between two numbers becomes an integer multiplication whose result is scaled by the product of the two resolutions.
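A sketch of this quantization scheme and the resulting fixed-point multiplication, assuming a symmetric range with resolution = max|x| / 2^(bits − 1) (function names and the clipping convention are our own):

```python
import numpy as np

def to_fixed(x, bits):
    """Quantize a float array to a (integers, resolution) pair."""
    res = np.abs(x).max() / (2 ** (bits - 1))
    ints = np.clip(np.round(x / res),
                   -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).astype(np.int64)
    return ints, res

def fixed_mul(a_int, a_res, b_int, b_res):
    """Multiplication stays in integer arithmetic; the resolutions multiply."""
    return a_int * b_int, a_res * b_res
```

Dequantizing the integer product with the combined resolution recovers the floating-point product up to quantization error.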
Appendix C. Observations on Other Network
As shown in Figure 11, for ResNet34, int8 is enough to quantize the activation gradients of g3b2c2, g2b5c1, and g3b2c1; however, int8 for fc and conv0 either does not converge or introduces an accuracy drop, as conv0 and fc have large variances. These observations are consistent with those on AlexNet. In conclusion, data with large variance requires large bit-width, so the quantization parameters should be dynamically determined by the data distribution.
Appendix D. Operation Quantity
Appendix E. Speedup over int16
There is a 1.3× speedup over int16 on the CPU for AlexNet (1.13× for the backward pass and 1.7× for the forward pass). The int16 × int8 in our method is implemented as int16 × int16 on the Xeon Gold 6154. With flexible arithmetic operations like int16 × int8 on future hardware, higher training speedups are promising.