1 Introduction
While deep neural networks have become the state-of-the-art technique for a wide range of machine learning applications, such as image recognition [14], object detection [21] and machine translation [32, 8], their computation costs keep increasing, which greatly hampers the development and deployment of deep neural networks. For example, 10,000 GPU hours are used to perform a neural architecture search on ImageNet [2]. Quantization is a promising technique to reduce the computation cost of neural network training: it replaces high-cost floating-point numbers (e.g., float32) with low-cost fixed-point numbers (e.g., int8/int16). Recently, both the software community [6, 12, 16, 19, 27, 35] and the hardware community [11, 24, 23, 31] have carried out extensive research on the quantization of deep neural networks for inference tasks.
Although various investigations have demonstrated that deep learning inference can be accurately performed with low-bitwidth fixed-point numbers through quantization, quantified training remains an open challenge. Some existing approaches quantify the backward-pass to low-bit formats (e.g., int8) but incur a significant accuracy drop, for example a 3–7% loss on AlexNet [38, 36]. [7] uses int16 for both the forward-pass and the backward-pass to ensure accuracy. However, there is no guarantee that a unified int16 precision works for all tasks and networks.
Most previous investigations on quantified training use a unified precision (i.e., bitwidth) for all network layers. Intuitively, using mixed precisions for different layers should improve network performance; however, it is hard to find the most appropriate precision for so many layers over so many training iterations. Consider the widely used ResNet-50 model: with 4 candidate quantization bitwidths (e.g., 8, 16, 24, 32) for weights, activations and activation gradients, the size of the quantization-precision combination search space over 450,000 training iterations reaches 4^(3×50×450,000).
To avoid a prohibitively long search over quantization bitwidth combinations, we propose an efficient and adaptive technique that determines the bitwidth layer by layer, based on our observation of the relationship between layerwise bitwidth and training convergence. Take AlexNet as an example: Figure 1(a–c) depicts the distributions of activation gradients of AlexNet's last layer when quantified with different bitwidths. Compared with the original float32, int8 introduces a significant change in the data distribution, int12 slightly changes the data mean, and int16 shows almost the same distribution as float32. Figure 1(d) depicts the corresponding training loss: int8 quantization does not converge at the beginning, int12 converges more slowly than float32, and int16 behaves similarly to float32. These experimental results suggest that if a quantization resolution does not change the data distribution of a layer (e.g., int16 for the last layer of AlexNet), quantified training with this resolution for that layer will almost preserve the training accuracy.
Based on the above observation, one can train large-scale deep neural networks using fixed-point numbers, with no change of hyperparameters and no accuracy degradation. For each layer in training, our approach automatically finds the best quantization resolution (i.e., the smallest bitwidth that does not significantly change the data mean) for weights, activations and activation gradients respectively. Concretely, we first calculate the mean of the data before quantization. Then, we quantify the data using int8 and calculate the quantization error. If the ratio of quantization error exceeds a threshold (e.g., 3%), the quantization bitwidth is increased by 8. This process loops until the quantization error ratio falls below the threshold.
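The bitwidth search described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes round-to-nearest symmetric quantization with resolution max|x| / 2^(b−1), uses the mean of absolute values as the statistic, and caps the search at 32 bits (the function names and the cap are ours):

```python
import numpy as np

def quantize(x, bits):
    """Symmetric fixed-point quantization with resolution s = max|x| / 2^(bits-1)."""
    s = np.max(np.abs(x)) / (2 ** (bits - 1))
    if s == 0:
        return x.copy()
    q = np.clip(np.round(x / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * s

def mean_shift_ratio(x, xq):
    """Relative change of the mean of absolute values caused by quantization."""
    m = np.mean(np.abs(x))
    return abs(m - np.mean(np.abs(xq))) / m if m > 0 else 0.0

def select_bitwidth(x, threshold=0.03, start=8, step=8, max_bits=32):
    """Start at int8 and grow the bitwidth by 8 until the error ratio is below threshold."""
    bits = start
    while bits < max_bits and mean_shift_ratio(x, quantize(x, bits)) > threshold:
        bits += step
    return bits
```

For a long-tailed tensor (a few large outliers dominating max|x|), int8 leaves most values in the zero bin and the quantified mean collapses, so the loop escalates to int16.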
We evaluate our approach on a wide variety of network architectures (e.g., convolutional and recurrent networks) and applications (e.g., image classification, object detection, segmentation and machine translation). Our approach quantifies all weights and activations to int8. On average, 12.56%, 87.43% and 0.07% of activation gradients are quantified to int8, int16 and int24 respectively. Experimental results show that the proposed adaptive precision training approach achieves accuracy comparable to float32 when training from scratch: the accuracy loss is only 0.02% on average (ranging from a 1.40% gain to a 1.3% loss). Results on an Intel Xeon Gold 6154 show that the proposed approach achieves a 2.52× speedup over float32 training for AlexNet.
We highlight three major contributions of the proposed adaptive precision training:

Flexibility: The quantization precisions for different layers of different networks are automatically adapted to guarantee the network accuracy.

Efficiency: We quantify both the backward-pass and the forward-pass with fixed-point numbers in training, which can accelerate training on real hardware. After training, the int8 weights can be deployed directly, so no further quantification is needed.

Generalization: Evaluations on various networks and applications demonstrate that the proposed adaptive precision fixed-point training is effective and practical.
2 Related Works
Using reduced precision for deep learning has been an active research topic. Prior efforts explore low-bitwidth floating-point formats (e.g., 8-bit and 16-bit) for training [34, 22] and maintain accuracy on a spectrum of deep learning models and datasets. However, as floating-point arithmetic is more resource-intensive than fixed-point, deployments usually rely on quantization techniques.
A branch of work explores fixed-point formats for forward propagation (FPROP) [16, 17, 6, 33, 35, 37]. The weights and activations are quantified to 1–8 bits. However, the backward-pass, including gradient propagation (BPROP) and weight gradient computation (WTGRAD), still requires float32.
There are recent attempts to quantify weights and activations of different layers with different bitwidths. For the inference of a trained network, some techniques heuristically search the space of quantization bitwidth combinations [35, 33, 37]. However, these inference techniques only need to consider a single iteration, whose search space is much smaller than that of training; hence, they are unsuitable for training. For training, some differentiable quantization methods [4, 30, 37] learn the quantization parameters (e.g., step size, dynamic range and bitwidth) with gradient descent. However, the quantization parameters for backward propagation are hard to learn with differentiable methods. [26] quantifies the backward propagation. Different from their method, which assigns layerwise bitwidths before training, our approach dynamically changes the bitwidth during training, and we evaluate on widely used networks.
Researchers have shown that 16-bit is sufficient for back propagation in most vision training tasks [7]. However, further quantization to 8-bit results in severe degradation [38, 36, 7, 1]. WAGE [36] claims that the first and last layers require higher precision. TBP [1] shows that weight gradient computation (WTGRAD) needs more bits than gradient back propagation (BPROP).
Our approach differs from others in three aspects. First, fixed-point numbers are used in both the forward-pass and the backward-pass of training. Second, the quantization parameters of different layers are dynamically adapted to guarantee accuracy. Lastly, we train a variety of vision and natural language processing applications on large-scale datasets.
3 Observation
The key to fixed-point training is to find proper quantization parameters that ensure the training accuracy. Therefore, we study the relationship between the ever-changing data distributions of different layers and the training convergence.
Observation 1. Data distribution varies greatly between layers. Figure 1(a) depicts the distributions of activation gradients of different layers of AlexNet. The majority of activation gradients concentrate in areas close to zero and have long-tailed distributions. Compared to the convolution layers, the fully connected layers have larger variances. Figure 1(b) shows the base-2 logarithm of the maximum absolute value of activation gradients on AlexNet: the maximum value in the bottom layers (e.g., conv0, conv1, conv2) is smaller than in the upper layers (e.g., fc0, fc1, fc2). Intuitively, layers whose data range is wide and whose distribution is centralized demand higher quantization resolutions.
Observation 2. The data range of each layer changes during training. Figure 1(b) shows the evolution of the maximum absolute value of the activation gradients during training. At the early stage of training (fewer than 10,000 iterations, shown on the left side of the red line), the data range changes rapidly; after one or two epochs, the data range tends to become stable. This phenomenon suggests that when training from scratch, the quantization range should also be changed frequently within the initial epochs.
Observation 3. Data with large variance requires large bitwidth. Figure 1(c) shows the convergence curves using different bitwidths for different layers. Float32 is the training convergence curve using float32 for all the convolution and fully connected layers. After 5,000,000 iterations, the network's top-1 accuracy on ImageNet is 58.00%. Then, we quantify the activation gradients of conv1 to int8 and keep the other layers float32. The training curve of conv1-int8 is the same as float32 and the final top-1 accuracy is 58.01%. However, when we quantify the activation gradients of fc2 to int8 and keep the other layers float32, training converges significantly more slowly than float32, and within the first 5,000 iterations the training does not converge at all. The final top-1 accuracy of fc2-int8 is only 48.27%. When quantifying the activation gradients of fc2 to int12, training converges faster than with int8 but still more slowly than float32; the final top-1 accuracy of fc2-int12 is only 50.30%. Using int16 for the activation gradients of fc2, the training curve finally matches float32, with 58.28% top-1 accuracy. In conclusion, int8 is enough to quantify the activation gradients of conv1, whereas fc2 requires int16 to maintain the training accuracy. Together with Observation 1, we find that data with large variance requires large bitwidth; thus the quantization parameters should be dynamically determined by the data distribution.
According to network initialization principles [10, 13], all network parameters are initialized from Gaussian distributions whose variances relate to the hyperparameters of the layers. Similar network initialization principles and similar learning algorithms ensure that our observations should be applicable to various network architectures.
4 Adaptive Precision Training
In this section, we introduce the adaptive precision training approach shown in Figure 3. In training, the three main computing units of a single iteration are the forward-pass (FPROP), the backward-pass for gradient propagation (BPROP) and the backward-pass for weight gradient computation (WTGRAD). The inputs of these three units are the weights, activations and top layers' activation gradients of each linear layer. In adaptive precision training, we quantify these three inputs to fixed-point numbers (the quantification method is described in Appendix B). The quantification parameters, such as bitwidth and quantization resolution, are automatically determined by the proposed Quantization Error Measurement (QEM) and Quantification Parameter Adjustment (QPA).
In the remainder of this section, we introduce the two main components of our training approach, QEM and QPA. Algorithm 1 describes the entire adaptive precision training algorithm. The output of QEM serves as an explicit indicator of insufficient quantization resolution with respect to the data distribution. QPA performs the quantization parameter update and determines the update frequency according to the output of QEM.
4.1 Quantization Error Measurement
Based on Observation 1 and Observation 3, we propose to adjust the quantization parameters according to the data distribution. The difference of the mean before and after quantization is a good quantization error measurement: it indicates the change of data distribution and signals the need to adjust the quantization resolution.
Intuitively, as shown in Figure 4, the orange line and the blue line represent two different data distributions quantified with the same quantization resolution. Under a given resolution, the distribution difference is reflected by the difference of the shaded areas. Specifically, the shaded area S1 is approximately equal to S2, but S3 is much larger than S4. Therefore, for the blue distribution, the mean after quantization is much smaller than the original mean. The difference of the mean before and after quantization thus reflects the connection between the quantization resolution and the data distribution.
Mathematically, assume the data follows a Gaussian distribution N(0, σ²) and is quantified with resolution s, i.e., each value is mapped to a multiple of s. Considering the positive half of the distribution, one can compare the mean before quantization with the mean after quantization; approximating the density locally within each quantization step, the difference of the two means can be written as a function of the ratio s/σ (see Appendix A for details): the difference shrinks when s decreases or σ increases. Therefore, the difference of the mean serves as an explicit indicator for adjusting the quantization resolution s according to the data distribution (represented by σ).
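This effect can be checked numerically: with a fixed resolution, a narrow distribution loses far more of its mean to quantization than a wide one. A small sketch (the resolution value and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.05  # fixed quantization resolution, chosen arbitrarily for illustration

def mean_shift(sigma, n=200_000):
    """Relative mean difference of the positive half of N(0, sigma^2)
    after round-to-nearest quantization with resolution s."""
    x = np.abs(rng.normal(0.0, sigma, n))  # positive half, as in the text
    xq = np.round(x / s) * s
    return abs(x.mean() - xq.mean()) / x.mean()

wide = mean_shift(sigma=0.5)     # s/sigma small: distribution barely changes
narrow = mean_shift(sigma=0.02)  # s/sigma large: most mass falls into the zero bin
```

Here `narrow` comes out roughly two orders of magnitude larger than `wide`, matching the claim that decreasing s or increasing σ reduces the difference of the mean.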
Equation 2 is used to determine the quantization parameters during training:
D = |E[X] − E[Q(X)]| / |E[X]|,  (2)
where X is the data before quantization and Q(X) the data after quantization. A larger D indicates that the current quantization resolution s is too coarse for the data distribution, so the quantization resolution s needs to be decreased.
4.2 Quantification Parameter Adjustment
According to Observation 2, we propose to automatically determine the quantization parameters based on the data evolution. Under fixed-point representation, the quantization variables are the data range R, the quantization resolution s and the bitwidth b. These three variables are interdependent, as R = 2^(b−1) · s. Therefore, we use only two of them as quantization parameters (i.e., s and b). The parameter adjustment process is triggered by insufficient quantization resolution and by dramatic changes of the data range.
For insufficient quantization resolution, we use the QEM output (Equation 2) as the indicator. When it exceeds a certain threshold, the quantization resolution is reduced by increasing the bitwidth, b ← b + k, where k is the bitwidth growth step. We can either set the initial bitwidth to int8 and recursively adjust the bitwidth until it is proper (denoted Mode 1), or set the initial bitwidth to the previous iteration's proper bitwidth (denoted Mode 2). The quantization resolution is then adjusted according to the new bitwidth as s = max|x| / 2^(b−1), where max|x| is the maximum absolute value of the data to be quantified.
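The two modes can be sketched on top of a simple quantizer. This is a hedged illustration with helper names of our own, reusing the 3% threshold and growth step of 8 mentioned earlier and the mean of absolute values as the statistic:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric fixed-point quantization with resolution s = max|x| / 2^(bits-1)."""
    s = np.max(np.abs(x)) / (2 ** (bits - 1))
    if s == 0:
        return x.copy()
    q = np.clip(np.round(x / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * s

def error_ratio(x, bits):
    """Relative change of the mean of absolute values after quantization."""
    m = np.mean(np.abs(x))
    return abs(m - np.mean(np.abs(quantize(x, bits)))) / m if m > 0 else 0.0

def adjust_bitwidth(x, prev_bits, threshold=0.03, step=8, mode=1, max_bits=32):
    """Mode 1 restarts the search from int8 (bitwidth may decrease);
    Mode 2 resumes from the previous bitwidth (it never decreases)."""
    bits = 8 if mode == 1 else prev_bits
    while bits < max_bits and error_ratio(x, bits) > threshold:
        bits += step
    return bits
```

On a long-tailed tensor that needs int16, Mode 1 called with a previous bitwidth of 24 drops back to 16, while Mode 2 stays at 24, which is why Mode 1 keeps more layers at low bitwidths.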
For the change of the data range, we propose another indicator for each iteration:
D_R = |max|x| − m| / m,  (3)
where m is the moving average of the maximum absolute value of the data over several iterations.
The quantization parameter adjustment interval T is automatically determined by both the QEM indicator and the data-range indicator. In the initialization phase (one-tenth of the first epoch), T is set to 1, i.e., the quantization parameters are adjusted every iteration. After the initialization phase, the adjustment interval grows while both indicators remain small and shrinks when either becomes large. As shown in the experiments, T increases during training. Within an interval of T iterations, the quantization parameters are kept the same, so there is no need to calculate the quantization error measurement or the maximum absolute value of the data.
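One plausible scheduling policy of this kind is multiplicative growth with reset. The exact update rule is not specified here, so the grow/reset scheme, thresholds and names below are our own simplification, chosen only to reproduce the reported behaviour (adjustment nearly every iteration early in training, and very rarely at the end):

```python
def next_interval(T, err_ratio, range_ratio,
                  err_thr=0.03, range_thr=0.1, grow=2, T_max=1024):
    """Grow the adjustment interval while both indicators stay below their
    thresholds; reset to 1 (adjust immediately) when either indicator fires."""
    if err_ratio < err_thr and range_ratio < range_thr:
        return min(T * grow, T_max)
    return 1
```

Early in training the data range changes rapidly, so the range indicator keeps firing and T stays at 1; once the distributions stabilize, T doubles up to its cap and quantization parameters are recomputed only rarely.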
5 Experiment
We first evaluate the proposed quantization error measurement and report the computational overhead introduced by adaptive precision. Then, we evaluate adaptive precision training on a wide variety of deep learning tasks, including image classification, object detection, segmentation and machine translation, in terms of accuracy. Finally, we show the training acceleration on existing hardware.
5.1 Evaluation of Error Measurement
We use the Pearson correlation coefficient in Equation 4 to show the correlation between network accuracy and a quantization error metric.
r = cov(A, M) / (σ_A · σ_M),  (4)
where A denotes the vector of network accuracies and M the vector of error metric values.
The evaluated quantization error metrics include the proposed measurement M1 (the difference of the mean before and after quantization, Section 4.1) and several variants M2, M3 and M4. M2 is similar to the measurements in [27, 39]. M4 is the Kullback-Leibler divergence between the discrete probability distributions of the original data and the data after quantization. Specifically, we quantify each single layer of MobileNet-v2 and ResNet-50 and run the forward propagation to get the corresponding network accuracy. The quantization is done with different bitwidths (i.e., 6, 8), so various degrees of quantization error and the corresponding network accuracies are generated.
The results show the linear correlation between network accuracy and the error metrics. Our proposed quantization error measurement M1 has the highest correlation score (0.84 for MobileNet and 0.85 for ResNet-50) with the network-level accuracy, which means the proposed error measurement can serve as a reasonable layerwise accuracy indicator. As shown in Table 1, MobileNet, a lightweight network, is hard to quantify, so it exhibits the most noticeable differences between the evaluation metrics M1, M2, M3 and M4.
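The correlation score itself is the standard Pearson coefficient. As a small sketch (the per-layer error and accuracy values below are made-up numbers for illustration, not the paper's measurements):

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(np.asarray(u, float), np.asarray(v, float))[0, 1])

# Hypothetical per-layer data: higher quantization error -> lower accuracy,
# so a good error metric correlates strongly (negatively) with accuracy.
error_metric = [0.001, 0.010, 0.050, 0.120, 0.300]
accuracy     = [0.760, 0.758, 0.741, 0.690, 0.520]
score = pearson(error_metric, accuracy)  # strongly negative for this toy data
```

A metric whose per-layer values track the eventual network-level accuracy this closely can be used as a layerwise proxy, which is exactly how M1 is used by QEM.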
5.2 Computational Complexity
We evaluate the extra computation introduced by adaptive precision quantification. The extra computation includes QEM, QPA and the data quantification itself. Specifically, we calculate the operation counts of forward propagation and backward propagation in the original training, and the extra operations introduced by forward and backward quantification. Figure 7 shows the operation percentages for different networks (model implementations from https://github.com/tensorflow/models/tree/master/research/slim; details of the operation quantities are given in Appendix D). For the lightweight network MobileNet, quantization consumes relatively more computation; for the other networks, the extra quantization computation is within 1% of the total.
We also evaluate the quantification parameter adjustment frequency during training. As shown in Figure 7(a), in the initial epochs the adjustment is triggered almost every iteration, so the adjustment frequency is near 100%. As training progresses, the adjustment frequency decreases dramatically, and at the end of training only 0.1% of iterations need to adjust the quantization parameters.
Figure 7(b) shows the percentage of activation gradients quantified to int8 during training of VGG16. Mode 1 allows the bitwidth to decrease during training, so a larger percentage of layers are kept at int8 (final top-1 accuracy: 70.2%). In Mode 2 the bitwidth never decreases, so at the end of training 18.75% of the layers are kept at int8 (final top-1 accuracy: 70.6%).
5.3 Accuracy Results
Classification  float32  Adaptive  Activation Gradient  

Network  Acc  Acc  int8  int16 
AlexNet  58.0  58.22  22.5%  77.5% 
VGG16  71.0  70.6  31.3%  68.7% 
Inception_BN  73.0  72.8  4.5%  95.5% 
ResNet50  76.4  76.2  0.8%  99.2% 
ResNet152  78.8  78.2  1.7%  98.3% 
MobileNet v2  71.8  70.5  0.7%  99.2% 
SSD Detection  float32  Adaptive  Activation Gradient  
Network  mAP  mAP  int8  int16 
COCO_VGG  43.1  42.4  31.4%  68.6% 
VOC_VGG  77.3  77.2  34.3%  65.7% 
IMG_Res101  44.1  44.4  28.6%  71.4% 
Segmentation  float32  Adaptive  Activation Gradient  
Network  meanIoU  meanIoU  int8  int16 
deeplabv1  70.1  69.9  1.0%  99.0% 
Our proposed Adaptive Precision Training uses hyperparameters (e.g., learning rate, maximum training iterations) identical to the original float32 training settings. For all tasks, we fix the bitwidth of weights and activations to int8 and quantify the activation gradients with adaptive bitwidths. For all tasks, we use a single set of QEM/QPA hyperparameters (e.g., the quantization error threshold of 3% and the bitwidth growth step of 8 introduced in Section 4), and Mode 2 is used in QPA.
5.3.1 Computer Vision
We train several convolutional neural networks on the ImageNet dataset using the TensorFlow framework (https://github.com/tensorpack/tensorpack/tree/master/examples/). The networks include AlexNet [18], VGG [28], Inception-BN [29], ResNet [14] and MobileNet v2 [27] (https://github.com/tensorflow/models/tree/master/research/slim). We train SSD object detection networks [21] (https://github.com/weiliu89/caffe/tree/ssd) on the VOC dataset [9], the COCO dataset [20] and the ImageNet Detection dataset (IMG) [25], with two backbone networks, VGG and ResNet-101. We train the DeepLab [3] segmentation network (https://github.com/msracver/DeformableConvNets) on the VOC dataset. For the classification task, top-1 accuracy (Acc) is used as the evaluation metric; for object detection, mean average precision (mAP); for segmentation, mean intersection over union (meanIoU).
Methods  Backward  Adaptive  Training  Accuracy Degradation  

Cited  (WTGRAD/BPROP)  Bitwidth  from Scratch  CNN  RNN 
[34]  float8, float16  no  yes  %(ResNet50)  n/a 
[22]  float16  no  yes  %(ResNet50)  % (Translation) 
[16]  float32  no  no  1.5% (ResNet50)  n/a 
[17]  float32  no  no  %(ResNet18)  n/a 
[6]  float32  no  yes  %(ResNet50)  n/a 
[35]  float32  yes  no  %(ResNet18)  n/a 
[39]  float32  yes  no  %(ResNet50)  n/a 
[37]  float32  yes  yes  %(ResNet50)  n/a 
[38]  int8, float32  no  yes  2.9%(AlexNet)  n/a 
[36]  int8  no  yes  4%(AlexNet)  n/a 
[1]  int16, float32  no  yes  %(ResNet50)  n/a 
[7]  int16  no  yes  %(ResNet50)  2% (Translation) 
Adaptive Precision  int8–16 (CNN) int8–24 (RNN)  yes  yes  %(ResNet50)  % (Translation) 
As shown in Table 1, Adaptive Precision Training generates results similar to the float32 baseline. The accuracy drop on MobileNet-v2 is consistent with the quantization results in Google's work (Acc: 70.8) [16]. However, with our adaptive precision training, the int8 weights can be deployed directly and no further quantified fine-tuning is needed. The proposed QEM and QPA automatically change the bitwidths used for different layers. The percentages of the different bitwidths used to quantify the activation gradients over the whole training are shown in Table 1 (these are the results of Mode 2, which generates slightly better results than Mode 1, as shown in Figure 8(b)). For most layers of most networks, 16 bits are enough; for some layers of AlexNet and SSD, 8 bits are enough.
5.3.2 Machine Translation
We train two widely used machine translation models from scratch with the Adam optimizer. The first, Sockeye [15], is a sequence-to-sequence RNN model implemented in MXNet [5] (https://github.com/awslabs/sockeye), trained on the WMT'17 news translation dataset (50k sentence pairs). The word vocabularies contain 50K entries each for English and German. The second is the Transformer [32] (https://github.com/jadore801120/attention-is-all-you-need-pytorch), which uses the self-attention mechanism; this network is trained on the WMT'16 Multi30k dataset (3.9k sentence pairs). Word-level accuracy and perplexity (PPL) are used as evaluation metrics.
The training curve of Sockeye is shown in Figure 8(a). Adaptive Precision Training is compared with the float32 baseline and an int16 method, which quantifies the activation gradients of all layers to int16 without bitwidth adaptation. At the end of Adaptive Precision Training, 0.8% of the activation gradient layers are quantified to int24, 10% to int8, and the rest to int16. As shown in Figure 8(a), the int16 method gradually incurs a 2% loss of accuracy, while our Adaptive Precision reaches the same accuracy (62.05%) as the float32 baseline (61.97%). This comparison shows that the proposed bitwidth adaptation is necessary to guarantee training accuracy while reducing the total bitwidth of computation.
The training convergence curve of the Transformer is shown in Figure 8(b). We report accuracy and PPL on the validation set. Adaptive Precision (Acc: 55.54%) is slightly better than float32 (Acc: 54.13%). On average, 2.28% of iterations trigger a quantization parameter adjustment.
5.3.3 Comparison to Others
Table 2 shows the comparison to other quantization methods. As the float32 baseline accuracies differ across works, we cite the relative accuracy degradation with respect to each work's reported float32 baseline. Most works do not quantify the backward-pass and are tested only on convolutional neural networks. Among them, [7] is the most similar method: they use int16 for both forward and backward propagation and report results on convolutional networks. Differently, we use int8 for the entire forward-pass, and we demonstrate that for recurrent neural networks a fixed bitwidth (e.g., int16) cannot meet the precision requirements of all tasks. Therefore, the bitwidth requirement needs to be measured dynamically for different networks and tasks.
conv0  conv1  conv2  conv3  conv4  
CPU Forward  2.03  3.89  6.2  4.44  4.28 
CPU Backward  1.91  1.71  1.78  2.21  2.07 
GPU Forward  2.82  3.63  2.97  3.01  2.72 
fc0  fc1  fc2  Overall  
CPU Forward  4.09  6.42  4.41  3.98  
CPU Backward  4.41  4.97  2.03  2.07  
GPU Forward  3.09  2.55  1.41  2.89 
6 Training Acceleration
The Intel Xeon Gold 6154 supports vector int8/int16 operations with the AVX2 instruction set, and the Nvidia T4 supports vector int8 operations. Table 3 shows the speedup of our method over float32 training. Specifically, we report the average acceleration ratio over 100 iterations for each layer of the forward-pass and backward-pass of AlexNet with batch size 256 (as the T4 does not support int16, we only report the forward-pass using int8 operations; the Xeon Gold 6154 only supports multiplication between fixed-point numbers of equal bitwidth, so in this experiment int16 × int8 is implemented as int16 × int16). Our approach achieves a 2.52× speedup over float32 training on the CPU and a 2.89× speedup on the GPU. Figure 10 shows the detailed running time for convolutions of different scales with different operation counts. Using fixed-point numbers, the computation time is significantly shorter than float32, and the extra time introduced by QEM and QPA is relatively small.
7 Conclusion and Future Work
We observe that the data distribution reflects the precision required to maintain training accuracy. Therefore, we propose an adaptive precision quantization approach which automatically determines the bitwidth layerwise. Quantifying back propagation in neural networks can further accelerate training on hardware that supports flexible-bitwidth arithmetic operations. In the future, the proposed quantization error measurement could also be extended to low-bit inference (e.g., binary or ternary) and to gradient compression.
References
 [1] (2018) Scalable methods for 8bit training of neural networks. In NeurIPS, pp. 5145–5153. Cited by: §2, Table 2.
 [2] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §1.
 [3] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §5.3.1.
 [4] (2019) Deep neural network quantization via layerwise optimization using limited training data. In AAAI, Cited by: §2.
 [5] (2015) Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274. Cited by: §5.3.2.
 [6] (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv:1805.06085. Cited by: §1, §2, Table 2.
 [7] (2018) Mixed precision training of convolutional neural networks using integer operations. In ICLR, Cited by: §1, §2, §5.3.3, Table 2.
 [8] (2019) BERT: pretraining of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §1.
 [9] (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §5.3.1.

 [10] (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256. Cited by: §3.
 [11] (2015) Deep learning with limited numerical precision. In ICML, pp. 1737–1746. Cited by: §1.
 [12] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR. Cited by: §1.
 [13] (2015) Delving deep into rectifiers: surpassing humanlevel performance on imagenet classification. In ICCV, pp. 1026–1034. Cited by: §3.
 [14] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §5.3.1.

 [15] (2017) Sockeye: a toolkit for neural machine translation. arXiv:1712.05690. Cited by: §5.3.2.
 [16] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713. Cited by: §1, §2, §5.3.1, Table 2.
 [17] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, pp. 4350–4359. Cited by: §2, Table 2.
 [18] (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §5.3.1.
 [19] (2016) Fixed point quantization of deep convolutional networks. In ICML, pp. 2849–2858. Cited by: §1.
 [20] (2014) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: §5.3.1.
 [21] (2016) SSD: single shot multibox detector. In ICCV, pp. 21–37. Cited by: §1, §5.3.1.
 [22] (2018) Mixed precision training. ICLR. Cited by: §2, Table 2.
 [23] (2017) NVIDIA tesla v100 gpu architecture. Cited by: §1.
 [24] (2018) Lower numerical precision deep learning inference and training. Intel White Paper. Cited by: §1.
 [25] (2015) Imagenet large scale visual recognition challenge. IJCV 115 (3), pp. 211–252. Cited by: §5.3.1.

 [26] (2019) Per-tensor fixed-point quantization of the back-propagation algorithm. In ICLR, Cited by: §2.
 [27] (2018) A quantization-friendly separable convolution for mobilenets. In EMC2, pp. 14–18. Cited by: §1, §5.1, §5.3.1.
 [28] (2014) Very deep convolutional networks for largescale image recognition. arXiv:1409.1556. Cited by: §5.3.1.

 [29] (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826. Cited by: §5.3.1.
 [30] (2019) Differentiable quantization of deep neural networks. arXiv:1905.11452. Cited by: §2.
 [31] (2018) Bismo: a scalable bitserial matrix multiplication overlay for reconfigurable computing. In FPL, pp. 307–3077. Cited by: §1.
 [32] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §5.3.2.
 [33] (2019) HAQ: hardwareaware automated quantization with mixed precision. In CVPR, pp. 8612–8620. Cited by: §2, §2.
 [34] (2018) Training deep neural networks with 8bit floating point numbers. In NeurIPS, pp. 7675–7684. Cited by: §2, Table 2.
 [35] (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv:1812.00090. Cited by: §1, §2, §2, Table 2.
 [36] (2018) Training and inference with integers in deep neural networks. In ICLR, Cited by: §1, §2, Table 2.
 [37] (201906) Quantization networks. In CVPR, Cited by: §2, §2, Table 2.
 [38] (2016) Dorefanet: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160. Cited by: §1, §2, Table 2.
 [39] (2018) Adaptive quantization for deep neural network. In AAAI, Cited by: §5.1, Table 2.
8 Appendix A
Appendix B. Quantification Method
Quantization Function  Quantization Scale  Fixed-point Range 

A fixed-point number consists of a sign bit and a (b−1)-bit integer, with a global quantization resolution s determined by the fixed-point position. Before quantization, the maximum absolute value of the data is max|x|. The representable data range R, the bitwidth b and the quantization resolution s are interdependent, as R = 2^(b−1) · s. The quantization resolution is calculated as in Table 4, column 2. Suppose x is the floating-point value, X is its fixed-point (integer) representation, and x̂ = X · s is the approximation of x; then the multiplication between two numbers becomes:
x1 · x2 ≈ x̂1 · x̂2 = (X1 · X2) · (s1 · s2).  (12)
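This identity is what lets integer hardware do the bulk of the work: quantify both operands, multiply the integers, and rescale once with a single floating-point factor. A minimal sketch (helper names are ours):

```python
import numpy as np

def to_fixed(x, bits=8):
    """Return the integer representation X and resolution s such that x ≈ X * s."""
    s = np.max(np.abs(x)) / (2 ** (bits - 1))
    X = np.clip(np.round(x / s), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1).astype(np.int64)
    return X, s

rng = np.random.default_rng(1)
a = rng.normal(size=(4, 5))
b = rng.normal(size=(5, 3))

A, sa = to_fixed(a)
B, sb = to_fixed(b)
# Integer matrix multiply, then one floating-point rescale by sa * sb.
approx = (A @ B) * (sa * sb)
exact = a @ b
```

With int8 operands the result matches the float32 product up to the accumulated per-element quantization error, which is why a single rescale at the end of each layer suffices.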
Appendix C. Observations on Other Network
As shown in Figure 11, for ResNet-34 int8 is enough to quantify the activation gradients of g3b2c2, g2b5c1 and g3b2c1; however, int8 for fc and conv0 either does not converge or introduces an accuracy drop, as conv0 and fc have large variances. These observations are consistent with those on AlexNet. In conclusion, data with large variance requires large bitwidth; thus the quantization parameters should be dynamically determined by the data distribution.
Appendix D. Operation Quantity
AlexNet  ResNet50  MobileNetv2  VGG16  
Forward  3.78E+11  1.78E+12  1.54E+11  7.93E+12 
Forward Quantification  6.95E+08  1.01E+10  8.68E+09  1.24E+10 
Backward  1.78E+12  5.37E+12  4.41E+11  2.88E+13 
Backward Quantification  1.90E+09  3.39E+10  2.57E+10  4.70E+10 
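The overhead claim of Section 5.2 can be recomputed directly from this table (the operation counts below are copied from the table above):

```python
# Per-network operation counts per iteration:
# (forward, forward quantification, backward, backward quantification).
ops = {
    "AlexNet":     (3.78e11, 6.95e8, 1.78e12, 1.90e9),
    "ResNet50":    (1.78e12, 1.01e10, 5.37e12, 3.39e10),
    "MobileNetv2": (1.54e11, 8.68e9, 4.41e11, 2.57e10),
    "VGG16":       (7.93e12, 1.24e10, 2.88e13, 4.70e10),
}

def quantification_overhead(net):
    """Fraction of total training operations spent on quantification."""
    fwd, fwd_q, bwd, bwd_q = ops[net]
    return (fwd_q + bwd_q) / (fwd + bwd)

for net in ops:
    print(f"{net}: {100 * quantification_overhead(net):.2f}%")
```

This reproduces the claim in Section 5.2: the quantification overhead stays within 1% for all networks except the lightweight MobileNet-v2, where it is about 5.8%.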
Appendix E. Speedup over int16
Our method achieves a 1.3× speedup over int16 on the CPU for AlexNet (1.13× for the backward-pass and 1.7× for the forward-pass). The int16 × int8 multiplications in our method are implemented as int16 × int16 on the Xeon Gold 6154. With flexible arithmetic operations such as int16 × int8 on future hardware, higher training speedups are promising.