1 Introduction
Deep convolutional neural networks (DCNNs) have achieved remarkable success in many fields, such as computer vision, natural language processing, and information retrieval. However, training and deploying DCNNs usually require a large amount of computation and power, which greatly challenges their extensive application in industry. As a result, many recent studies have focused on accelerating the inference of neural networks by fixed-point quantization of weights or activations
[6, 3, 23, 30, 27, 62, 42, 53, 61, 55, 20], and on designing dedicated hardware that exploits efficient integer arithmetic [17, 5, 26, 22]. This progress shows, perhaps surprisingly, that the bitwidth can be reduced to extremely low levels such as 4-bit while hurting inference accuracy very little [15, 57, 13].

Besides inference, low-bit training also promises considerable acceleration: it further quantizes the gradients and uses low-bit, efficient compute kernels for both the forward and backward propagation. As analyzed in [24], backward propagation takes more time than forward propagation, so low-bit quantization has even greater potential for accelerating training once the backward process is considered. 16-bit floating-point (FP16) training already exists and proves the feasibility of low-bit training [41, 9, 29], but it is restricted to a limited set of advanced GPUs based on the Turing or Volta architecture. In contrast, the 8-bit integer (INT8) operation is widely supported by general GPUs based on the Turing, Volta and even low-end Pascal architectures. Moreover, 8-bit integer arithmetic is theoretically and practically 2× faster than FP16 and 4× faster than FP32. INT8 training therefore enjoys better efficiency, lower power consumption and better versatility on off-the-shelf hardware.
Despite these attractive benefits, when gradients are quantized to 8-bit, normal training tends to become unstable, since the distortion of gradients easily misleads the training direction and causes the optimization to crash. This makes INT8 training very difficult, especially for deep networks. Currently only a few studies have attempted to solve this problem [62, 56, 58, 2, 54, 50]. Unfortunately, all of them test only a limited set of quantization-friendly networks with high redundancy, and usually require complex structural adjustments or introduce additional operations to reduce the quantization error, significantly increasing the computational complexity. Besides, most of these works lack theoretical analysis of their ad-hoc tricks, and, even worse, none of them reports practical speedup in real-world cases. All these reasons keep existing INT8 training methods far from practical, lacking a universal design.
To build a robust and unified INT8 training framework, we dig deeper into the challenges of gradient quantization. We empirically find that the distribution of gradients has four special characteristics: sharp and wide, evolutionary, depth-specific and structure-specific. These unique characteristics make gradient quantization quite different from the naive quantization of weights or activations, and make INT8 training harder to stabilize. Since it is important to understand the behavior and effect of quantized gradients on training convergence, we theoretically establish a convergence bound with respect to the gradient quantization error and the learning rate.
Based on these characteristics and the theoretical analysis, we propose two universal techniques, Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling, to stabilize INT8 training. Direction Sensitive Gradient Clipping minimizes the direction deviation by pursuing an appropriate clipping value as training evolves. Even when clipping reduces the quantization error, training may still suffer from gradient deviations accumulated across deep layers. To eliminate this effect, Deviation Counteractive Learning Rate Scaling is further devised to ensure stable parameter updates. The fundamental idea of our method is shown in Figure 1. Extensive experiments on a variety of network structures and tasks prove the superiority and versatility of our method.
Our contributions can be summarized as follows:

We observe four special characteristics of the gradient distribution (sharp and wide, evolutionary, depth-specific and structure-specific) that cause larger quantization error for gradients.

We theoretically provide the convergence bound of INT8 training, and accordingly devise two universal techniques that stabilize INT8 training.

We are the first to achieve stable INT8 training on various networks, such as MobileNetV2 and InceptionV3, and various tasks, such as object detection, with accuracy comparable to full-precision training.

We build a flexible and unified INT8 training framework for various tasks and networks, which can easily replace the original full-precision training.

We are the first to achieve practical acceleration of INT8 training on low-end GPUs with the Pascal architecture, i.e., the NVIDIA GeForce GTX 1080Ti, achieving about 22% speedup without much code optimization.
2 Related Work
Compared to the huge number of studies on accelerating inference by model quantization [47, 60, 7, 53, 11, 40], few works comprehensively explore quantized training including the backward propagation. DoReFa-Net [62] quantizes gradients to 4 and 6 bits, but only experiments on AlexNet with low-precision gradients. WAGE [56] and WAGEUBN [58] quantize gradients to 8-bit integers, but both incur a considerable loss of accuracy. RangeBN [2] and FP8 training [54] achieve accuracy comparable to full-precision models, but both use floating-point numbers for gradients, which does not benefit the hardware optimizations that boost speed. Beyond quantized training, most low-precision training research keeps the gradients in 16-bit floating-point: Flexpoint [29], MPT [41] and DFP [9] all use 16-bit floating-point to train DNNs with accuracy comparable to full-precision models. For more efficient training of neural networks, INT8 training thus has clear advantages over FP16 training.
3 Unified INT8 Training
In this paper, we aim to build a unified INT8 training framework that utilizes 8-bit integer arithmetic to accelerate the expensive training process of deep neural networks, including both the forward and backward propagation.
3.1 Preliminaries
Symmetric uniform quantization is the most efficient scheme among existing quantization methods, due to its hardware-friendly computation. Therefore, to guarantee the acceleration performance, we build the INT8 training framework based on it. Given data x (i.e., weights, activations, or gradients) and a clipping value c, the symmetric uniform quantization can be formulated as:

    q = round( clamp(x, c) / s ),  where clamp(x, c) = min(max(x, −c), c)    (1)

where s = c / (2^{8−1} − 1) = c / 127 indicates the scaling factor that projects the floating-point number onto the fixed-point 8-bit integer grid, and q represents the quantized fixed-point number. Subsequently, the corresponding dequantized data x̂ can be calculated by:

    x̂ = q · s    (2)
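Eqs. (1)–(2) can be sketched in NumPy as follows, assuming 8-bit symmetric quantization with scaling factor s = c/127 (the function names are ours, for illustration):

```python
import numpy as np

def quantize(x, c, bits=8):
    """Symmetric uniform quantization (Eq. 1): clip to [-c, c], divide by the
    scaling factor s = c / (2^(bits-1) - 1), and round to the nearest integer."""
    s = c / (2 ** (bits - 1) - 1)            # s = c / 127 for 8-bit
    q = np.round(np.clip(x, -c, c) / s)      # fixed-point integer in [-127, 127]
    return q.astype(np.int8), s

def dequantize(q, s):
    """Dequantization (Eq. 2): map the fixed-point integer back to floating point."""
    return q.astype(np.float32) * s

x = np.array([-1.5, -0.3, 0.0, 0.4, 2.0], dtype=np.float32)
q, s = quantize(x, c=1.0)       # values outside [-1, 1] are clipped
x_hat = dequantize(q, s)        # quantized-dequantized counterpart of x
```

Note how clipping and rounding interact: a larger c widens the representable range but coarsens the grid, which is exactly the trade-off the clipping search in Section 3.4 navigates.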
Different from most prior studies that mainly focus on speeding up inference (i.e., the forward propagation), our INT8 training framework further accelerates the backward propagation during training by applying quantization to the gradients. Namely, we pursue quantized-dequantized gradients derived from the full-precision gradients in a proper way.
To ensure that the quantized gradients maintain an unbiased expectation compared with the original ones, we adopt stochastic rounding following [16]:

    round_s(x) = ⌊x⌋ with probability ⌈x⌉ − x,  ⌈x⌉ with probability x − ⌊x⌋    (3)
Unfortunately, although stochastic rounding limits the quantization error to some extent in a statistical sense, the perturbation at each training iteration is still inevitable and harmful to convergence; the reasons will be discussed in the following section.
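A minimal NumPy sketch of stochastic rounding as used above: the fractional part becomes the probability of rounding up, so the expectation equals the input (the function name is ours):

```python
import numpy as np

def stochastic_round(x, rng):
    """Stochastic rounding (Eq. 3): round up with probability equal to the
    fractional part, down otherwise, so that E[round(x)] = x (unbiased)."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
vals = stochastic_round(np.full(20000, 0.3), rng)   # each value becomes 0.0 or 1.0
```

Averaged over many draws, the rounded values recover 0.3 in expectation, which is what keeps the quantized gradient unbiased even though each individual iteration is perturbed.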
3.2 Challenges of Gradient Quantization
Gradients determine the direction of optimization and the magnitude of the parameter update, and thus play a critical role in obtaining highly accurate models. In INT8 training, quantizing the gradients introduces a deviation into the optimization direction. Once the deviation accumulates to an unacceptable degree, training may become unstable or even crash, resulting in severe performance degradation. Figure 2 shows our empirical observation that for some network architectures such as MobileNetV2, directly quantizing gradients causes a rapid crash of training.
To investigate the essential reasons behind this phenomenon, we conduct a detailed analysis of the distribution of gradients during training without gradient quantization, as shown in Figure 3. We observe, to our surprise, that the gradients have the following unique characteristics:

Sharp and Wide. As shown in Figure 3(a), compared to weights and activations, gradients follow an unusual distribution: most values concentrate around zero while a certain number of extreme values also exists. The distribution curve is thus very sharp, with small values forming the majority of gradients, yet the range is relatively wide. Under uniform quantization, this causes many gradients to be quantized to zero and makes the quantization error significantly large.

Evolutionary. Figure 3(b) depicts how the gradient distribution of the same layer evolves over the training iterations. As training proceeds, the distribution becomes much sharper and narrower, which makes it impossible to fix the quantization settings throughout training, as is usually done for weights and activations (e.g., assuming the same clipping range for the whole training).

Depth-Specific. Figure 3(c) compares the distribution of gradients across layers. The distributions in the shallow layers are clearly sharper, with larger extreme values, than in the deeper layers. This means the preceding layers of a deep neural network often face more severe quantization loss.

Structure-Specific. As can be seen in Figure 3(d), the gradients of layers with different structures present clearly different patterns. For MobileNetV2, the second convolutional layer in each block has a depthwise structure. Its gradients have a larger range and sharper shape even in the deeper blocks, making MobileNetV2 harder to quantize from the gradient perspective.
Based on the above observations, we conclude that gradients differ largely from weights and activations, which inevitably causes unstable training when the common quantization techniques for weights and activations are directly adopted. We therefore need techniques that take care of the distinctiveness of gradients, which poses great challenges for real, unified INT8 training in practice.
Before devising such techniques around the speciality of gradients, we first attempt to understand the gradients' effect on training stability by theoretically revealing the connection between training convergence and gradient quantization. This provides a reliable clue for building a robust and unified INT8 training framework.
3.3 Stabilize Training: A Theoretical Perspective
As is common in the analysis of deep learning optimizers [12, 28, 48, 39], the ability to converge is usually evaluated by the regret:

    R(T) = Σ_{t=1}^{T} ( f_t(w_t) − f_t(w*) )    (4)

where T indicates the number of iterations, w_t is the parameter at time t in the convex compact set S, and f_t denotes the corresponding loss function. The optimal parameter is represented by w* = argmin_{w ∈ S} Σ_{t=1}^{T} f_t(w). If the average regret R(T)/T approaches zero quickly as T increases, the speed and ability of convergence can be guaranteed. Due to the complexity of DCNNs, it is very difficult to analyze their behavior directly. As prior studies [1, 34, 21, 59] do, we first make the following assumptions:
Assumption 1.
The loss function f_t is convex.
Assumption 2.
The domain S is bounded: ‖w_m − w_n‖ ≤ D for all w_m, w_n ∈ S.
Although the convexity assumption may not hold for deep networks, analysis based on it can still provide reasonable and valuable insights, as shown in previous studies [12, 39, 21, 59].
Taking the standard stochastic gradient descent algorithm into consideration, the optimization based on the quantized gradient ĝ_t and learning rate η_t can be formulated as:

    w_{t+1} = w_t − η_t ĝ_t    (5)

Then we have the following theoretical finding (see the supplementary materials for a detailed proof):

Theorem 1. Let ε_t = ĝ_t − g_t denote the quantization error of the gradient. Under Assumptions 1 and 2,

    R(T)/T ≤ D² / (2 T η_T)  +  (D/T) Σ_{t=1}^{T} ‖ε_t‖  +  (1/(2T)) Σ_{t=1}^{T} η_t ‖ĝ_t‖²    (6)

where D bounds the diameter of S as in Assumption 2, and the three summands on the right-hand side are referred to as terms (1), (2) and (3), respectively.
We can find that the bound of the average regret is dominated by three terms. Term (1) approaches zero as T increases and thus can be ignored in gradient quantization. Term (2) indicates that the quantization error of gradients greatly affects the ability to converge, and this error is usually large, as analyzed in Section 3.2. The magnitude of term (3) is mainly influenced by the learning rate and the ℓ2-norm of the quantized gradients. Based on this theoretical analysis, we obtain two basic principles for designing quantization techniques that stabilize INT8 training: (1) reduce the quantization error of the gradients; (2) scale down the learning rate. Both are intuitive: on the one hand, a lower quantization error means a smaller deviation of the optimization direction and thus avoids training crashes; on the other hand, gradually decreasing the learning rate is a common way of reaching a better solution in optimization.
Now with the design principles, the question is how to devise the universal techniques for INT8 training, meanwhile take the characteristics of gradients into consideration. We respectively present two novel techniques: Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling, which together lower the average regret bound and guarantee stable INT8 training.
3.4 Direction Sensitive Gradient Clipping
Considering a basic operation y = w x in deep neural networks, the gradient of the weights can actually be calculated as g_w = g_y · x. From this perspective, the quantization error of g_w in (6) mainly stems from that of the activation gradient g_y. Therefore, in our INT8 training we mainly concern ourselves with the quantization of g_y, which helps control the error of the quantized gradients in (6). For simplicity of notation, in the following discussion we directly use g to denote g_y.
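To make this dependence concrete, consider a single linear layer y = W x with upstream gradient g_y = ∂L/∂y. The chain rule gives ∂L/∂W = g_y xᵀ, so any error injected into the activation gradient lands directly in the weight gradient. A minimal NumPy check of this factorization (the names and the finite-difference probe are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal((3, 1))
g_y = rng.standard_normal((4, 1))   # upstream gradient dL/dy for y = W @ x

# Chain rule: dL/dW = (dL/dy) @ x^T -- errors in g_y propagate directly into g_W.
g_W = g_y @ x.T

# Finite-difference probe of one entry, using the surrogate loss L = g_y^T (W @ x),
# whose gradient w.r.t. W is exactly g_y @ x^T.
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps
numeric = float(g_y.T @ (W_pert @ x) - g_y.T @ (W @ x)) / eps
```

Because the weight gradient is a scaled copy of g_y (column by column), quantizing g_y well is what keeps g_W accurate.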
To minimize the quantization error, previous works mainly seek the optimal clipping value c in (1) by assuming a certain data distribution, e.g., a Gaussian distribution [3, 4, 19, 2, 21, 11]. However, according to the gradient characteristics C1 and C2 we discover, it is impractical to make a common assumption for an evolutionary and unusual gradient distribution. To further prove this point, we perform the Kolmogorov–Smirnov test with the distribution parameters solved by maximum likelihood estimation, and report in Table 1 KS-statistics that consistently reject the hypothesis that the gradients obey any common distribution.

Table 1 (KS-statistics against the critical value):
Data   | Gaussian | Laplace | Student | Critical value
layer0 | 0.1934   | 0.0790  | 0.2005  | 0.0012
       | 0.0391   | 0.0721  | 0.1011  | 0.0765
layer8 | 0.2061   | 0.1091  | 0.2303  | 0.0024
       | 0.0294   | 0.0569  | 0.1084  | 0.0110
To find the optimal clipping value without any distribution assumption, a straightforward idea is to keep the quantized gradient consistent with the original one, optimizing the clipping value by gradient descent. Usually, one would model this consistency with the popular mean-square error (MSE). Unfortunately, due to gradient characteristics C2 and C3, with huge discrepancies and fluctuations in gradient magnitude, MSE makes the optimization fragile and unable to work under the same simple setting across different layers.
Therefore, to pursue clipping values for different layers that promise stable training, we instead use the cosine distance to guide the learning of the clipping values, which both avoids the negative effect of varying gradient magnitudes and keeps the optimization direction of the network consistent:

    d_c = 1 − cos(g, ĝ) = 1 − (g · ĝ) / (‖g‖ ‖ĝ‖)    (7)

where g and ĝ denote the original floating-point gradient and its quantized-dequantized counterpart.
The cosine distance d_c measures the direction deviation of the quantized gradients. As shown in Figure 4, when d_c increases to a certain level, the whole training crashes. This strong correlation between d_c and training stability shows that the cosine distance effectively reflects the influence of gradient quantization on convergence. By minimizing the deviation, we in turn reduce term (2) in (6). Figure 5(a) shows the quantization error for different clipping values; there exists an optimal clipping value that substantially reduces the cosine distance.
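The clipping search guided by Eq. (7) can be sketched as follows. The paper learns the clipping value by gradient descent; for illustration we use a simple grid search over candidate values on a synthetic sharp-and-wide gradient (all function names, the synthetic data, and the candidate grid are our assumptions):

```python
import numpy as np

def fake_quant(x, c, bits=8):
    """Quantize-dequantize with symmetric uniform quantization and clipping value c."""
    s = c / (2 ** (bits - 1) - 1)
    return np.round(np.clip(x, -c, c) / s) * s

def cosine_distance(g, g_hat):
    """d_c = 1 - cos(g, g_hat): direction deviation caused by quantization (Eq. 7)."""
    denom = np.linalg.norm(g) * np.linalg.norm(g_hat) + 1e-12
    return 1.0 - float(np.dot(g, g_hat)) / denom

def search_clipping(g, candidates):
    """Pick the clipping value whose quantized gradient deviates least in direction."""
    return min(candidates, key=lambda c: cosine_distance(g, fake_quant(g, c)))

# A synthetic sharp-and-wide gradient: many near-zero values plus a few extremes.
rng = np.random.default_rng(0)
g = np.concatenate([rng.normal(0.0, 1e-3, 10000), rng.normal(0.0, 0.5, 16)])
grid = np.abs(g).max() * np.linspace(0.05, 1.0, 20)
best_c = search_clipping(g, grid)
```

By construction the selected clipping value never deviates more than the no-clipping choice c = max|g|, since the latter is itself a candidate.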
3.5 Deviation Counteractive Learning Rate Scaling
The theoretical analysis of the convergence of quantized training indicates the necessity of scaling down the learning rate, since the quantization error of gradients cannot vanish completely. To validate this point, we decrease the learning rate of the crashed MobileNetV2 training mentioned in Section 3.2 and find that lowering it defers the crash, and an extremely low learning rate even eliminates it, albeit with a performance degradation (see the red, green and orange lines in Figure 5(b)).
Since gradients are propagated backward layer by layer, minor gradient deviations accumulate exponentially through massive multiply-add computations. To address this issue, we further propose Deviation Counteractive Learning Rate Scaling, which balances out the error by exponentially decaying the learning rate according to the degree of direction deviation d_c. The scaling function is formulated as:

    φ(d_c) = max( e^{−α·d_c}, β )    (8)

where α controls the decay degree and β limits the lower bound of the scaling.
This scaling function generates a factor that scales down the original full-precision learning rate. We empirically find that this self-adapting scaling function performs well in a layer-wise fashion, adaptively adjusting the learning rate according to the direction deviation of each layer. This counteracts the undesired effects of gradient deviations across layers, and exactly addresses the depth-specific and structure-specific patterns observed as characteristics C3 and C4 in Section 3.2. The blue line in Figure 5(b) shows that training equipped with this scaling achieves higher accuracy than manually adjusted learning rates (tested with MobileNetV2 on CIFAR-10).
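A minimal sketch of the scaling function in (8), assuming the exponential-decay-with-floor form described in the text, with the paper's default α = 20 and β = 0.1:

```python
import math

def lr_scale(d_c, alpha=20.0, beta=0.1):
    """Deviation Counteractive Learning Rate Scaling (Eq. 8): exponentially
    decay the scale factor with the direction deviation d_c, floored at beta."""
    return max(math.exp(-alpha * d_c), beta)

# Layer-wise usage: each layer scales the global learning rate by its own factor.
base_lr = 0.1
layer_lrs = [base_lr * lr_scale(d) for d in (0.0, 0.05, 0.5)]
```

Layers with no direction deviation keep the full learning rate (φ(0) = 1), while heavily deviated layers never drop below β times the base rate, so updates slow down but never stop.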
3.6 General Purpose Training Framework
Table 2:
Period                | 1     | 10    | 100   | 1000
Average time (s/iter) | 1.006 | 0.364 | 0.301 | 0.297
In addition to ensuring stable and accurate convergence, a practical unified INT8 training framework should also satisfy the following three requirements:
(1) Easy to plug into any DCNN architecture.
To realize this, we implement an automatic match-and-replace mechanism in PyTorch [46] that substitutes convolutional and fully-connected layers with their 8-bit counterparts. The whole workflow, including both the forward and backward passes, is shown in Figure 6.

(2) No excessive extra computational overhead. To avoid the extra time cost of calculating the clipping value, we design a Periodic Update method that optimizes the clipping value only periodically. As Table 2 shows, Periodic Update dramatically reduces the computational overhead of optimizing the clipping value.
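The Periodic Update idea can be sketched as a small cache: re-run the (potentially expensive) clipping search only every `period` iterations and reuse the cached value in between. The class and the `search_fn` interface here are our assumptions, not the paper's implementation:

```python
class PeriodicClipping:
    """Cache a per-layer clipping value and refresh it every `period` iterations."""

    def __init__(self, period=100):
        self.period = period
        self.step = 0
        self.c = None          # cached clipping value

    def get(self, grad, search_fn):
        # search_fn(grad) -> optimal clipping value; invoked only periodically.
        if self.c is None or self.step % self.period == 0:
            self.c = search_fn(grad)
        self.step += 1
        return self.c
```

With period 100, the search cost is amortized over 100 iterations, which matches the trend in Table 2 where the average per-iteration time plateaus as the period grows.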
(3) Easy to implement on off-the-shelf hardware.
To validate this potential, we utilize the DP4A instruction (8-bit integer 4-element vector dot product) on low-end NVIDIA Pascal GPUs to implement efficient 8-bit kernels for calculating the gradients. To the best of our knowledge, we are the first to achieve practical acceleration of INT8 training including the backward propagation. The detailed speedup is reported and discussed in Section 4.4.

4 Experiments
We conduct extensive experiments to demonstrate that our proposed framework is unified across various network structures on the popular image classification and object detection tasks with state-of-the-art accuracy, and that it can be easily deployed on mainstream devices (NVIDIA Pascal GPUs) with satisfactory speedup compared to full-precision training.
4.1 Ablation Study
Settings. We first conduct an ablation study on the CIFAR-10 dataset with MobileNetV2 [51] to validate the effectiveness of the proposed techniques. We use a cosine scheduler [1] with the initial learning rate set to 0.1 for all experiments. In the Periodic Update experiment, the α and β in learning rate scaling are set to 20 and 0.1, respectively.
Direction Sensitive Gradient Clipping. Figure 7(a) shows the cosine distance with respect to the training steps. We observe that conv2 (the second convolutional layer) of each block has a much larger cosine distance than the other layers of the block most of the time. This is consistent with C4, which notes that the gradients of conv2 have a sharper shape, indicating that our cosine distance reflects the gradient characteristics well.
Moreover, as Table 3 lists, our proposed direction sensitive gradient clipping indeed prevents INT8 training from crashing, confirming that optimizing a clipping value of the gradients to minimize direction deviation ensures stable INT8 training.
Table 3:
Clipping method | No clipping | Ours
Accuracy (%)    | NaN         | 93.02
Deviation Counteractive Learning Rate Scaling. We evaluate three forms of learning rate scaling strategies without clipping, to control variables for a fair comparison. The results in Figure 7(b) reveal that the linear and quadratic forms are too weak to keep the optimization direction within the convergence boundary, and the model crashes during training. Compared with the linear and quadratic forms, the exponential scaling is more powerful in counteracting the direction deviation and prevents the optimization from stepping outside the convergence boundary. We further explore its sensitivity to the hyperparameter selection in Table 4; different settings of α and β achieve similar accuracy, which demonstrates the stability of our Deviation Counteractive Learning Rate Scaling.

Table 4:
α            | 10    | 10    | 20    | 20
β            | 0.1   | 0.2   | 0.1   | 0.2
Accuracy (%) | 92.82 | 93.28 | 93.38 | 93.27
Periodic Update of the clipping value. To reduce the extra computational overhead, we increase the period of updating the clipping value and find that it hurts accuracy very little, as shown in Table 5. This empirical conclusion opens up the practical acceleration of INT8 training. Note that here we apply both gradient clipping and learning rate scaling, and obtain better performance (see the period-1 entry) than in Tables 3 and 4, further verifying the positive effects of the two general techniques.
Table 5:
Period       | 1     | 10    | 100   | 1000
Accuracy (%) | 93.66 | 93.07 | 93.38 | 92.75
4.2 Image Classification
Now we consider the popular image classification task that most prior studies choose for evaluating quantization performance. We experiment with AlexNet [32], ResNet [18], MobileNetV2 [51] and InceptionV3 [52] on CIFAR-10 [31] and ImageNet (ILSVRC2012) [10]. The CIFAR-10 dataset contains a training set of 50K images and a testing set of 10K images, each of size 32×32, with 10 classes. ImageNet (ILSVRC2012) consists of 1.2 million training images and 50K test images with 1000 classes.

Settings. For the hyperparameters of ResNet, we use the same settings described in [18]. For the other networks, we use a cosine scheduler [1] with the initial learning rate set to 0.1. The α and β in learning rate scaling are set to 20 and 0.1, respectively. The clipping value is updated every 100 iterations in all experiments.
CIFAR-10. As Table 6 shows, our method achieves accuracy comparable to FP8 training on ResNet20, while consuming much less memory and computation thanks to fixed-point operations. Moreover, our method performs surprisingly well on MobileNetV2 (only a 1.01% accuracy drop) and InceptionV3 (even better than the full-precision model).
ImageNet. Table 7 compares against existing state-of-the-art quantized training methods, including WAGE [56], WAGEUBN [58] and FP8 training [54]. For AlexNet INT8 training, our method obtains a 5.84% improvement over DoReFa-Net [62]. Free from extra overhead, our method also enjoys higher efficiency than DoReFa-Net. As for the 2-bit weight and 8-bit activation/gradient case, we significantly outperform WAGE with about a 3% accuracy gain. What's more, equipped with our method, INT8 training for the ResNet architecture achieves almost no performance degradation, which none of the previous studies has achieved. Compared with the FP8 training method, our method improves accuracy by nearly 3%. It should be noted that we directly obtain a real speedup on popular off-the-shelf devices, while methods like FP8 training need specially designed hardware; our framework is thus more general for unified training acceleration.
As analyzed in [36], the convolutional layers occupy most of the training time, while other layers such as BatchNorm and ReLU are not computation-intensive. Therefore, we currently focus on quantizing the convolutional layers and do not quantize the BatchNorm layer as RangeBN [2] and WAGEUBN [58] do. Even so, INT8 training still yields a significant speedup. In addition, we obtain accuracy comparable to full-precision training, much higher than RangeBN and WAGEUBN.

Networks using INT8 training for the first time. To the best of our knowledge, we are the first to quantize the gradients of MobileNetV2, which is known to be difficult in this community. Our method achieves very good performance with MobileNetV2 on both CIFAR-10 and ImageNet, with only around 1% accuracy loss. We also apply INT8 training to InceptionV3 for the first time, achieving accuracy comparable to the full-precision model. Note that for InceptionV3 on CIFAR-10, our INT8 training can even outperform the full-precision model.
Table 6:
Model       | Method            | W/A/G    | Accuracy (%)
ResNet20    | FP                | 32/32/32 | 92.32
            | FP8 training [54] | 8/8/8    | 92.21
            | Ours              | 8/8/8    | 91.95
MobileNetV2 | FP                | 32/32/32 | 94.39
            | Ours              | 8/8/8    | 93.38
InceptionV3 | FP                | 32/32/32 | 94.89
            | Ours              | 8/8/8    | 95.00
Table 7:
Model       | Method            | W/A/G    | Accuracy (%)
AlexNet     | FP                | 32/32/32 | 59.84
            | DoReFa-Net [62]   | 8/8/8    | 53.00
            | Ours              | 8/8/8    | 58.84
            | WAGE [56]         | 2/8/8    | 48.40
            | Ours              | 2/8/8    | 51.28
ResNet18    | FP                | 32/32/32 | 70.30
            | WAGEUBN [58]      | 8/8/8    | 66.92
            | FP8 training [54] | 8/8/8    | 67.34
            | Ours              | 8/8/8    | 69.67
ResNet34    | FP                | 32/32/32 | 73.68
            | WAGEUBN [58]      | 8/8/8    | 68.50
            | Ours              | 8/8/8    | 73.29
ResNet50    | FP                | 32/32/32 | 76.60
            | WAGEUBN [58]      | 8/8/8    | 69.07
            | Ours              | 8/8/8    | 76.34
MobileNetV2 | FP                | 32/32/32 | 72.39
            | Ours              | 8/8/8    | 71.20
InceptionV3 | FP                | 32/32/32 | 72.39
            | Ours              | 8/8/8    | 71.20
4.3 Object Detection
To prove the versatility of our method, we further conduct experiments with the popular object detection networks, including Faster R-CNN [49], R-FCN [8] and RetinaNet [37], on two widely used datasets: PASCAL VOC [14] and COCO [38]. The PASCAL VOC dataset consists of 11K images with 20 classes, and the COCO dataset contains more than 20K images and 80 object categories. Note that we are the first to successfully achieve INT8 training on the object detection task.
Settings. For the hyperparameters, we follow the same rules described in [35]. The α and β for learning rate scaling are the same as those used in the image classification task.
PASCAL VOC. We test R-FCN and Faster R-CNN with different backbones and find that quantized training with our method suffers only a very slight drop in detection accuracy (mAP). The R-FCN result shows that even for a deeper backbone such as ResNet101, our INT8 training still maintains almost the same accuracy as full precision.
COCO. On the large-scale COCO dataset, we experiment with RetinaNet (one-stage) and Faster R-CNN (two-stage). Our method performs stably, with less than 1.8% mAP degradation on both networks. We find that RetinaNet incurs a higher mAP loss than Faster R-CNN, which is inconsistent with the conclusions of the previous study [35]. This may be because the focal loss used in the one-stage detector is more sensitive to gradient quantization.
Table 8 (PASCAL VOC):
Model        | Backbone  | Method | W/A/G    | mAP (%)
Faster R-CNN | ResNet50  | FP     | 32/32/32 | 82.0
Faster R-CNN | ResNet50  | Ours   | 8/8/8    | 81.9
R-FCN        | ResNet101 | FP     | 32/32/32 | 80.8
R-FCN        | ResNet101 | Ours   | 8/8/8    | 79.1
Table 9 (COCO):
Model        | Backbone | Method | W/A/G    | mAP (%)
Faster R-CNN | ResNet50 | FP     | 32/32/32 | 36.2
Faster R-CNN | ResNet50 | Ours   | 8/8/8    | 34.95
RetinaNet    | ResNet50 | FP     | 32/32/32 | 36.9
RetinaNet    | ResNet50 | Ours   | 8/8/8    | 35.1
Table 10:
Precision    | Forward (s) | Backward (s) | Iteration (s)
FP32 (cuDNN) | 0.117       | 0.221        | 0.360
INT8 (ours)  | 0.101       | 0.171        | 0.293
4.4 Speed Result on NVIDIA GPU
None of the existing libraries directly supports complete INT8 training, so we implement it ourselves on an NVIDIA Pascal GPU using the DP4A instruction to verify the acceleration power of our method. The speedup of each convolutional layer in ResNet50 is shown in Figure 8. With our solution, INT8 brings an average 1.63× speedup in the forward pass and an even higher 1.94× speedup in the backward pass. Table 10 further reports the time consumption and speed improvement of each training iteration. Even though we only replace the FP32 convolutional layers with slightly optimized INT8 ones, the training time of ResNet50 is reduced by about 22%.
5 Conclusions
In this paper, we attempt to build a unified INT8 training framework for common DCNNs. We identified four distinctive characteristics of gradients, established a convergence bound, and derived from it two principles for stabilizing training. Based on these, we proposed Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling. Extensive experiments prove the versatility of our method across various networks and tasks. We reduced the training time by 22% on a Pascal GPU with only trivial optimization; with each layer sufficiently optimized, training would achieve an even higher speedup and lower memory consumption. We hope our first successful attempt helps lead the community towards fully unified INT8 training.
References
 [1] (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems 30, pp. 1709–1720. Cited by: §3.3, §4.1, §4.2.
 [2] (2018) Scalable methods for 8-bit training of neural networks. External Links: 1805.11046. Cited by: §1, §2, §3.4, §4.2.
 [3] (2018) Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723. Cited by: §1, §3.4.
 [4] (2017-07) Deep learning with low precision by half-wave gaussian quantization. In CVPR, Cited by: §3.4.

 [5] (2019) An instruction set architecture for machine learning. ACM Transactions on Computer Systems (TOCS) 36 (3), pp. 9. Cited by: §1.
 [6] (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1.
 [7] (2015) BinaryConnect: training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363. Cited by: §2.
 [8] (2016-12) R-FCN: object detection via region-based fully convolutional networks. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Cited by: §4.3.
 [9] (2018-05) Mixed precision training of convolutional neural networks using integer operations. In ICLR, Cited by: §1, §2.

 [10] (2009-07) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.
 [11] (2019-06) Regularizing activation distribution for training binarized deep networks. In CVPR, Cited by: §2, §3.4.
 [12] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §3.3, §3.3.
 [13] (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. Cited by: §1.
 [14] (2010-06) The pascal visual object classes (voc) challenge. Int. J. Comput. Vision. Cited by: §4.3.
 [15] (2019-10) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In ICCV, Cited by: §1.
 [16] (2015-07) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1737–1746. Cited by: §3.1.
 [17] (2016-06) EIE: efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44 (3), pp. 243–254. External Links: ISSN 01635964, Link, Document Cited by: §1.
 [18] (2016-06) Deep residual learning for image recognition. CVPR. External Links: ISBN 9781467388511, Link, Document Cited by: §4.2, §4.2.
 [19] (2019-06) Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. In CVPR, Cited by: §3.4.
 [20] (2018-05) Loss-aware weight quantization of deep networks. In ICLR, Cited by: §1.
 [21] (2019-05) Analysis of quantized models. In ICLR, Cited by: §3.3, §3.3, §3.4.
 [22] Ascend 310. Note: https://e.huawei.com/se/products/cloudcomputingdc/atlas/ascend310 Cited by: §1.
 [24] (2016) cnn-benchmarks. GitHub. Note: https://github.com/jcjohnson/cnn-benchmarks Cited by: §1.
 [25] (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22Nd ACM International Conference on Multimedia, MM ’14, New York, NY, USA. Cited by: §5, §6.3.1.

 [26] (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §1.
 [27] (2019-06) Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, Cited by: §1.
 [28] (2015-05) Adam: A method for stochastic optimization. In ICLR, Cited by: §3.3.
 [29] (2017-12) Flexpoint: an adaptive numerical format for efficient training of deep neural networks. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Cited by: §1, §2.
 [30] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1.
 [31] (2014) The cifar-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, pp. 4. Cited by: §4.2.
 [32] (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 1, NIPS’12. Cited by: §4.2.
 [33] (2006) High performance convolutional neural networks for document processing.. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule, France. Cited by: §6.3.1.
 [34] (2017) Training quantized nets: a deeper understanding. In Advances in Neural Information Processing Systems 30, pp. 5811–5821. Cited by: §3.3.
 [35] (2019-06) Fully quantized network for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3, §4.3.
 [36] (2016) Performance analysis of gpu-based convolutional neural networks. In 2016 45th International Conference on Parallel Processing (ICPP), pp. 67–76. Cited by: §4.2.
 [37] (2017-10) Focal loss for dense object detection. In ICCV, Cited by: §4.3.
 [38] (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), pp. 740–755. Cited by: §4.3.
 [39] (2019-05) Adaptive gradient methods with dynamic bound of learning rate. In ICLR, Cited by: §3.3, §3.3.
 [40] (2018) Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191. Cited by: §2.
 [41] (2018-05) Mixed precision training. In ICLR, Cited by: §1, §2.
 [42] (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852. Cited by: §1.
 [43] cuDNN Documentation. Note: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html Cited by: §6.3.1, §6.3.1.
 [44] PTX ISA. Note: https://docs.nvidia.com/cuda/parallel-thread-execution/ Cited by: §6.3.1, §6.3.1.
 [45] Random number generators: good ones are hard to find. Commun. ACM. Cited by: §6.3.2.
 [46] (2017) Automatic differentiation in pytorch. Cited by: §3.6.
 [47] (2016) XNOR-Net: imagenet classification using binary convolutional neural networks. Lecture Notes in Computer Science, pp. 525–542. External Links: ISBN 9783319464930, ISSN 16113349, Link, Document Cited by: §2.
 [48] (2018-05) On the convergence of adam and beyond. In ICLR, Cited by: §3.3.
 [49] (2015-12) Faster R-CNN: towards real-time object detection with region proposal networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Cited by: §4.3.
 [50] (2019-05) Per-tensor fixed-point quantization of the back-propagation algorithm. In ICLR, Cited by: §1.
 [51] (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §4.1, §4.2.
 [52] (2016-06) Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
 [53] (2018) HAQ: hardware-aware automated quantization. arXiv preprint arXiv:1811.08886. Cited by: §1, §2.
 [54] (2018-12) Training deep neural networks with 8-bit floating point numbers. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, Cited by: §1, §2, §4.2, Table 6, Table 7.
 [55] (2018-06) Two-step quantization for low-bit neural networks. IEEE CVPR. Cited by: §1.
 [56] (2018-05) Training and inference with integers in deep neural networks. In ICLR, Cited by: §1, §2, §4.2, Table 7.
 [57] (2019-06) Quantization networks. In CVPR, Cited by: §1.
 [58] (2019) Training high-performance and large-scale deep neural networks with full 8-bit integers. External Links: 1909.02384 Cited by: §1, §2, §4.2, §4.2, Table 7.
 [59] (2019) Blended coarse gradient descent for full quantization of deep neural networks. Research in the Mathematical Sciences 6 (1), pp. 14. Cited by: §3.3, §3.3.
 [60] (2018-09) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In ECCV, Cited by: §2.
 [61] (2017) Incremental network quantization: towards lossless cnns with lowprecision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
 [62] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160. External Links: Link Cited by: §1, §1, §2, §4.2, Table 7.
6 Supplementary Material
6.1 Proof of Theorem 1
Assumption 1.
The loss function $f_t$ is convex;
Assumption 2.
For any $t$, $\lVert w_t - w^{*} \rVert_\infty \le D_\infty$ and $\lVert \hat{g}_t \rVert_2 \le G$.
Proof.
Considering the update for the $i$-th entry of the weight of layer $l$,
(9) $w^{(l)}_{t+1,i} = w^{(l)}_{t,i} - \eta^{(l)} \hat{g}^{(l)}_{t,i},$
we have
(10) $\big(w^{(l)}_{t+1,i} - w^{*(l)}_{i}\big)^2 = \big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big)^2 - 2\eta^{(l)} \hat{g}^{(l)}_{t,i}\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big) + \big(\eta^{(l)}\big)^2 \big(\hat{g}^{(l)}_{t,i}\big)^2.$
Rearrange the equation, and divide by $2\eta^{(l)}$ on both sides, as $\eta^{(l)}$ is nonzero:
(11) $\hat{g}^{(l)}_{t,i}\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big) = \frac{\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big)^2 - \big(w^{(l)}_{t+1,i} - w^{*(l)}_{i}\big)^2}{2\eta^{(l)}} + \frac{\eta^{(l)}}{2}\big(\hat{g}^{(l)}_{t,i}\big)^2.$
The error of quantized gradients is defined as
(12) $\epsilon_t = \hat{g}_t - g_t.$
Replace $\hat{g}^{(l)}_{t,i}$ in (11) with $g^{(l)}_{t,i} + \epsilon^{(l)}_{t,i}$, and we can get that
(13) $g^{(l)}_{t,i}\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big) = \frac{\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big)^2 - \big(w^{(l)}_{t+1,i} - w^{*(l)}_{i}\big)^2}{2\eta^{(l)}} + \frac{\eta^{(l)}}{2}\big(\hat{g}^{(l)}_{t,i}\big)^2 - \epsilon^{(l)}_{t,i}\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big).$
According to assumption 1,
(14) $f_t(w_t) - f_t(w^{*}) \le g_t^{\top}(w_t - w^{*}) = \sum_{l}\sum_{i} g^{(l)}_{t,i}\big(w^{(l)}_{t,i} - w^{*(l)}_{i}\big).$
So combine (13) and (14), sum over the dimensions of $w$ and the iterations, then the regret
(15) $R(T) = \sum_{t=1}^{T}\big[f_t(w_t) - f_t(w^{*})\big] \le \sum_{l}\sum_{i}\frac{\big(w^{(l)}_{1,i} - w^{*(l)}_{i}\big)^2}{2\eta^{(l)}} + \sum_{t=1}^{T}\sum_{l}\frac{\eta^{(l)}}{2}\big\lVert \hat{g}^{(l)}_{t} \big\rVert_2^2 - \sum_{t=1}^{T}\epsilon_t^{\top}(w_t - w^{*}).$
Combine (15) with the assumption 2, and we can further relax the above (15) to
(16) $R(T) \le \sum_{l}\frac{d^{(l)} D_\infty^2}{2\eta^{(l)}} + \sum_{t=1}^{T}\sum_{l}\frac{\eta^{(l)}}{2}\big\lVert \hat{g}^{(l)}_{t} \big\rVert_2^2 - \sum_{t=1}^{T}\epsilon_t^{\top}(w_t - w^{*}),$
where $d^{(l)}$ is the number of weight entries of layer $l$. Assume that all layers have the same learning rate $\eta$, then
(17) $R(T) \le \frac{d D_\infty^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\lVert \hat{g}_{t} \rVert_2^2 - \sum_{t=1}^{T}\epsilon_t^{\top}(w_t - w^{*}),$
where $d = \sum_{l} d^{(l)}$. Based on Cauchy’s inequality and assumption 2, we finally get
(18) $R(T) \le \frac{d D_\infty^2}{2\eta} + \frac{\eta T G^2}{2} + \sqrt{d}\,D_\infty\sum_{t=1}^{T}\lVert \epsilon_t \rVert_2.$
Thus the average regret
(19) $\frac{R(T)}{T} \le \frac{d D_\infty^2}{2\eta T} + \frac{\eta G^2}{2} + \frac{\sqrt{d}\,D_\infty}{T}\sum_{t=1}^{T}\lVert \epsilon_t \rVert_2.$
∎
6.2 INT8 Training Stability
We plot the accuracy and loss curves of MobileNetV2 trained on the CIFAR-10 dataset and of ResNet-50 trained on the ImageNet dataset to show the stability of INT8 training. From Figure 9 and Figure 10, we can see that our method makes INT8 training smooth and achieves accuracy comparable to FP32 training. Quantization noise also increases the exploratory ability of INT8 training, since the noise at the early stage of training makes the optimization directions more diverse; with a properly reduced learning rate, INT8 training sometimes even converges faster than FP32 training.
6.3 INT8 Convolution Speedup Algorithm
6.3.1 INT8 Convolution
On NVIDIA GPUs with the Pascal architecture (such as GP102, GP104, and GP106), the 8-bit integer 4-element dot product with accumulation (DP4A) [44] instruction is supported. This enables the NVIDIA GeForce GTX 1080 Ti (based on GP102) to achieve a peak integer throughput of 44 Tera Operations Per Second (TOPS), while its peak float throughput is only 11 Tera Float Operations Per Second (TFLOPS).
Since the release of cuDNN 6.0 [43], INT8 inference has been supported, but the INT8 backward process is not implemented. We therefore use the DP4A instruction to implement the INT8 backward process ourselves. Moreover, we find that the quantization step before the INT8 convolution computation is quite time-consuming, since quantization needs to read and write the whole data. To reduce this overhead, we fuse the quantization process with the convolution computation (a quantization-convolution fused kernel). As Figure 11 shows, combining quantization and convolution avoids one extra global memory read and write. We thus rewrite the INT8 forward and backward processes using this quantization-convolution fused kernel and achieve a significant speedup.
In our implementation, we transpose the data layout into NC4HW so that we can use the DP4A instruction to conduct the convolution computation. We use the prmt instruction of the Parallel Thread Execution Instruction Set Architecture (PTX ISA) [44] to transpose the data efficiently. This instruction picks four arbitrary bytes from two 32-bit registers and reassembles them into a 32-bit destination register. Figure 12 shows that one thread can transpose a 4×4 8-bit integer block by using 12 such instructions with shared memory.
After transposition, we use two kinds of algorithms, im2col plus GEMM [25, 33] and implicit GEMM [43], to implement convolution, and choose the faster algorithm for each convolution layer before training. Both algorithms convert the original convolution into dot products. We then use one float load instruction to load four INT8 values and one DP4A instruction to perform a 4-element INT8 dot product with accumulation. This speeds up the INT8 convolution significantly.
6.3.2 Stochastic Rounding
Due to the use of stochastic rounding when quantizing gradients, we need to generate uniform random numbers during the backward process. One way is to pre-generate random random numbers with the cuRAND host API, but this requires extra global memory accesses, which significantly degrade our INT8 convolution performance, increasing the time consumption by over 100%. Another method is to use the cuRAND device API, where each thread must maintain a unique random state to obtain different random numbers, which requires a large amount of GPU memory. Worse still, this method runs as slowly as the first one. Considering both disadvantages, we instead use a Linear Congruential Generator (LCG) [45] to yield a sequence of pseudo-random numbers.
The generator is defined by the recurrence relation
(20) $X_{n+1} = (a X_n + c) \bmod m, \quad n \ge 0,$
where $X$ is the sequence of pseudo-random values, $m$ is the modulus, $a$ is the multiplier, $c$ is the increment, and $X_0$ is the random seed. The parameters $m$, $a$ and $c$ are set to constants.
In order to get a different random seed in each thread, we set the random seed $X_0$ according to the first input data and add the thread index to it. With the above settings, each thread gets a unique random seed. The LCG method generates random numbers quickly and adds only slight time consumption to the INT8 convolution.