Towards Unified INT8 Training for Convolutional Neural Network

12/29/2019 ∙ by Feng Zhu, et al. ∙ SenseTime Corporation ∙ Beihang University

Recently, low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate inference. Besides inference, low-bit training with quantized gradients can bring further acceleration, since the backward process is often computation-intensive. Unfortunately, inappropriate quantization of backward propagation usually makes training unstable and can even cause it to crash. A successful unified low-bit training framework that supports diverse networks on various tasks is still lacking. In this paper, we attempt to build a unified 8-bit (INT8) training framework for common convolutional neural networks from the aspects of both accuracy and speed. First, we empirically find four distinctive characteristics of gradients, which provide insightful clues for gradient quantization. Then, we theoretically give an in-depth analysis of the convergence bound and derive two principles for stable INT8 training. Finally, we propose two universal techniques: Direction Sensitive Gradient Clipping, which reduces the direction deviation of gradients, and Deviation Counteractive Learning Rate Scaling, which avoids illegal gradient updates along the wrong direction. The experiments show that our unified solution promises accurate and efficient INT8 training for a variety of networks and tasks, including MobileNetV2, InceptionV3 and object detection, on which prior studies have never succeeded. Moreover, it is flexible enough to run on off-the-shelf hardware, and reduces the training time by 22% on a Pascal GPU without much optimization effort. We believe that this pioneering study will help lead the community towards a fully unified INT8 training for convolutional neural networks.




1 Introduction

Deep convolutional neural networks (DCNNs) have achieved remarkable success in many fields, such as computer vision, natural language processing and information retrieval. However, training and deploying DCNNs usually require a large amount of computation and power, which greatly hinders their extensive application in industry. As a result, many recent studies have focused on accelerating the inference of neural networks by fixed-point quantization of weights or activations [6, 3, 23, 30, 27, 62, 42, 53, 61, 55, 20], and on designing dedicated hardware that utilizes efficient integer arithmetic [17, 5, 26, 22]. This progress shows, surprisingly, that the bit-width can be reduced to as low as 4 bits while causing very little loss of inference accuracy [15, 57, 13].

Figure 1: The fundamental idea of our unified INT8 training. The figure contrasts the original float gradient with its quantized counterpart and marks the different direction deviations that quantization brings. The red lines present crash cases in which the direction deviation is large. The left subfigure indicates that clipping the gradient properly, so that the direction deviation stays within the convergence boundary, can avoid a crash. The right subfigure shows that controlling the learning rate (step size) can promise stable parameter updating by counteracting the negative effect of the deviation.

Besides inference, low-bit training can also promise considerable acceleration: it further quantizes gradients and utilizes efficient low-bit compute kernels for both the forward and backward propagation. As analyzed in [24], backward propagation takes more time than forward propagation, so accelerating training with low-bit quantization has greater potential once the backward process is considered. 16-bit floating-point (FP16) training already exists and proves the feasibility of low-bit training [41, 9, 29], but it is restricted to a limited set of advanced GPUs based on the Turing or Volta architecture. Compared with FP16, the 8-bit integer (INT8) operation is widely supported by general GPUs based on Turing, Volta and even low-end Pascal architectures. Besides, 8-bit integer arithmetic is theoretically and practically 2× faster than FP16 and 4× faster than FP32. Therefore, INT8 training enjoys better efficiency, lower power consumption and better versatility on off-the-shelf hardware.

Despite the attractive benefits, when gradients are quantized to 8 bits, normal training tends to become unstable, since the distortion of gradients easily misleads the direction of training and causes the optimization to crash. This makes INT8 training very difficult, especially for deep networks. Currently only a few studies have attempted to solve this problem [62, 56, 58, 2, 54, 50]. Unfortunately, all of them tested only a limited set of quantization-friendly networks with high redundancy, and they usually require complex structure adjustments or introduce additional operations to reduce quantization error, which significantly increases the computational complexity. Besides, most of these works lack theoretical analysis of their ad-hoc tricks, and, even worse, none of them reports the practical speedup in a real-world case. For all these reasons, existing INT8 training methods remain far from practical and lack a universal design.

To build a robust and unified INT8 training framework, we conduct deeper explorations in the challenges of gradient quantization. We empirically find that the distribution of gradients owns four special characteristics: sharp and wide, evolutionary, depth-specific and structure-specific. These unique characteristics make gradient quantization quite different from the naive quantization on weights or activations, and INT8 training more difficult to be stabilized. It is important to understand the behaviors and effects of quantized gradient in the convergence of the training. Therefore, we theoretically establish the convergence bound with respect to the gradient quantization error and the learning rate.

Based on the special characteristics and the theoretical analysis, we propose two universal techniques: Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling to stabilize the INT8 training. The Direction Sensitive Gradient Clipping minimizes the direction deviation by pursuing an appropriate clipping as the training process evolves. Sometimes even if the clipping helps reduce the quantization error, it may still suffer from the accumulated gradient deviations across deep layers. To eliminate this effect, the Deviation Counteractive Learning Rate Scaling is further devised to promise stable parameter updating. The fundamental idea of our method is shown in Figure 1. Extensive experiments on a variety of network structures and tasks prove the superiority and versatility of our method.

Our contribution can be summarized as below:

  • We observe four special characteristics on the gradient distribution: sharp and wide, evolutionary, depth-specific and structure-specific, which cause the larger quantization error of gradients.

  • We theoretically provide the convergence bound of INT8 training, and respectively devise two universal techniques that can stabilize the INT8 training.

  • We are the first to achieve stable INT8 training of various networks such as MobileNetV2/InceptionV3 and various tasks such as object detection, with comparable accuracy to full-precision training.

  • We build a flexible and unified INT8 training framework for various tasks using various networks, which can easily replace the original full-precision training.

  • We are the first to complete practical acceleration of INT8 training on low-end GPUs with Pascal architecture, i.e., NVIDIA GeForce GTX 1080Ti, achieving about 22% speedup without too much code optimization.

2 Related Work

Compared to the huge number of studies on accelerating inference by model quantization [47, 60, 7, 53, 11, 40], few works comprehensively explore quantized training including backward propagation. DoReFa-Net [62] quantizes gradients to 4 and 6 bits, but only experiments on AlexNet with low-precision gradients. WAGE [56] and WAGEUBN [58] quantize gradients to 8-bit integers, but both incur a considerable loss of accuracy (greater than ). RangeBN [2] and FP8 training [54] achieve accuracy comparable to full-precision models, but both use floating-point numbers in gradients, which does not help hardware optimization to boost the speed. Besides quantized training, most low-precision training research keeps gradient precision in 16-bit floating-point. Flexpoint [29], MPT [41] and DFP [9] all use 16-bit floating-point to train DNNs with accuracy comparable to full-precision models. For more efficient training of neural networks, INT8 training has more advantages over FP16 training.

3 Unified INT8 Training

In this paper, we aim to build a unified INT8 training framework, which utilizes 8-bit integer arithmetic to accelerate the expensive training process of deep neural networks including both the forward and backward propagation.

3.1 Preliminaries

Symmetric uniform quantization is the most efficient scheme among existing quantization methods, due to its hardware-friendly computation. Therefore, to guarantee the acceleration performance, we build the INT8 training framework on it. Given the data $x$ (i.e., weights, activations, and gradients) lying in the range $(l, u)$ and a clipping value $c \in (0, \max(|l|, |u|)]$, the symmetric uniform quantization can be formulated as:

$$q = \mathrm{round}\left(\frac{\mathrm{clip}(x, c)}{s}\right), \qquad (1)$$

where $\mathrm{clip}(x, c) = \min(\max(x, -c), c)$, $s = \frac{c}{2^{8-1} - 1}$ indicates the scaling factor to project the floating-point number to a fixed-point 8-bit integer, and $q$ represents the quantized fixed-point number. Subsequently, the corresponding dequantized data $\hat{x}$ can be calculated by:

$$\hat{x} = q \cdot s. \qquad (2)$$

Different from most prior studies that mainly focus on speeding up the inference (i.e., the forward propagation), our INT8 training framework attempts to further accelerate the backward propagation during the training stage, by applying quantization to the gradients. Namely, we pursue the quantize-dequantized gradients from full-precision gradients in a proper way.
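The quantize and dequantize steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' kernel: it assumes round-to-nearest, a scaling factor of c / 127, and list inputs, and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(x, c, bits=8):
    """Symmetric uniform quantization of a list of floats with clipping value c."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8
    s = c / qmax                            # scaling factor
    # Clip to [-c, c], project onto the integer grid, round to nearest.
    q = [round(max(-c, min(c, v)) * qmax / c) for v in x]
    return q, s


def dequantize(q, s):
    """Map quantized integers back to floating point."""
    return [v * s for v in q]


x = [0.02, -0.4, 1.3, -2.0, 0.003]
q, s = quantize(x, c=1.0)
x_hat = dequantize(q, s)
print(q)       # values outside [-1, 1] saturate at +/-127, tiny values become 0
print(x_hat)
```

Two effects are visible even on this tiny input: values beyond the clipping value saturate, and values much smaller than the scaling factor collapse to zero. Both matter for gradients, whose range is wide while most entries are tiny.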

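To ensure that the quantized gradients maintain an unbiased expectation compared with the original ones, we adopt stochastic rounding following [16]:

$$\mathrm{round}_s(x) = \begin{cases} \lceil x \rceil, & \text{w.p. } x - \lfloor x \rfloor \\ \lfloor x \rfloor, & \text{w.p. } 1 - (x - \lfloor x \rfloor) \end{cases} \qquad (3)$$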

Unfortunately, although the stochastic rounding technique limits the quantization error to some extent from the statistical view, the perturbation for each training iteration is still inevitable and harmful for convergence, whose reasons will be discussed in the following section.
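The unbiasedness of stochastic rounding, and the per-sample perturbation that remains, can be seen in a small experiment. The sketch below is illustrative only; `stochastic_round` is a hypothetical helper and the sample count is arbitrary.

```python
import math
import random


def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part, otherwise round down."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < x - lo else 0)


rng = random.Random(0)
x = 2.3
samples = [stochastic_round(x, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(sorted(set(samples)))   # every single sample is still off by 0.3 or 0.7
print(mean)                   # the average is close to 2.3
```

The expectation matches the input, but each individual rounding still carries an error, which is the per-iteration perturbation the text refers to.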

(a) the accuracy curve
(b) the loss curve
Figure 2: Crashed training of MobileNetV2 on CIFAR-10 after quantizing gradients to 8-bit.

3.2 Challenges of Gradient Quantization

Gradients determine the direction of optimization and the magnitude of the parameter update, and thus play a critical role in obtaining highly accurate models. In INT8 training, once we apply quantization to gradients, the perturbation introduces deviation into the optimization direction. Once the deviation accumulates to an unacceptable degree, the training process may become unstable and even crash, resulting in severe performance degradation. Figure 2 shows our empirical observation that, for some special network architectures like MobileNetV2, directly quantizing gradients causes a rapid crash of training.

(a) gradients are different from weights and activations
(b) gradients keep evolving during training
(c) gradients of different depths have different patterns
(d) gradients of different structures have different patterns
Figure 3: Distributions of activations, weights and gradients with respect to different layers of MobileNetV2 and training iterations.

To further investigate the essential reasons behind this phenomenon, we conduct detailed analysis on the distribution of gradients during training without gradient quantization, as shown in Figure 3. We surprisingly observe that the gradients own the following unique characteristics:

  • Sharp and Wide (C1). As shown in Figure 3(a), compared to weights and activations, gradients follow an unusual distribution in which most values concentrate around zero while a certain number of extreme values also exist. Therefore, the distribution curve is very sharp, with small values making up the majority of gradients, but the range is relatively wide. With uniform quantization, this causes many gradients to be quantized to zero and makes the quantization error significantly large.

  • Evolutionary (C2). Figure 3(b) depicts how the gradient distribution of the same layer evolves with respect to the training iterations. As training goes on, the shape of the gradient distribution becomes much sharper and narrower, which means it is impossible to fix the quantization settings throughout the training process, as we usually do for weights and activations, for example by assuming the same clipping range during the whole training.

  • Depth-Specific (C3). Figure 3(c) compares the distribution of gradients in different layers. The distributions in the shallow layers are clearly sharper, with larger extreme values, than those in the deeper layers. This means that the preceding layers of deep neural networks often face more severe quantization loss.

  • Structure-Specific (C4). As can be seen in Figure 3(d), the gradients of layers with different structures present clearly different patterns. For MobileNetV2, the second convolutional layer in each block has a depth-wise structure. Its gradients have a larger range and a sharper shape even in the deeper blocks, making MobileNetV2 harder to quantize from the aspect of gradients.

Based on the above observations, we can conclude that the gradients differ from weights and activations largely, which inevitably causes an unstable training, when simply adopting the common quantization techniques for weights and activations. This means that we need certain techniques to take care of distinctiveness in gradient quantization, which brings great challenges to the real and unified INT8 training in practice.
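The effect of a sharp and wide distribution on uniform quantization can be reproduced with synthetic data. The sketch below does not use measured gradients: `weight_like` and `grad_like` are stand-ins chosen for illustration (a moderate Gaussian versus a very narrow Gaussian plus a few large outliers), and the clipping value is simply set to the largest magnitude.

```python
import random


def zero_fraction(x, bits=8):
    """Fraction of values that round to 0 when the clipping value is max|x|."""
    qmax = 2 ** (bits - 1) - 1
    c = max(abs(v) for v in x)
    return sum(1 for v in x if round(v * qmax / c) == 0) / len(x)


rng = random.Random(0)
# Synthetic stand-ins (assumptions, not measured data).
weight_like = [rng.gauss(0.0, 0.05) for _ in range(10_000)]
grad_like = [rng.gauss(0.0, 1e-4) for _ in range(10_000)] + [0.1, -0.08, 0.05]

print("weight-like zeros:  ", zero_fraction(weight_like))
print("gradient-like zeros:", zero_fraction(grad_like))
```

With the range dictated by a handful of extreme values, almost every small entry of the gradient-like sample falls into the zero bin, whereas the weight-like sample loses only a small fraction.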

Before turning to devise the desired techniques considering the speciality of gradients, we first attempt to understand the gradient’s effect on the training stability, by theoretically revealing the connections between training convergence and gradient quantization. This will provide us a reliable clue to build the robust and unified INT8 training framework.

3.3 Stabilize Training: A Theoretical Perspective

As is common in the analysis of deep learning optimizers [12, 28, 48, 39], the ability of convergence is usually evaluated by the regret $R(T)$:

$$R(T) = \sum_{t=1}^{T} \left( f_t(w_t) - f_t(w^*) \right), \qquad (4)$$

where $T$ indicates the number of iterations, $w_t$ is the parameter at time $t$ in the convex compact set $\mathbb{S}$, and $f_t(w_t)$ denotes the corresponding loss function. The optimal parameter is represented by $w^*$. If the average regret $R(T)/T$ approaches zero quickly as $T$ increases, the speed and ability of convergence can be guaranteed.

Due to the complexity of the DCNNs, it is very difficult to directly analyze its behaviors. As the prior studies [1, 34, 21, 59] do, we first make the following assumptions:

Assumption 1.

$f_t$ is convex;

Assumption 2.

$\forall\, w_i, w_j \in \mathbb{S}$, $\|w_i - w_j\|_\infty \le D_\infty$.

Although the convexity assumption may not hold for deep networks, analysis based on this can provide reasonable and valuable insights for us, which has been proved in previous studies [12, 39, 21, 59].

Taking the standard stochastic gradient descent algorithm into consideration, the optimization based on the quantized gradient $\hat{g}_t$ and learning rate $\eta_t$ can be formulated as:

$$w_{t+1} = w_t - \eta_t \hat{g}_t. \qquad (5)$$

Then we have the following theoretical finding (see the supplementary materials for detailed proof):

Theorem 1.

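If we define the error of quantized gradients as $\epsilon_t = g_t - \hat{g}_t$, then with Assumptions 1 and 2, we have:

$$\frac{R(T)}{T} \le \underbrace{\frac{d D_\infty^2}{2 T \eta_T}}_{(1)} + \underbrace{\frac{D_\infty}{T} \sum_{t=1}^{T} \|\epsilon_t\|}_{(2)} + \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{\eta_t}{2} \|\hat{g}_t\|^2}_{(3)}, \qquad (6)$$

where $d$ is the dimension of the parameter, $D_\infty$ is the bound from Assumption 2, and $\eta_t$ is the learning rate at time $t$.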

We can find that the bound of average regret is dominated by three terms. Term (1) approaches zero as $T$ increases and thus can be ignored in gradient quantization. Term (2) indicates that the quantization error of gradients greatly affects the ability to converge, and it is usually large, as analyzed in Section 3.2. For term (3), its magnitude is mainly influenced by the learning rate and the l2-norm of quantized gradients. Based on the theoretical analysis, to stabilize INT8 training, we have two basic principles for designing better quantization techniques: (1) reduce the quantization error of gradients; (2) scale down the learning rate. Both are intuitive: on the one hand, a lower quantization error means a smaller deviation of the optimization direction and thus avoids the training crash; on the other hand, it is common sense that gradually decreasing the learning rate promises a better solution in the optimization.
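The role of the three terms can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in the paper: a one-dimensional convex loss f(w) = 0.5 (w - w*)^2, a constant learning rate, stochastic rounding to the INT8 grid, and a bound of the standard online-gradient-descent form (an initial-distance term, a quantization-error term and a step-size term), with the diameter taken as the largest distance to the optimum observed along the trajectory. All helper names are hypothetical.

```python
import math
import random


def quantize_dequantize(g, c, rng, bits=8):
    """Clip to [-c, c], stochastically round to the INT8 grid, map back to float."""
    qmax = 2 ** (bits - 1) - 1
    y = max(-c, min(c, g)) * qmax / c
    lo = math.floor(y)
    q = lo + (1 if rng.random() < y - lo else 0)
    return q * c / qmax


def run(lr, c, steps=200, seed=0):
    """SGD with quantized gradients on f(w) = 0.5 * (w - w_star)^2."""
    rng = random.Random(seed)
    w, w_star = 0.0, 1.0
    regret = err_sum = step_sum = diam = 0.0
    for _ in range(steps):
        g = w - w_star                           # exact gradient
        g_hat = quantize_dequantize(g, c, rng)   # quantize-dequantized gradient
        regret += 0.5 * (w - w_star) ** 2        # f_t(w_t) - f_t(w*)
        err_sum += abs(g - g_hat)                # feeds term (2)
        step_sum += 0.5 * lr * g_hat ** 2        # feeds term (3)
        diam = max(diam, abs(w - w_star))
        w -= lr * g_hat
    term1 = diam ** 2 / (2 * steps * lr)
    term2 = diam * err_sum / steps
    term3 = step_sum / steps
    return regret / steps, term1 + term2 + term3


avg_regret, bound = run(lr=0.1, c=1.0)
print(avg_regret, bound)
```

In this setting the average regret stays below the sum of the three terms, and both a larger quantization error and a larger learning rate loosen the bound, which matches the two design principles stated above.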

Now with the design principles, the question is how to devise the universal techniques for INT8 training, meanwhile take the characteristics of gradients into consideration. We respectively present two novel techniques: Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling, which together lower the average regret bound and guarantee stable INT8 training.

3.4 Direction Sensitive Gradient Clipping

Considering the basic operation $z = W^\top a$ in deep neural networks, the gradients of weights $g_W$ can be calculated from the activation $a$ and the activation gradient $g_z$ as $g_W = a\, g_z^\top$. From this aspect, the quantization error of $g_W$ in (6) mainly stems from that of the activation gradients $g_z$. Therefore, in our INT8 training we mainly concern ourselves with the quantization of $g_z$, which helps control the error of quantized gradients in (6). For simplicity of notation, in the following discussion we directly use $g$ to denote $g_z$.

To minimize quantization error, previous works mainly seek the optimal clipping value in (1) by assuming a certain data distribution, e.g., the Gaussian distribution [3, 4, 19, 2, 21, 11]. However, according to the gradient characteristics C1 (sharp and wide) and C2 (evolutionary) that we discover, it is impractical to make a common assumption for an evolutionary and unusual gradient distribution. To further prove this point, we perform the Kolmogorov–Smirnov test with distribution parameters solved by maximum likelihood estimation, and report in Table 1 the KS-statistics, which consistently reject the assumption that gradients obey any common distribution.


Data                Gaussian   Laplace   Student   Critical value
layer0  gradient    0.1934     0.0790    0.2005    0.0012
layer0  weight      0.0391     0.0721    0.1011    0.0765
layer8  gradient    0.2061     0.1091    0.2303    0.0024
layer8  weight      0.0294     0.0569    0.1084    0.0110
Table 1: KS-statistics of gradient and weight with respect to different layers’ conv3 in MobileNetV2. The last column indicates the maximum value at which the hypothesis can be accepted at a significance level of 0.05.
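A Kolmogorov–Smirnov check of this kind can be sketched with the standard library alone. The code below is an illustration under assumptions: it tests only the Gaussian hypothesis, fits the parameters by maximum likelihood, uses the common asymptotic critical value 1.358 / sqrt(n) for a significance level of 0.05, and runs on synthetic samples rather than on real gradients. The function names are hypothetical.

```python
import math
import random


def ks_statistic_gaussian(x):
    """KS distance between the empirical CDF of x and a Gaussian fitted by MLE."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # MLE uses 1/n
    d = 0.0
    for i, v in enumerate(sorted(x)):
        cdf = 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
        # Compare the fitted CDF with the empirical CDF just after and just before v.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def critical_value(n, coeff=1.358):
    """Asymptotic two-sided critical value at significance level 0.05."""
    return coeff / math.sqrt(n)


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Synthetic "sharp and wide" sample: mostly tiny values plus a few large ones.
spiky = [rng.gauss(0.0, 0.01) for _ in range(4900)] + [rng.gauss(0.0, 1.0) for _ in range(100)]

for name, data in (("gaussian", gaussian), ("spiky", spiky)):
    print(name, ks_statistic_gaussian(data), critical_value(len(data)))
```

A statistic above the critical value rejects the hypothesis, so the spiky sample is rejected while the genuinely Gaussian one is not. Because the critical value shrinks with the sample size, large tensors get very small thresholds, which is consistent with the small critical values in the gradient rows of Table 1.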

To find the optimal clipping value without any assumption, a straightforward idea is to keep the quantized gradient consistent with the original one by gradient descent algorithm. Usually, one can model the consistency using the popular mean-square error (MSE). Unfortunately, due to characteristics C2 and C3 of gradients with huge discrepancy and fluctuation in their magnitudes, MSE makes the optimization vulnerable and unable to work under the same simple setting across various layers.

Therefore, to pursue the desired clipping values of different layers that promise stable training, we choose the cosine distance to guide the learning of clipping values, which not only avoids the negative effect of the varied gradient magnitudes, but also keeps the network optimization directions consistent:

$$d_c = 1 - \cos(\langle g, \hat{g} \rangle) = 1 - \frac{g \cdot \hat{g}}{|g| \cdot |\hat{g}|}, \qquad (7)$$

where $g$ and $\hat{g}$ denote the original floating-point gradient and its quantize-dequantized counterpart.
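The reason for preferring cosine distance over MSE can be made concrete: cosine distance depends only on direction, whereas MSE grows with the square of the gradient magnitude. The sketch below uses made-up vectors and hypothetical helper names, and it assumes neither vector is all zeros.

```python
import math


def cosine_distance(g, g_hat):
    """1 - cos of the angle between g and g_hat (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(g, g_hat))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in g_hat))
    return 1.0 - dot / norm


def mse(g, g_hat):
    return sum((a - b) ** 2 for a, b in zip(g, g_hat)) / len(g)


g = [0.3, -0.1, 0.05, 0.0, -0.2]
g_hat = [0.25, -0.125, 0.0, 0.0, -0.25]   # an illustrative quantized version

k = 1000.0                                 # same pair, 1000x larger magnitude
g_big = [k * v for v in g]
g_hat_big = [k * v for v in g_hat]

print(cosine_distance(g, g_hat), cosine_distance(g_big, g_hat_big))
print(mse(g, g_hat), mse(g_big, g_hat_big))
```

The cosine distance is unchanged by the rescaling, while the MSE grows by a factor of one million, so a single MSE-based setting cannot serve layers and iterations whose gradient magnitudes differ by orders of magnitude.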

The cosine distance measures the direction deviation of quantized gradients. As shown in Figure 4, when the cosine distance $d_c$ increases to a certain level, the whole training crashes. There is a strong correlation between $d_c$ and training stability, which shows that cosine distance can effectively reflect the influence of gradient quantization on convergence. By minimizing the deviation, we subsequently reduce term (2) in (6). Figure 5(a) shows the quantization error using different clipping values, where there exists an optimal clipping value that substantially reduces the cosine distance.
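The existence of an optimal clipping value can be reproduced on synthetic data. The sketch below swaps in a plain grid search for the gradient-descent-based learning of the clipping value described in the text, uses round-to-nearest instead of stochastic rounding so that the result is deterministic, and builds a synthetic "sharp and wide" gradient (many tiny values plus one large outlier). All of these choices and the helper names are assumptions made for illustration.

```python
import math
import random


def quantize_dequantize(g, c, bits=8):
    """Clip to [-c, c], round to the nearest INT8 level, and map back to float."""
    qmax = 2 ** (bits - 1) - 1
    return [round(max(-c, min(c, v)) * qmax / c) * c / qmax for v in g]


def cosine_distance(g, g_hat):
    dot = sum(a * b for a, b in zip(g, g_hat))
    norm = math.sqrt(sum(a * a for a in g) * sum(b * b for b in g_hat))
    return 1.0 - dot / norm


rng = random.Random(0)
# Synthetic gradient (an assumption, not measured data).
g = [rng.gauss(0.0, 1e-3) for _ in range(200_000)] + [0.3]

c_max = max(abs(v) for v in g)
candidates = [c_max * f for f in (1.0, 0.5, 0.33, 0.2, 0.1)]
distances = {c: cosine_distance(g, quantize_dequantize(g, c)) for c in candidates}
best_c = min(distances, key=distances.get)

for c in candidates:
    print(f"c = {c:.3f}  d_c = {distances[c]:.4f}")
print("best clipping value in the grid:", round(best_c, 3))
```

Clipping at the maximum wastes resolution on the outlier and zeroes out much of the bulk, while clipping too aggressively distorts the outlier itself, so the smallest direction deviation is found at an intermediate value.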

(a) the accuracy curve
(b) the loss curve
Figure 4: Model crashes when the cosine distance $d_c$ exceeds limits.

3.5 Deviation Counteractive Learning Rate Scaling

The theoretical analysis on convergence ability of quantized training indicates the necessity of scaling down learning rate, since the quantization error of gradients cannot vanish completely. To validate this point, we decrease the learning rate of the original crashed training of MobileNetV2 mentioned in Section 3.2 and find that it defers and even eliminates the crash with an extremely low learning rate, although facing a performance degradation (see the red, green and orange lines in Figure 5(b)).

(a) effect of clipping
(b) effect of scaling strategies
Figure 5: The effect of clipping and learning rates on INT8 training. Panel (a) marks the optimal clipping value. In (b), one curve uses an initial learning rate of 0.1 with scaling, and the other three use initial learning rates of 0.01, 0.05 and 0.1, respectively, without scaling.

Since the gradients are propagated backward layer by layer, a minor gradient deviation accumulates exponentially after massive multiplication and addition operations. To address this issue, we further propose the Deviation Counteractive Learning Rate Scaling, which balances out the error by exponentially decaying the learning rate according to the degree of direction deviation $d_c$. The scaling function is formulated as:

$$\phi(d_c) = \max(e^{-\alpha d_c}, \beta), \qquad (8)$$

where $\alpha$ controls the decay degree and $\beta$ limits the lower bound of scaling.

This scaling function generates a factor to scale down the original full-precision learning rate. We empirically find that the self-adapting scaling function performs well in a layer-wise way, adaptively adjusting the learning rate according to the direction deviations in different layers. This counteracts the undesired effects of the gradient deviations across layers, and exactly addresses the challenges of the depth-specific and structure-specific patterns as observed in characteristics C3 and C4 in Section 3.2. The blue line in Figure 5(b) demonstrates that the training equipped with scaling achieves higher accuracy than the manually adjusted ones (tested with MobileNetV2 on CIFAR-10).
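A layer-wise use of the scaling factor can be sketched as follows. The code assumes the scaling function is an exponential decay in the direction deviation with a lower bound, max(exp(-alpha * d_c), beta), with the decay degree 20 and lower bound 0.1 used in the experiments. The layer names and deviation values are hypothetical.

```python
import math


def lr_scale(d_c, alpha=20.0, beta=0.1):
    """Scaling factor: exponential decay in d_c, floored at beta."""
    return max(math.exp(-alpha * d_c), beta)


base_lr = 0.1
# Hypothetical per-layer direction deviations (cosine distances).
layer_deviation = {"conv1": 0.01, "conv2_depthwise": 0.08, "conv3": 0.3}
layer_lr = {name: base_lr * lr_scale(d) for name, d in layer_deviation.items()}

for name, lr in layer_lr.items():
    print(f"{name}: d_c = {layer_deviation[name]:.2f}  lr = {lr:.4f}")
```

Layers whose quantized gradients deviate more take smaller steps, and the lower bound keeps strongly deviating layers from stalling completely.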

Figure 6: Flexible INT8 convolutional layer replacement.

3.6 General Purpose Training Framework

Period                  1       10      100     1000
Average time (s/iter)   1.006   0.364   0.301   0.297
Table 2: Overhead reduced with Periodic Update (on ResNet-50).

In addition to ensuring the stable and accurate convergence, in practice our unified INT8 training framework should also satisfy the following three features:

(1) Easy to plug into any DCNN architecture. To realize this, we implement an automatic match and replacement mechanism in PyTorch [46] that substitutes convolutional and fully-connected layers with their 8-bit counterparts. The whole workflow, including both forward and backward passes, is shown in Figure 6.

(2) No excessive extra computational overhead. To avoid the extra time cost of calculating clipping value, we design a Periodic Update method to optimize the clipping value periodically. As we can see in Table 2, the Periodic Update method dramatically reduces the computational overhead of optimizing the clipping value.
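The bookkeeping behind a periodic update can be sketched in a few lines. This is not the authors' implementation: `PeriodicClipping` is a hypothetical wrapper, and `max_abs` is a placeholder for whatever procedure actually optimizes the clipping value.

```python
class PeriodicClipping:
    """Re-optimizes the clipping value only every `period` iterations."""

    def __init__(self, period, optimize_fn):
        self.period = period
        self.optimize_fn = optimize_fn   # hypothetical: gradient -> clipping value
        self.clip = None
        self.updates = 0

    def step(self, iteration, grad):
        # Pay the optimization cost only on the first call and on period boundaries.
        if self.clip is None or iteration % self.period == 0:
            self.clip = self.optimize_fn(grad)
            self.updates += 1
        return self.clip


def max_abs(grad):
    # Placeholder optimizer (an assumption): use the largest magnitude.
    return max(abs(v) for v in grad)


clipper = PeriodicClipping(period=100, optimize_fn=max_abs)
for it in range(1000):
    clipper.step(it, [0.001 * (it + 1), -0.5])
print(clipper.updates)   # 10 optimizations instead of 1000
```

Between updates the cached value is reused, so the cost of optimizing the clipping value is amortized over the period. That is in line with the drop from 1.006 s to about 0.3 s per iteration in Table 2 once the period reaches 100.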

(3) Easy to implement on off-the-shelf hardware. To validate this potential, we utilize the DP4A instruction (8-bit integer 4-element vector dot product) on low-end NVIDIA Pascal GPUs to implement efficient 8-bit kernels for calculating gradients. To the best of our knowledge, we are the first to achieve practical acceleration of INT8 training including the backward propagation. The detailed speedup will be reported and discussed in Section 4.4.
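What a DP4A-based kernel computes can be emulated in plain Python: each step multiplies two groups of four 8-bit integers and adds the sum to a wider integer accumulator, and the floating-point result is recovered by a single rescale at the end. The sketch below only emulates the arithmetic and says nothing about the actual GPU kernels. The operand values and scaling factors are made up for illustration.

```python
def dp4a(a4, b4, acc):
    """Emulates one DP4A step: acc + dot product of two 4-element int8 vectors."""
    assert len(a4) == len(b4) == 4
    assert all(-128 <= v <= 127 for v in a4 + b4)
    return acc + sum(x * y for x, y in zip(a4, b4))


def int8_dot(a, b):
    """Dot product of two int8 vectors, accumulated in chunks of four."""
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0   # a 32-bit accumulator on hardware; Python integers do not overflow
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc


# Quantized operands with their scaling factors (illustrative values).
a_q, s_a = [127, -64, 3, 0, 12, -7, 90, -127], 0.01
b_q, s_b = [5, 5, -20, 33, -1, 2, 4, 8], 0.002

acc = int8_dot(a_q, b_q)
result = acc * s_a * s_b   # one floating-point rescale at the end
print(acc, result)
```

All multiply-accumulate work happens on integers, and floating point is touched only once per output value.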


4 Experiments

We conduct extensive experiments to demonstrate that our proposed framework is unified for various network structures on popular image classification and object detection tasks with state-of-the-art accuracy, and meanwhile it can be easily deployed on the mainstream devices (NVIDIA Pascal GPU) with satisfactory speedup, compared to full-precision training.

4.1 Ablation Study

Settings. We first conduct the ablation study on the CIFAR-10 dataset with MobileNetV2 [51], to validate the effectiveness of the proposed techniques. We use the cosine scheduler [1] with the initial learning rate set to 0.1 for all experiments. In the Periodic Update experiment, the decay degree α and the lower bound β in learning rate scaling are set to 20 and 0.1, respectively.

Direction Sensitive Gradient Clipping. Figure 7(a) shows the cosine distance with respect to the training steps. We can observe that conv2 (the second convolutional layer) of each block owns a much larger cosine distance than other layers of the block most of the time. This is consistent with C4 that the gradients of conv2 own sharper shape, indicating that our cosine distance can well reflect the gradient characteristics.

Moreover, as Table 3 lists, our proposed direction sensitive gradient clipping technique indeed prevents INT8 training from crashing, which proves the fact that optimizing a clipping value of gradients to minimize direction deviation can certainly ensure a stable INT8 training.

(a) the cosine distance
(b) the accuracy curve
Figure 7: Analysis of cosine distance and learning rate scaling function.

Clipping method                          Accuracy (%)
No clipping                              NaN
Direction Sensitive Gradient Clipping    93.02
Table 3: Ablation study on clipping method for INT8 training.

Deviation Counteractive Learning Rate Scaling. We evaluate three forms of learning rate scaling strategies without clipping, so that the variable is controlled for a reasonable comparison. The results shown in Figure 7(b) reveal that the linear and quadratic forms are too weak to keep the optimization direction within the convergence boundary, and the model crashes during training. Compared with the linear and quadratic forms, scaling with the exponential form is more powerful in counteracting the direction deviation and prevents the optimization from stepping out of the convergence boundary. We further explore its sensitivity to the selection of hyperparameters in Table 4, and we can see that different settings of α and β achieve similar accuracy, which shows the stability of our Deviation Counteractive Learning Rate Scaling.

α (decay degree)   10      10      20      20
β (lower bound)    0.1     0.2     0.1     0.2
Accuracy (%)       92.82   93.28   93.38   93.27
Table 4: Comparison of different hyperparameters for learning rate scaling.

Periodic Update for clipping value. To reduce the extra computational overhead, we increase the period to update clipping value and find that it brings little hurt to the accuracy, as shown in Table 5. This empirical conclusion brings possibilities for the practical acceleration of INT8 training. Besides, here we apply both gradient clipping and learning rate scaling, and obtain better performance (see that with period 1) than those in Table 3 and 4. This further verifies the positive effects of the two general techniques.

Period         1       10      100     1000
Accuracy (%)   93.66   93.07   93.38   92.75
Table 5: Ablation study on update period.

4.2 Image Classification

Now we consider the popular image classification task, which most prior studies choose to evaluate quantization performance. We experiment with AlexNet [32], ResNet [18], MobileNetV2 [51] and InceptionV3 [52] on CIFAR-10 [31] and ImageNet (ILSVRC2012) [10]. The CIFAR-10 dataset contains a training set of 50K images and a testing set of 10K images. Each image is of size 32×32 and belongs to one of 10 classes. ImageNet (ILSVRC2012) consists of 1.2 million training images and 50K test images with 1000 classes.

Settings. As for the hyperparameters of ResNet, we use the same settings described in [18]. For other neural networks, we use the cosine scheduler [1] with the initial learning rate set to 0.1. The decay degree α and the lower bound β in learning rate scaling are set to 20 and 0.1, respectively. The clipping value is updated every 100 iterations for all experiments.

CIFAR-10. As Table 6 shows, our method achieves accuracy on ResNet-20 comparable to FP8 training, but takes much less memory and computation thanks to the fixed-point operation. Moreover, our method performs surprisingly well on MobileNetV2 (a 1.01% accuracy drop) and InceptionV3 (even better than the full-precision model).

ImageNet. Table 7 lists existing state-of-the-art quantized training methods including WAGE [56], WAGEUBN [58] and FP8 training [54]. For AlexNet INT8 training, our method obtains a 5.84% improvement over DoReFa-Net [62]. Free from the extra overhead like , our method enjoys higher efficiency than DoReFa-Net. As for the 2-bit weight and 8-bit activation/gradient case, we significantly outperform WAGE with about a 3% accuracy gain. What’s more, equipped with our method, INT8 training for the ResNet architecture achieves almost no performance degradation, which none of the previous studies has done. Compared with the FP8 training method, our method improves the accuracy by about 2.3% (69.67% versus 67.34% on ResNet-18). It should be noted that we can directly get a real speedup on popular off-the-shelf devices, while methods like FP8 training need specially designed hardware, which means that our framework is more general for unified training acceleration.

As analyzed in [36], the convolutional layer occupies most of the training time, while other layers like BatchNorm and ReLU are not computation-intensive. Therefore, we currently focus on quantizing convolutional layers and do not quantize the BatchNorm layer as RangeBN [2] and WAGEUBN [58] do. Even so, there is still a significant speedup for INT8 training. In addition, we obtain accuracy comparable to full-precision training, much higher than RangeBN and WAGEUBN.

Networks using INT8 training for the first time. To the best of our knowledge, we are the first to quantize the gradients of MobileNetV2, which is known to be difficult in this community. Our method performs very well on both the CIFAR-10 and ImageNet datasets using MobileNetV2, with only around 1% accuracy loss. We also try INT8 training on InceptionV3 for the first time, and achieve accuracy comparable to the full-precision model. Note that for InceptionV3 on CIFAR-10, our INT8 training method even achieves better performance than the full-precision model.

Model         Method              Bit-width (W/A/G)   Accuracy (%)
ResNet-20     FP                  32/32/32            92.32
              FP8 training [54]   8/8/8               92.21
              Ours                8/8/8               91.95
MobileNetV2   FP                  32/32/32            94.39
              Ours                8/8/8               93.38
InceptionV3   FP                  32/32/32            94.89
              Ours                8/8/8               95.00
Table 6: Results on CIFAR-10 dataset.

Model         Method              Bit-width (W/A/G)   Accuracy (%)
AlexNet       FP                  32/32/32            59.84
              DoReFa-Net [62]     8/8/8               53.00
              Ours                8/8/8               58.84
              WAGE [56]           2/8/8               48.40
              Ours                2/8/8               51.28
ResNet-18     FP                  32/32/32            70.30
              WAGEUBN [58]        8/8/8               66.92
              FP8 training [54]   8/8/8               67.34
              Ours                8/8/8               69.67
ResNet-34     FP                  32/32/32            73.68
              WAGEUBN [58]        8/8/8               68.50
              Ours                8/8/8               73.29
ResNet-50     FP                  32/32/32            76.60
              WAGEUBN [58]        8/8/8               69.07
              Ours                8/8/8               76.34
MobileNetV2   FP                  32/32/32            72.39
              Ours                8/8/8               71.20
InceptionV3   FP                  32/32/32            72.39
              Ours                8/8/8               71.20
Table 7: Results on ImageNet dataset.

4.3 Object Detection

To prove the versatility of our method, we further conduct experiments with the popular object detection networks including Faster-RCNN [49], RFCN [8] and RetinaNet [37] on two widely used datasets: PASCAL VOC [14] and COCO [38]. The PASCAL VOC dataset consists of 11k images with 20 classes. The COCO dataset contains more than 20k images and 80 object categories. Note that we are the first to successfully achieve INT8 training on the object detection task.

Settings. As for the hyperparameters, we follow the same rules described in [35]. The decay degree α and the lower bound β for learning rate scaling are the same as those used in the image classification task.

PASCAL VOC. We test RFCN and Faster R-CNN with different backbones, and find that quantized training equipped with our method only suffers a very slight detection accuracy (mAP) drop. The result of RFCN shows that even for a deeper backbone such as ResNet-101, our INT8 training still maintains almost the same accuracy as full-precision.

COCO. On the large-scale COCO dataset, we experiment with RetinaNet (one-stage) and Faster R-CNN (two-stage). Our method performs stably, with at most 1.8% mAP degradation on both networks. We find that RetinaNet incurs a higher mAP loss than Faster R-CNN, which is inconsistent with the conclusions in the previous study [35]. This may be caused by the fact that the focal loss used in the one-stage detector is more sensitive to gradient quantization.

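| Model | Backbone | Method | Bit-width (W/A/G) | mAP (%) |
|---|---|---|---|---|
| Faster R-CNN | ResNet-50 | FP | 32/32/32 | 82.0 |
| | ResNet-50 | Ours | 8/8/8 | 81.9 |
| RFCN | ResNet-101 | FP | 32/32/32 | 80.8 |
| | ResNet-101 | Ours | 8/8/8 | 79.1 |

Table 8: Results on PASCAL VOC Dataset.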
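| Model | Backbone | Method | Bit-width (W/A/G) | mAP (%) |
|---|---|---|---|---|
| Faster R-CNN | ResNet-50 | FP | 32/32/32 | 36.2 |
| | ResNet-50 | Ours | 8/8/8 | 34.95 |
| RetinaNet | ResNet-50 | FP | 32/32/32 | 36.9 |
| | ResNet-50 | Ours | 8/8/8 | 35.1 |

Table 9: Results on COCO Dataset.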
Figure 8: INT8 convolution speedup on GPU, where Y-axis indicates (input shape), (kernel number, kernel size) of convolution.
| Precision | Forward (s) | Backward (s) | Iteration (s) |
|---|---|---|---|
| FP32 (cuDNN) | 0.117 | 0.221 | 0.360 |
| INT8 (ours) | 0.101 | 0.171 | 0.293 |

Table 10: End-to-end average time for a round of INT8 training (tested with ResNet-50 on GeForce GTX 1080 Ti, batch size 64).

4.4 Speed Result on NVIDIA GPU

No existing library directly supports complete INT8 training. We therefore implement it ourselves on an NVIDIA Pascal GPU using the DP4A instruction to verify the acceleration of our method. The speedup of each convolutional layer in ResNet-50 is shown in Figure 8. With our solution, INT8 brings an average 1.63× speedup in the forward process and a higher 1.94× speedup in the backward process. Table 10 further reports the time consumption and speed improvement of each training round. Even if we only replace the FP32 convolutional layers with the slightly optimized INT8 ones, the training time for ResNet-50 can be reduced by about 22%.
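As a quick reading aid for Table 10, the ratios below are computed directly from the reported times. Nothing is assumed beyond the table entries. The labels are introduced here only for illustration.

```latex
\begin{align*}
\text{forward:}   \quad & 0.117 / 0.101 \approx 1.16\times \\
\text{backward:}  \quad & 0.221 / 0.171 \approx 1.29\times \\
\text{iteration:} \quad & 0.360 / 0.293 \approx 1.23\times
\qquad \bigl(1 - 0.293 / 0.360 \approx 18.6\%\ \text{less time per iteration}\bigr)
\end{align*}
```

Read this way, the end-to-end gain corresponds to roughly 1.23 times as many iterations per unit time. It is lower than the per-layer convolution speedups, which is consistent with only the convolutional layers having been replaced.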

5 Conclusions

In this paper, we attempt to build an INT8 training framework for common DCNNs. We identify four distinctive characteristics of gradients and derive two theoretical principles for stable training from the convergence bound. Based on these, we propose Direction Sensitive Gradient Clipping and Deviation Counteractive Learning Rate Scaling. Extensive experiments demonstrate the versatility of our method across various networks and tasks. We reduce the training time by 22% on a Pascal GPU with only trivial optimization. If each layer is sufficiently optimized, training will achieve a higher speedup and lower memory consumption. We hope our first successful attempt can help lead the community towards fully unified INT8 training.


  • [1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems 30, pp. 1709–1720. Cited by: §3.3, §4.1, §4.2.
  • [2] R. Banner, I. Hubara, E. Hoffer, and D. Soudry (2018) Scalable methods for 8-bit training of neural networks. External Links: 1805.11046 Cited by: §1, §2, §3.4, §4.2.
  • [3] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry (2018) Post-training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723. Cited by: §1, §3.4.
  • [4] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017-07) Deep learning with low precision by half-wave gaussian quantization. In CVPR, Cited by: §3.4.
  • [5] Y. Chen, H. Lan, Z. Du, S. Liu, J. Tao, D. Han, T. Luo, Q. Guo, L. Li, Y. Xie, et al. (2019) An instruction set architecture for machine learning. ACM Transactions on Computer Systems (TOCS) 36 (3), pp. 9. Cited by: §1.
  • [6] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1.
  • [7] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363. Cited by: §2.
  • [8] J. Dai, Y. Li, K. He, and J. Sun (2016-12) R-fcn: object detection via region-based fully convolutional networks. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Cited by: §4.3.
  • [9] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, A. Heinecke, P. Dubey, J. Corbal, N. Shustrov, R. Dubtsov, E. Fomenko, and V. Pirogov (2018-05) Mixed precision training of convolutional neural networks using integer operations. In ICLR, Cited by: §1, §2.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009-07) ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
  • [11] R. Ding, T. Chin, Z. Liu, and D. Marculescu (2019-06) Regularizing activation distribution for training binarized deep networks. In CVPR, Cited by: §2, §3.4.
  • [12] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §3.3, §3.3.
  • [13] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. Cited by: §1.
  • [14] M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman (2010-06) The pascal visual object classes (voc) challenge. Int. J. Comput. Vision. Cited by: §4.3.
  • [15] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019-10) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In ICCV, Cited by: §1.
  • [16] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015-07–09 Jul) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1737–1746. Cited by: §3.1.
  • [17] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016-06) EIE. ACM SIGARCH Computer Architecture News 44 (3), pp. 243–254. External Links: ISSN 0163-5964, Link, Document Cited by: §1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. CVPR. External Links: ISBN 9781467388511, Link, Document Cited by: §4.2, §4.2.
  • [19] Z. He and D. Fan (2019-06) Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. In CVPR, Cited by: §3.4.
  • [20] L. Hou and J. T. Kwok (2018-05) Loss-aware weight quantization of deep networks. In ICLR, Cited by: §1.
  • [21] L. Hou, R. Zhang, and J. T. Kwok (2019-05) Analysis of quantized models. In ICLR, Cited by: §3.3, §3.3, §3.4.
  • [22] Huawei Technologies Co., Ltd. Ascend 310. Note: Cited by: §1.
  • [23] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018-06) Quantization and training of neural networks for efficient integer-arithmetic-only inference. 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538664209, Link, Document Cited by: §1.
  • [24] jcjohnson (2016) Cnn-benchmarks. GitHub. Note: Cited by: §1.
  • [25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In Proceedings of the 22Nd ACM International Conference on Multimedia, MM ’14, New York, NY, USA. Cited by: §5, §6.3.1.
  • [26] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §1.
  • [27] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019-06) Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, Cited by: §1.
  • [28] D. P. Kingma and J. Ba (2015-05) Adam: A method for stochastic optimization. In ICLR, Cited by: §3.3.
  • [29] U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao (2017-12) Flexpoint: an adaptive numerical format for efficient training of deep neural networks. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Cited by: §1, §2.
  • [30] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1.
  • [31] A. Krizhevsky, V. Nair, and G. Hinton (2014) The cifar-10 dataset. online: http://www. cs. toronto. edu/kriz/cifar. html, pp. 4. Cited by: §4.2.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12. Cited by: §4.2.
  • [33] P. S. Kumar Chellapilla (2006) High performance convolutional neural networks for document processing.. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule, France. Cited by: §6.3.1.
  • [34] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein (2017) Training quantized nets: a deeper understanding. In Advances in Neural Information Processing Systems 30, pp. 5811–5821. Cited by: §3.3.
  • [35] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, and R. Fan (2019-06) Fully quantized network for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3, §4.3.
  • [36] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng (2016) Performance analysis of gpu-based convolutional neural networks. In 2016 45th International Conference on Parallel Processing (ICPP), pp. 67–76. Cited by: §4.2.
  • [37] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017-10) Focal loss for dense object detection. In ICCV, Cited by: §4.3.
  • [38] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), pp. 740–755. Cited by: §4.3.
  • [39] L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019-05) Adaptive gradient methods with dynamic bound of learning rate. In ICLR, Cited by: §3.3, §3.3.
  • [40] J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha (2018) Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191. Cited by: §2.
  • [41] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wuv (2018-05) Mixed precision training. In ICLR, Cited by: §1, §2.
  • [42] A. Mishra and D. Marr (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852. Cited by: §1.
  • [43] NVIDIA Corporation cuDNN Documentation. Note: Cited by: §6.3.1, §6.3.1.
  • [44] NVIDIA Corporation PTX ISA. Note: Cited by: §6.3.1, §6.3.1.
  • [45] S. K. Park and K. W. Miller Random number generators: good ones are hard to find. Commun. ACM. Cited by: §6.3.2.
  • [46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §3.6.
  • [47] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-net: imagenet classification using binary convolutional neural networks. Lecture Notes in Computer Science, pp. 525–542. External Links: ISBN 9783319464930, ISSN 1611-3349, Link, Document Cited by: §2.
  • [48] S. J. Reddi, S. Kale, and S. Kumar (2018-05) On the convergence of adam and beyond. In ICLR, Cited by: §3.3.
  • [49] S. Ren, K. He, R. Girshick, and J. Sun (2015-12) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Cited by: §4.3.
  • [50] C. Sakr and N. Shanbhag (2019-05) Per-tensor fixed-point quantization of the back-propagation algorithm. In ICLR, Cited by: §1.
  • [51] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §4.1, §4.2.
  • [52] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016-06) Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2.
  • [53] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2018) HAQ: hardware-aware automated quantization. arXiv preprint arXiv:1811.08886. Cited by: §1, §2.
  • [54] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018-12) Training deep neural networks with 8-bit floating point numbers. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, Cited by: §1, §2, §4.2, Table 6, Table 7.
  • [55] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng (2018-06) Two-step quantization for low-bit neural networks. IEEE CVPR. Cited by: §1.
  • [56] S. Wu, G. Li, F. Chen, and L. Shi (2018-05) Training and inference with integers in deep neural networks. In ICLR, Cited by: §1, §2, §4.2, Table 7.
  • [57] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X. Hua (2019-06) Quantization networks. In CVPR, Cited by: §1.
  • [58] Y. Yang, S. Wu, L. Deng, T. Yan, Y. Xie, and G. Li (2019) Training high-performance and large-scale deep neural networks with full 8-bit integers. External Links: 1909.02384 Cited by: §1, §2, §4.2, §4.2, Table 7.
  • [59] P. Yin, S. Zhang, J. Lyu, S. Osher, Y. Qi, and J. Xin (2019) Blended coarse gradient descent for full quantization of deep neural networks. Research in the Mathematical Sciences 6 (1), pp. 14. Cited by: §3.3, §3.3.
  • [60] D. Zhang, J. Yang, D. Ye, and G. Hua (2018-09) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In ECCV, Cited by: §2.
  • [61] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
  • [62] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160. External Links: Link Cited by: §1, §1, §2, §4.2, Table 7.

6 Supplementary Material

6.1 Proof of Theorem 1

Assumption 1.

is convex;

Assumption 2.



Considering the update for th entry of weight,


we have


Rearrange the equation, and divide on both side as is none-zero,


The error of quantized gradients is defined as


Replace in the (11) with and , and we can get that


According to assumption 1,


So combine the (13) and (14), sum over the dimensions of and the iterations, then the regret


Combine (15) with the assumption 2, and we can further relax the above (15) to


Assume that all layers have the same learning rate, then


Based on Cauchy’s inequality and assumption 2, we finally get


Thus the average regret
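The steps of this proof can be sketched as follows. This is a hedged reconstruction of a standard online-gradient regret argument, not the original equations, and all notation is an assumption: $f_t$ is the loss at iteration $t$, $\mathbf{w}_t \in \mathbb{R}^d$ the weights, $\mathbf{w}^*$ the best fixed weights in the feasible set $\mathbb{S}$, $\mathbf{g}_t$ and $\hat{\mathbf{g}}_t$ the full-precision and quantized gradients, $\eta_{t,i} > 0$ a learning rate that is non-increasing in $t$, and $D_\infty$ a bound such that $\lVert \mathbf{w}_m - \mathbf{w}_n \rVert_\infty \le D_\infty$ on $\mathbb{S}$. The constant in front of the error term depends on the norm chosen, so it may differ from the statement of Theorem 1.

```latex
\begin{align*}
% update of the i-th entry and its squared distance to the optimum
w_{t+1,i} &= w_{t,i} - \eta_{t,i}\,\hat{g}_{t,i}, \\
(w_{t+1,i} - w^*_i)^2 &= (w_{t,i} - w^*_i)^2
   - 2\eta_{t,i}\,\hat{g}_{t,i}\,(w_{t,i} - w^*_i) + \eta_{t,i}^2\,\hat{g}_{t,i}^2, \\
% rearrange and divide by 2*eta
\hat{g}_{t,i}\,(w_{t,i} - w^*_i) &=
   \frac{(w_{t,i} - w^*_i)^2 - (w_{t+1,i} - w^*_i)^2}{2\eta_{t,i}}
   + \frac{\eta_{t,i}}{2}\,\hat{g}_{t,i}^2, \\
% quantization error, then substitute g-hat = g - epsilon on the left
\boldsymbol{\epsilon}_t &= \mathbf{g}_t - \hat{\mathbf{g}}_t, \\
g_{t,i}\,(w_{t,i} - w^*_i) &=
   \frac{(w_{t,i} - w^*_i)^2 - (w_{t+1,i} - w^*_i)^2}{2\eta_{t,i}}
   + \epsilon_{t,i}\,(w_{t,i} - w^*_i) + \frac{\eta_{t,i}}{2}\,\hat{g}_{t,i}^2, \\
% convexity (Assumption 1)
f_t(\mathbf{w}_t) - f_t(\mathbf{w}^*) &\le \mathbf{g}_t^\top(\mathbf{w}_t - \mathbf{w}^*), \\
% sum over entries and iterations
R(T) = \sum_{t=1}^{T}\bigl[f_t(\mathbf{w}_t) - f_t(\mathbf{w}^*)\bigr]
 &\le \sum_{t=1}^{T}\sum_{i=1}^{d}\Bigl[
   \frac{(w_{t,i} - w^*_i)^2 - (w_{t+1,i} - w^*_i)^2}{2\eta_{t,i}}
   + \epsilon_{t,i}\,(w_{t,i} - w^*_i) + \frac{\eta_{t,i}}{2}\,\hat{g}_{t,i}^2\Bigr], \\
% telescope the first term using Assumption 2
R(T) &\le \sum_{i=1}^{d}\frac{D_\infty^2}{2\eta_{T,i}}
   + \sum_{t=1}^{T}\sum_{i=1}^{d}\Bigl[\epsilon_{t,i}\,(w_{t,i} - w^*_i)
   + \frac{\eta_{t,i}}{2}\,\hat{g}_{t,i}^2\Bigr], \\
% same learning rate for all entries
R(T) &\le \frac{d\,D_\infty^2}{2\eta_T}
   + \sum_{t=1}^{T}\Bigl[\boldsymbol{\epsilon}_t^\top(\mathbf{w}_t - \mathbf{w}^*)
   + \frac{\eta_t}{2}\,\lVert\hat{\mathbf{g}}_t\rVert_2^2\Bigr], \\
% Cauchy-Schwarz with ||w_t - w^*||_2 <= sqrt(d) * D_infinity
R(T) &\le \frac{d\,D_\infty^2}{2\eta_T}
   + \sqrt{d}\,D_\infty\sum_{t=1}^{T}\lVert\boldsymbol{\epsilon}_t\rVert_2
   + \sum_{t=1}^{T}\frac{\eta_t}{2}\,\lVert\hat{\mathbf{g}}_t\rVert_2^2, \\
\frac{R(T)}{T} &\le \frac{d\,D_\infty^2}{2T\eta_T}
   + \frac{\sqrt{d}\,D_\infty}{T}\sum_{t=1}^{T}\lVert\boldsymbol{\epsilon}_t\rVert_2
   + \frac{1}{T}\sum_{t=1}^{T}\frac{\eta_t}{2}\,\lVert\hat{\mathbf{g}}_t\rVert_2^2.
\end{align*}
```

The three terms of the final bound make the two principles visible: the middle term grows with the quantization error of the gradients, and the last term grows with the learning rate applied to the quantized gradient.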


6.2 INT8 Training Stability

We plot the accuracy and loss curves of MobileNetV2 trained on the CIFAR-10 dataset and ResNet-50 trained on the ImageNet dataset to show the stability of INT8 training. From Figure 9 and Figure 10, we can see that our method makes INT8 training smooth and achieves accuracy comparable to FP32 training. The quantization noise increases the exploratory ability of INT8 training, since noise at the early stage of training could make the optimization direction more diverse. With a properly reduced learning rate, INT8 training sometimes even converges faster than FP32 training.

(a) the accuracy curve
(b) the loss curve
Figure 9: Comparison of INT8 training and FP32 training on CIFAR-10 using MobileNetV2.
(a) the accuracy curve
(b) the loss curve
Figure 10: Comparison of INT8 training and FP32 training on ImageNet using ResNet-50.

6.3 INT8 Convolution Speed Up Algorithm

6.3.1 INT8 Convolution

On NVIDIA GPUs with Pascal architectures (such as GP102, GP104, and GP106), the new 8-bit integer 4-element dot product with accumulation (DP4A) [44] instruction is supported. This enables the NVIDIA GeForce GTX 1080Ti (based on GP102) to achieve a peak integer throughput of 44 Tera Operations Per Second (TOPS), while the peak float throughput is only 11 Tera Float Operations Per Second (TFLOPS).

Since the release of cuDNN 6.0 [43], INT8 inference is supported but the INT8 backward process is not implemented. So we use the DP4A instruction to implement the INT8 backward process by ourselves. Moreover, we find that the quantization process before INT8 convolution computation is pretty time-consuming as the quantization needs to read and write the whole data. In order to reduce the overhead that quantization brings, we fuse the quantization process with the convolution computation (quantization-convolution fused kernel). In Figure 11, we can see that the combination of quantization and convolution could avoid one extra global memory read and write effectively. Thus we rewrite the INT8 forward and backward process using this quantization-convolution fused kernel and achieve a significant speed-up.
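The data flow behind the fused kernel can be illustrated with a small host-side C++ model. This is only a sketch under stated assumptions and not the GPU kernel itself: the quantizer (symmetric, round-to-nearest, clipped to [-127, 127]) and the function names `quantize`, `dot_unfused` and `dot_fused` are hypothetical, and a dot product stands in for the convolution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed quantizer: symmetric linear, round-to-nearest, clipped to [-127, 127].
int8_t quantize(float x, float scale) {
    long q = std::lround(x / scale);
    return static_cast<int8_t>(std::max(-127L, std::min(127L, q)));
}

// Unfused: quantize into a temporary INT8 buffer, then read it back to compute.
int32_t dot_unfused(const std::vector<float>& x, const std::vector<int8_t>& w, float scale) {
    assert(x.size() == w.size());
    std::vector<int8_t> xq(x.size());  // extra buffer: one full write ...
    for (std::size_t i = 0; i < x.size(); ++i) xq[i] = quantize(x[i], scale);
    int32_t acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i)  // ... and one full read
        acc += static_cast<int32_t>(xq[i]) * static_cast<int32_t>(w[i]);
    return acc;
}

// Fused: quantize each value right after loading it and consume it immediately,
// so the quantized tensor never makes a round trip through memory.
int32_t dot_fused(const std::vector<float>& x, const std::vector<int8_t>& w, float scale) {
    assert(x.size() == w.size());
    int32_t acc = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += static_cast<int32_t>(quantize(x[i], scale)) * static_cast<int32_t>(w[i]);
    return acc;
}
```

Both functions return the same value. The only difference is the temporary buffer, which on a GPU would correspond to the extra global memory write and read that the fused kernel avoids.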

Figure 11: Quantization-convolution fused kernel avoids one extra global memory read and write.
Figure 12: 4×4 8-bit integer block transpose in a thread using the prmt instruction.

In our implementation, we transpose the data layout into NC4HW so that we can use the DP4A instruction to conduct the convolution computation. We use the prmt instruction in Parallel Thread Execution and Instruction Set Architecture (PTX ISA) [44] to transpose the data efficiently. This instruction picks four arbitrary bytes from two 32-bit registers and reassembles them into a 32-bit destination register. Figure 12 shows that one thread can transpose a 4×4 block of 8-bit integers by using 12 prmt instructions with shared memory. The transpose implementation code is listed below.

// Transpose a 4x4 block of 8-bit integers held in four 32-bit registers.
// regLDG[r] holds row r (four packed bytes, assumed to be loaded beforehand);
// afterwards regPRMT.x/.y/.z/.w hold columns 0..3 of the block.
int regLDG[4];
int4 regPRMT;
int tmp;
// Column 0: take byte 0 of rows 0,1 and of rows 2,3, then merge the two low halves (0x5410).
asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(regPRMT.x) : "r"(regLDG[0]), "r"(regLDG[1]));
asm volatile("prmt.b32 %0, %1, %2, 0x0040;" : "=r"(tmp) : "r"(regLDG[2]), "r"(regLDG[3]));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(regPRMT.x) : "r"(regPRMT.x), "r"(tmp));
// Column 1: same pattern with byte 1 of each row.
asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(regPRMT.y) : "r"(regLDG[0]), "r"(regLDG[1]));
asm volatile("prmt.b32 %0, %1, %2, 0x0051;" : "=r"(tmp) : "r"(regLDG[2]), "r"(regLDG[3]));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(regPRMT.y) : "r"(regPRMT.y), "r"(tmp));
// Column 2: byte 2 of each row.
asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(regPRMT.z) : "r"(regLDG[0]), "r"(regLDG[1]));
asm volatile("prmt.b32 %0, %1, %2, 0x0062;" : "=r"(tmp) : "r"(regLDG[2]), "r"(regLDG[3]));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(regPRMT.z) : "r"(regPRMT.z), "r"(tmp));
// Column 3: byte 3 of each row.
asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(regPRMT.w) : "r"(regLDG[0]), "r"(regLDG[1]));
asm volatile("prmt.b32 %0, %1, %2, 0x0073;" : "=r"(tmp) : "r"(regLDG[2]), "r"(regLDG[3]));
asm volatile("prmt.b32 %0, %1, %2, 0x5410;" : "=r"(regPRMT.w) : "r"(regPRMT.w), "r"(tmp));
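To see why these selector constants yield a transpose, the byte permutation can be modelled on the host. The sketch below is an illustration, not part of the GPU implementation: `prmt` here is a plain C++ model of the default mode of the PTX instruction, restricted to selector nibbles 0 to 7, and `transpose4x4` is a hypothetical helper that replays the twelve steps of the listing.

```cpp
#include <cassert>
#include <cstdint>

// Host model of prmt.b32 (default mode): bytes 0-3 come from a, bytes 4-7 from b.
// Nibble i of the selector names the source byte placed in destination byte i.
uint32_t prmt(uint32_t a, uint32_t b, uint32_t selector) {
    const uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
    uint32_t d = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t nib = (selector >> (4 * i)) & 0xFu;
        assert(nib < 8);  // the sign-replication bit is not modelled here
        const uint32_t byte = static_cast<uint32_t>((pool >> (8 * nib)) & 0xFFu);
        d |= byte << (8 * i);
    }
    return d;
}

// Replays the listing: in[r] is row r of a 4x4 byte block, out[k] becomes column k.
void transpose4x4(const uint32_t in[4], uint32_t out[4]) {
    for (uint32_t k = 0; k < 4; ++k) {
        const uint32_t sel = ((4u + k) << 4) | k;  // 0x0040, 0x0051, 0x0062, 0x0073
        const uint32_t lo = prmt(in[0], in[1], sel);  // byte k of rows 0 and 1
        const uint32_t hi = prmt(in[2], in[3], sel);  // byte k of rows 2 and 3
        out[k] = prmt(lo, hi, 0x5410u);               // low halves of lo and hi
    }
}
```

With rows whose byte value encodes (row, column), such as 0x03020100 for row 0, each output word collects one column, which makes the transpose easy to check by eye.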

After transposition, we implement convolution with two algorithms, im2col plus GEMM [25, 33] and implicit GEMM [43], and choose the faster one for each convolution layer before training. Both algorithms convert the original convolution into dot products. We then use one float (32-bit) load instruction to load four INT8 values and one DP4A instruction to compute a four-element INT8 dot product with accumulation. This speeds up the INT8 convolution significantly.
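The arithmetic performed by one DP4A step can be modelled on the host as follows. This is a minimal sketch, assuming signed 8-bit operands packed four to a 32-bit word. The helper names `pack4`, `lane`, `dp4a_ref` and `dot_int8` are hypothetical, and on the GPU the packing would come for free from a single 32-bit load of data already stored four channels at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack four signed 8-bit values into one 32-bit word, x0 in the lowest byte.
uint32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    return static_cast<uint32_t>(static_cast<uint8_t>(x0))
         | (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16)
         | (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
}

// Extract byte i of a packed word and sign-extend it to 32 bits.
int32_t lane(uint32_t word, int i) {
    const int32_t v = static_cast<int32_t>((word >> (8 * i)) & 0xFFu);
    return v >= 128 ? v - 256 : v;
}

// Reference model of a signed DP4A: acc + sum over the four byte lanes of a[i] * b[i].
int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) acc += lane(a, i) * lane(b, i);
    return acc;
}

// INT8 dot product built from DP4A steps; n must be a multiple of 4.
int32_t dot_int8(const int8_t* x, const int8_t* y, std::size_t n) {
    assert(n % 4 == 0);
    int32_t acc = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        const uint32_t xa = pack4(x[i], x[i + 1], x[i + 2], x[i + 3]);  // stands in for one 32-bit load
        const uint32_t yb = pack4(y[i], y[i + 1], y[i + 2], y[i + 3]);
        acc = dp4a_ref(xa, yb, acc);
    }
    return acc;
}
```

Accumulating in 32 bits matters: a single step with all operands at 127 already produces 64516, far outside the INT8 range.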

6.3.2 Stochastic Rounding

Because gradients are quantized with stochastic rounding, we need to generate uniform random numbers during the backward process. One option is the host-side generator of the cuRAND library, but it needs extra global memory access, which significantly degrades our INT8 convolution performance and increases time consumption by over 100%. Another option is the device-side cuRAND routines, which need a unique generator state for each thread to produce different random numbers and therefore require a large amount of GPU memory. Worse still, this method runs as slowly as the first one. Given both disadvantages, we instead use a Linear Congruential Generator (LCG) [45] to yield a sequence of pseudo-random numbers.

The generator is defined by the recurrence relation

$$X_{n+1} = (a X_n + c) \bmod m,$$

where $X$ is the sequence of pseudo-random values, $m$ is the modulus, $a$ is the multiplier, $c$ is the increment, and $X_0$ is the random seed. The parameters $m$, $a$ and $c$ are set to constants.

To get a different random seed in each thread, we set the random seed $X_0$ to the first input datum and add the thread index to it. With these settings, each thread gets a unique random seed. The LCG method generates random numbers quickly and adds only slight time consumption to INT8 convolution.
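A host-side sketch of this scheme is given below. The constants are an assumption, since the values used in the original implementation are not stated: the sketch takes the widely used pair a = 1664525, c = 1013904223 with m = 2^32. The names `Lcg`, `make_thread_rng` and `stochastic_round` are likewise hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// LCG with assumed constants: m = 2^32, a = 1664525, c = 1013904223.
struct Lcg {
    uint32_t state;  // X_n

    // X_{n+1} = (a * X_n + c) mod 2^32; unsigned overflow performs the modulo.
    uint32_t next() {
        state = 1664525u * state + 1013904223u;
        return state;
    }

    // Uniform value in [0, 1) built from the top 24 bits of the state.
    float next_uniform() {
        return static_cast<float>(next() >> 8) * (1.0f / 16777216.0f);
    }
};

// Per-thread seed: bits of the first input value plus the thread index.
Lcg make_thread_rng(uint32_t first_input_bits, uint32_t thread_index) {
    return Lcg{first_input_bits + thread_index};
}

// Stochastic rounding: floor(x + u) rounds up with probability equal to the
// fractional part of x, so the result equals x in expectation.
int32_t stochastic_round(float x, float u) {
    assert(u >= 0.0f && u < 1.0f);
    return static_cast<int32_t>(std::floor(x + u));
}
```

Each thread needs only one 32-bit register of generator state and no global memory traffic, which is the property that motivates the LCG in the text above.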