Training High-Performance and Large-Scale Deep Neural Networks with Full 8-bit Integers

09/05/2019 ∙ by Yukuan Yang, et al. ∙ Tsinghua University ∙ The Regents of the University of California ∙ Beijing Institute of Technology

Deep neural network (DNN) quantization, which converts floating-point (FP) data in the network to integers (INT), is an effective way to shrink the model size for memory saving and to simplify the operations for compute acceleration. Recently, research on DNN quantization has developed from inference to training, laying a foundation for online training on accelerators. However, existing schemes are mostly incomplete quantization: they leave batch normalization (BN) untouched during training and still adopt high-precision FP in parts of the data paths. Currently, there is no solution that can use only low bit-width INT data during the whole training process of large-scale DNNs with acceptable accuracy. In this work, by decomposing all the computation steps in DNNs and fusing three special quantization functions to satisfy the different precision requirements, we propose a unified and complete quantization framework termed "WAGEUBN" to quantize DNNs over all data paths, including W (Weights), A (Activation), G (Gradient), E (Error), U (Update), and BN. Moreover, the Momentum optimizer is also quantized to realize a completely quantized framework. Experiments on ResNet18/34/50 models demonstrate that WAGEUBN can achieve competitive accuracy on the ImageNet dataset. For the first time, the study of quantization in large-scale DNNs is advanced to the full 8-bit INT level. In this way, all the operations in training and inference can be bit-wise operations, pushing towards faster processing speed, decreased memory cost, and higher energy efficiency. Our thorough quantization framework has great potential for future efficient portable devices with online learning ability.


I Introduction

Deep neural networks [13] have achieved state-of-the-art results in many fields, such as image processing [26], object detection [19], natural language processing [3], and robotics [27], by learning high-level features from a large amount of input data. However, due to the huge number of floating-point (FP) values and complex FP multiply-accumulate operations (MACs) involved in network training and inference, the intensive memory overhead, large computational complexity, and high energy consumption impede the wide deployment of deep learning models. DNN quantization [22], which converts FP MACs into bit-wise operations, is an effective way to reduce the memory and computation costs and to improve the speed of deep learning accelerators.

As research deepens, DNN quantization has gradually shifted from inference optimization (BWN [4], XNOR-Net [18], ADMM [15]) to training optimization (DoReFa [28], GXNOR-Net [8], FP8 [24]). Usually, inference quantization focuses on the forward pass, while training quantization further quantizes the backward pass and the weight update. Recently, training quantization has become a hot topic in the network compression community. However, there are still two major issues in existing schemes. The first issue is incomplete quantization, which has two aspects: partial quantization and FP dependency. Partial quantization means that only part of the dataflow, not all of it, is quantized (e.g., DoReFa [28], GXNOR-Net [8], and QBP2 [1]); FP dependency means that FP values still remain during the training process (e.g., MP [16] and FP8 [24]). The second issue is that the quantization of batch normalization (BN) [10] is ignored by most schemes (e.g., MP-INT [6] and FX Training [20]). BN is an essential layer for training DNNs: it addresses the internal covariate shift of each layer's inputs, especially as the network deepens, allowing a much higher learning rate and less careful weight initialization.

Compared with the studies above, WAGE [25] is the most thorough DNN quantization work: it quantizes W (Weights), A (Activation), G (Gradient), E (Error), and U (Update), and replaces each BN layer with a constant scaling factor. WAGE has achieved competitive results on LeNet [14], VGG [21], and AlexNet [12], providing a good inspiration for this work. However, we find that WAGE is difficult to apply to large-scale DNNs due to the absence of BN layers. Besides, it is known that gradient descent optimizers such as Momentum [17] or Adam [11] increase training stability and even help escape local optima, significantly improving the convergence speed and final performance. A complete quantization should cover the entire training process, including W, A, G, E, U, BN, and the optimizer. Regretfully, up to now, there is still no solution that achieves this complete quantization, especially on large-scale DNNs.

To address these issues and realize completely quantized training of large-scale DNNs, we propose a unified quantization framework termed "WAGEUBN" that constrains W, A, G, E, U, BN, and the optimizer in the low-bit integer (INT) space. All computation steps and operands in DNNs are decomposed and quantized. According to the various data distributions in DNN training, we fuse three quantization functions to satisfy the different precision requirements. Furthermore, we find that 8-bit errors lose too much information and cause non-convergence due to insufficient data coverage. To solve this problem, we propose a new storage and computing method that introduces a flag bit to expand the data coverage. Under the WAGEUBN framework, all operations in the training and inference of DNNs can be simple bit-wise operations. WAGEUBN shows competitive accuracy with much lower memory cost, training time, and energy consumption on ResNet18/34/50 over the ImageNet [7] dataset. This work provides a feasible idea for the architecture design of future efficient online learning devices. The contributions of this work are twofold, summarized as follows:

  • We address the two issues existing in most quantization schemes by fully quantizing all the data paths, including W, A, G, E, U, BN, and the optimizer, greatly reducing the memory and compute costs. Moreover, we constrain the data to INT8 for the first time, pushing training quantization to a new bit level compared with the existing FP16, INT16, and FP8 solutions.

  • Our quantization framework is validated on large-scale DNN models (ResNet18/34/50) over the ImageNet dataset and achieves competitive accuracy with much lower overhead, indicating great potential for future portable devices with online learning ability.

The organization of this paper is as follows: Section II introduces the related work of DNN quantization; Section III details the WAGEUBN framework; Section IV presents the experiment results of WAGEUBN and the corresponding analyses; Section V summarizes this work and delivers the conclusion.

II Related Work

With the wide application of DNNs, related compression technologies have developed rapidly, among which quantization plays an important role. The development of DNN quantization can be divided into two stages, inference quantization and training quantization, according to the different quantization objects.

Inference quantization: Inference quantization starts from constraining W to binary values (BWN [4]), replacing complex FP MACs with simple accumulations. BNN [5] and XNOR-Net [18] further quantize both W and A, making the inference computation dominated by bit-wise operations. However, extremely low bit-width quantization usually leads to significant accuracy loss. For example, when the bit width drops to 4 bits, the accuracy degradation becomes obvious, especially for large-scale DNNs. In practice, the bit width of W and A for inference quantization can be reduced to 8 bits with little accuracy degradation. The study of inference quantization is sufficient for deep learning inference accelerators; however, it is not enough for efficient online learning accelerators because only the data in the forward pass are considered.

Training quantization: To further extend quantization towards the training stage, DoReFa [28] trains DNNs with low bit-width W, A, and G, while leaving E and BN unprocessed. MP [16] and MP-INT [6] use FP16 and INT16 values, respectively, to constrain W, A, and G. Recently, FP8 [24] further pushed W, A, G, E, and U to 8, 8, 8, 8, and 16-bit FP values, respectively, still leaving BN untouched. QBP2 [1] replaces the conventional BN with range BN and constrains W, A, and E to INT8 values, while calculating G with FP MACs. More recently, WAGE [25] adopted a layer-wise scaling factor instead of the BN layer and quantized W, A, G, E, and U to 2, 8, 8, 8, and 8 bits, respectively. Despite its thorough quantization, WAGE is difficult to apply to large-scale DNNs due to the absence of powerful BN layers. In summary, there is still no complete INT8 quantization framework for training large-scale DNNs with high accuracy.

III WAGEUBN Framework

The main idea of WAGEUBN is to quantize all the data in DNN training to INT8 values. In this section, we detail the WAGEUBN framework implemented in large-scale DNN models. The organization of this section is as follows: Subsection III-A and Subsection III-B describe the notations and quantization functions, respectively; Subsection III-C explains the specific quantization schemes for W, A, G, E, U, BN, and the Momentum optimizer; Subsection III-D goes through the overall implementation of WAGEUBN in both the forward and backward passes; Subsection III-E summarizes the whole process and gives the pseudo code.

III-A Notations

Before formally introducing the WAGEUBN quantization framework, we need to define some notations. Considering the l-th layer of a DNN, we divide the forward pass into four steps, as described in Figure 1 (BN is divided into two steps: Normalization, and Scale & Offset).

Fig. 1: Forward quantization of the l-th layer in DNNs.

Specifically, we have

(1)

where the output of the l-th layer serves as the input of the (l+1)-th layer; the quantized weights are used for convolution; the quantization function given in Equation (12) constrains the BN operands to low-bit INT values; the quantized mean and standard deviation are computed over one mini-batch; the quantized scale (γ) and offset (β) are the parameters of the BN layer; the quantization function for activations will be detailed in Equation (13); and the activation function is one commonly used in DNNs. Note that every step in the forward pass is quantized, so all of these quantities are integers. To keep the notations consistent, the subscript q denotes data that have been quantized to INT values, and the superscript l denotes the layer index.

Fig. 2: Backward quantization of the l-th layer in DNNs.

Different from most existing schemes, we define the error E and the gradient G separately: E represents the gradient of A (activation), which is used in error backpropagation, and G represents the gradient of W (weights), which is used in the weight update. Moreover, we quantize the BN layers in both the forward and backward passes, which is not well addressed in most prior work. Similar to the forward pass, we divide the backward pass of the l-th layer into five steps, as shown in Figure 2. According to the derivative chain rules, we have

(2)

where the derivative of the loss function, the error propagated from the next layer, and the Hadamard product (the element-wise multiplication of vectors with the same dimension) are involved. Two quantization functions are used here: the one detailed in Equation (14) converts high bit-width integers to low bit-width integers, while the one detailed in Equation (16) converts FP values to low bit-width integers. The transposed weight matrix propagates the error backward together with the derivative of the activation; when ReLU is used as the activation function, this derivative is a tensor containing only 0 and 1 elements.

According to the definitions given above, the gradients of W, γ, and β can be summarized as follows

(3)

To further reduce the bit width of G, which increases greatly after the multiplication, we have

(4)

where the quantization functions for the gradients of W, γ, and β will be given in Equation (17).

Some notations to be used below are also explained here. W, A, G, E, and BN each have their own bit width; E has two bit widths because it is quantized both after the activation and between convolution and BN. The updates of W, γ, and β also have their own bit widths, which are the bit widths of the corresponding data stored in memory. The quantized mean, standard deviation, scale, and offset used in the BN layer, the gradients of γ and β, the momentum coefficient and accumulation used in the Momentum optimizer, and the learning rate each have a bit width as well.

III-B Quantization Functions

There are three quantization functions used in WAGEUBN. The direct-quantization function uses the nearest fixed-point values to represent the continuous values of W, A, and BN. The constant-quantization function for G is used to keep the bit width of U (update) fixed since G is directly related to U. Because U and the weights stored in memory have the same bit width, the bit width of weights stored in memory can be fixed, which is more hardware-friendly. The magnitude of E is very small, so the shift-quantization function reduces the bit width of E greatly compared with the direct-quantization function under the same precision.


(1) Direct-quantization function

The direct-quantization function simply approximates a continuous value to its nearest discrete state and is defined as

(5)

where the bit width determines the quantization step and the rounding operation maps a value to its nearest INT state.
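As a concrete illustration, the following NumPy sketch shows one plausible form of such a round-to-nearest fixed-point quantizer; the function name and the choice of 2**(1-k) as the quantization step are our assumptions for illustration, not the paper's exact definition.

    import numpy as np

    def direct_quantize(x, k):
        # Assumed form: round x to the nearest multiple of the k-bit step 2**(1-k).
        step = 2.0 ** (1 - k)
        return np.round(np.asarray(x, dtype=float) / step) * step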


(2) Constant-quantization function

The intention of the constant-quantization function is to first normalize a tensor, then limit it to INT values, and finally maintain its magnitude. It is governed by

(6)
Fig. 3: Illustration of the constant-quantization function during training.

The constant-quantization function is illustrated in Figure 3. First, the maximum absolute value of the tensor is projected to its nearest power-of-two fixed-point value, which prepares for normalization. A stochastic rounding function then converts each continuous value to a nearby INT value in a probabilistic manner, where the fractional part serves as the rounding probability, and a saturation function limits the data range. A shift parameter shifts the distribution of G and limits it to a symmetric INT range; this parameter decreases as training goes on, presenting the same effect as reducing the learning rate. For example, with 8-bit gradients, the function maps G to {-127, -126, ..., 126, 127} in the early training stage and to {-63, -62, ..., 62, 63} in the later training stage. Finally, a constant scaling factor is utilized to maintain the magnitude order of the data.
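The sketch below gives one possible reading of these steps (power-of-two normalization, stochastic rounding, saturation controlled by a shift parameter, and a constant rescaling factor); the function name, the clipping bounds, and the default scale are assumptions for illustration only.

    import numpy as np

    def constant_quantize_grad(g, k=8, shift=0, scale=2.0 ** -12):
        g = np.asarray(g, dtype=float)
        # Project max|g| onto its nearest power of two, used for normalization.
        max_pow2 = 2.0 ** np.ceil(np.log2(np.max(np.abs(g)) + 1e-12))
        g_norm = g / max_pow2                               # roughly within [-1, 1]
        # Map to integers; the shift parameter narrows the range as training proceeds.
        levels = 2 ** (k - 1 - shift) - 1                   # 127 when k=8, shift=0; 63 when shift=1
        scaled = g_norm * levels
        floor = np.floor(scaled)
        prob = scaled - floor                               # fractional part = rounding probability
        g_int = floor + (np.random.rand(*g.shape) < prob)   # stochastic rounding
        g_int = np.clip(g_int, -levels, levels)             # saturation
        # A constant scaling factor keeps the magnitude order of the gradients.
        return g_int * scale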


(3) Shift-quantization function

The shift-quantization function serves for the quantization of E and is defined as

(7)

where the quantization step is the minimum interval of a k-bit INT and the direct-quantization function is the one defined in Equation (5).

The shift-quantization function normalizes E first, then converts E to fixed-point values, and finally uses a layer-wise scaling factor (as in Equation (6)) to maintain the magnitude. The differences between the constant-quantization function and the shift-quantization function mainly lie in two points: first, the constant-quantization uses a constant to keep the magnitude for hardware friendliness while the shift-quantization uses a layer-wise scaling factor; second, the constant-quantization contains a stochastic rounding process while the shift-quantization function does not.
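A minimal sketch of this behavior is given below, assuming a power-of-two layer-wise factor and the same 2**(1-k) step as the direct-quantization sketch above; both assumptions are ours, not the paper's exact definition.

    import numpy as np

    def shift_quantize(e, k):
        e = np.asarray(e, dtype=float)
        # Layer-wise power-of-two scaling factor derived from the largest error magnitude.
        scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(e)) + 1e-12))
        step = 2.0 ** (1 - k)                        # minimum interval of a k-bit INT
        e_q = np.round((e / scale) / step) * step    # direct-quantize the normalized error
        return e_q * scale                           # restore the original magnitude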

III-C Quantization Schemes in WAGEUBN

After introducing the quantization functions used in our WAGEUBN framework, we provide the detailed quantization schemes.


(1) Weight Quantization

Since weights are stored and used as fixed-point values, they should also be initialized discretely. The initialization method proposed by MSRA [9] has been shown to help faster training. The initialization of weights can be formulated as follows

(8)

where the layer's fan-in number determines the initialization scale and the bit width of the weight update (which is also the storage bit width) determines the discretization.

Because the bit widths for weight storage and computation differ, the stored weights should be quantized from the storage bit width down to the bit width used for convolution. In addition, we also limit the data range of W. Finally, the quantization function for W is

(9)
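As an illustration, the sketch below combines an MSRA-style discrete initialization with the re-quantization of stored weights to a lower bit width for convolution; the clipping range, the helper names, and the bit-width defaults are assumptions, not the paper's settings.

    import numpy as np

    def init_quantized_weight(shape, fan_in, k_store=24):
        # MSRA-style initialization, snapped to the storage bit width (assumed step 2**(1-k)).
        w = np.random.randn(*shape) * np.sqrt(2.0 / fan_in)
        step = 2.0 ** (1 - k_store)
        return np.round(w / step) * step

    def quantize_weight_for_conv(w_stored, k_conv=8):
        # Re-quantize the stored weights to k_conv bits and limit their range for convolution.
        step = 2.0 ** (1 - k_conv)
        w = np.round(w_stored / step) * step
        return np.clip(w, -1.0 + step, 1.0 - step)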

(2) Batch Normalization Quantization

As aforementioned, BN plays an important role in training large-scale DNNs. WAGE [25] has proved that simple scaling layers are not enough to replace BN layers. A conventional BN layer can be divided into two steps as

(10)

where the mean and standard deviation are computed over one mini-batch in the l-th layer; a small positive value is added to the denominator to avoid division by zero; and the scale and offset are learnable parameters.

Under the WAGEUBN framework, the BN layer is also quantized. Through the operations described in Equation (11), all operands are quantized and all operations are bit-wise. Specifically, the quantization follows

(11)

where the quantization functions that convert the operands to fixed-point values are defined as

(12)

Here, the small positive value is also a fixed-point value, playing the same role as its counterpart in Equation (10), and the quantized mean and standard deviation each have their own bit width.
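The sketch below outlines a quantized BN forward pass in the spirit of Equations (10)-(12): the batch statistics, the normalized result, and the scale-and-offset output are all snapped to fixed-point values. The bit widths, the placement of the small constant, and the helper names are assumptions made for illustration.

    import numpy as np

    def q(x, k):
        # Direct-quantization helper (assumed step 2**(1-k)).
        step = 2.0 ** (1 - k)
        return np.round(x / step) * step

    def quantized_bn_forward(x, gamma_q, beta_q, k_bn=8, k_stat=16, eps_q=2.0 ** -10):
        mu_q = q(np.mean(x, axis=0), k_stat)             # quantized mini-batch mean
        sigma_q = q(np.std(x, axis=0), k_stat)           # quantized mini-batch standard deviation
        x_hat = q((x - mu_q) / (sigma_q + eps_q), k_bn)  # Normalization step
        return q(gamma_q * x_hat + beta_q, k_bn)         # Scale & Offset step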


(3) Activation Quantization

After the convolution and BN layers in the forward pass, the bit width of operands increases due to the multiplication operation. To reduce the bit width and keep the input bit width of each layer consistent, we need to quantize the activations. Here, the quantization function for activations can be described as

(13)

where the bit width is that of the activations.


(4) Error Quantization

In Equation (2), we have given the definition and quantization of E. Through investigating the importance of error propagation in DNN training, we find that the quantization of E is essential for model convergence. If E is naively quantized using the direct-quantization function, a large operand bit width is required to realize the convergence of DNNs. Instead, we use the following shift-quantization function

(14)

where the shift-quantization function is the one defined in Equation (7) and the bit width is that of the error defined in Equation (2).

As mentioned above, two quantization functions are used for the error: one for the error after activation and one for the error between convolution and BN. However, their precision requirements vary a lot. Experiments show that the quantization of the former has little effect on accuracy, while quantizing the latter with the shift-quantization function causes the non-convergence of large-scale DNNs; 16 bits is a proper value for training DNNs with minimum accuracy degradation. More analyses will be given in Subsection IV-E. Here we provide two versions of this quantization function, a 16-bit version and an 8-bit version. The 16-bit version is defined as

(15)

where the bit width is that of the error between convolution and BN defined in Equation (2).

Experiments have proved that the data range covered by an 8-bit version of Equation (15) is not sufficient to train DNNs. In order to expand the coverage of the quantization function while still maintaining a low bit width, we introduce a layer-wise scaling factor and a flag bit. To distinguish it from the function defined in Equation (15), we name this quantization function the Flag version, and the quantization process is governed by

(16)

where the layer-wise scaling factor, the flag bit, and the sign and data bits are explained below.

Fig. 4: Data format of 9-bit integers.

By introducing a layer-wise scaling factor and a flag bit, we can expand the data coverage greatly; details can be found in Figure 4. The flag bit indicates whether the absolute value stored in memory is less than the layer-wise scaling factor, so the same data bits can be interpreted in either a fine-grained or a coarse-grained range. The sign bit denotes the positive or negative direction of the value, and the data bits follow the conventional binary format; Figure 4 gives examples of the values represented under both flag settings. Therefore, the 9-bit data format can cover almost the same data range as the direct 15-bit quantization described in Equation (15). Since the flag bit is only used for judgment, the effective value for computation is still INT8.
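A possible encoding/decoding sketch of this 9-bit format is given below: the flag selects between a coarse step tied to the layer-wise scaling factor and a finer step for small values, while the sign and 7 data bits form a regular INT8 magnitude. The step ratio of 2**7 and the helper names are assumptions made to mimic the claimed 15-bit coverage, not the paper's exact encoding.

    import numpy as np

    def encode_flag_int8(e, scale):
        e = np.asarray(e, dtype=float)
        flag = (np.abs(e) >= scale).astype(np.int8)      # 1: values at or above the scale, 0: below
        step = np.where(flag == 1, scale / 127.0, scale / (127.0 * 128.0))
        data = np.clip(np.round(e / step), -127, 127).astype(np.int8)  # sign + 7 data bits
        return flag, data

    def decode_flag_int8(flag, data, scale):
        step = np.where(flag == 1, scale / 127.0, scale / (127.0 * 128.0))
        return data * step                               # approximate recovered error value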


(5) Gradient Quantization

The gradient is another important part of DNN training because it is directly related to the weight update. The rules for calculating and quantizing the gradients of W, γ, and β are described in Equations (3) and (4). Since all operands involved are fixed-point values, the conventional FP MAC operations can be replaced with bit-wise operations when calculating these gradients. The quantization functions are defined to further reduce the bit width of the gradients and prepare for the next step, the optimizer. Specifically, we have

(17)

where the constant-quantization function defined in Equation (6) is applied with the respective bit widths of the gradients of W, γ, and β.


(6) Momentum Optimizer Quantization

The Momentum optimizer is one of the most common optimizers in DNN training, especially for classification tasks. For the t-th training step of the l-th layer, the conventional Momentum optimizer works as follows

(18)

where the accumulation in the t-th training step is computed from the accumulation in the (t-1)-th step, a constant momentum coefficient, and the gradient of W, γ, or β.

The Momentum optimizer under the WAGEUBN framework constrains all operands to fixed-point values. The process can be formulated as

(19)

where the quantized accumulation in the t-th training step is obtained from the quantized gradient of W, γ, or β using the quantization function defined as

(20)

To guarantee the consistency of bit width, we further set

(21)
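A sketch of the quantized accumulation step is shown below, where the momentum coefficient, the incoming gradient, and the stored accumulation are all fixed-point values; the coefficient value and bit width are placeholders, not the settings used in the paper.

    import numpy as np

    def q(x, k):
        # Direct-quantization helper (assumed step 2**(1-k)).
        step = 2.0 ** (1 - k)
        return np.round(x / step) * step

    def quantized_momentum_step(acc_prev_q, grad_q, m_q=0.875, k_acc=16):
        # Conventional momentum rule applied to quantized operands, then re-quantized.
        acc = m_q * acc_prev_q + grad_q
        return q(acc, k_acc)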

(7) Update Quantization

The parameter update is the last step in the training of each mini-batch. Different from conventional DNNs, where the learning rate can take any FP value, the learning rate under WAGEUBN must also be a fixed-point value, and the bit width of the update is directly related to the bit width of the learning rate. The update under the quantized Momentum optimizer can be described as

(22)

where the update of W has its designated bit width and the learning rate is a fixed-point value with its own bit width. The updates of γ and β are computed in the same way as in Equation (22). According to Equations (19), (21), and (22), we have

(23)

In our evaluations, the precision of the update has the greatest impact on the accuracy of DNNs because it is the last step that constrains the parameters. Thus, we need to set a reasonable bit width for the update to balance model accuracy and memory cost.
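The sketch below applies a power-of-two learning rate to the quantized accumulation and snaps the result back to the storage bit width, so the stored weights stay at a fixed fixed-point precision; the learning-rate value and bit width are placeholders rather than the paper's settings.

    import numpy as np

    def quantized_update(w_stored, acc_q, lr_q=2.0 ** -10, k_store=24):
        step = 2.0 ** (1 - k_store)
        delta = lr_q * acc_q                             # fixed-point update
        return np.round((w_stored - delta) / step) * step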

III-D Quantization Framework

Fig. 5: Overview of the WAGEUBN quantization framework. “BW” denotes bit-wise operations.

Given the quantization details of W, A, G, E, U, BN, and the Momentum optimizer, the overall quantization framework is depicted in Figure 5. Under this framework, conventional FP MACs can be replaced with bit-wise operations. The forward pass of the l-th layer is divided into three parts, Conv (convolution), BN, and activation, whose intermediate quantities are defined in Equation (1). The weights are stored as higher bit-width integers and mapped to low bit-width INT values before convolution. After convolution, the mean and standard deviation over one mini-batch are calculated and quantized to their respective bit widths, and the normalization result is constrained to the BN bit width. Similar to W, the scale γ and offset β are stored at higher bit widths and quantized to lower bit widths before being used. After the second step of BN, the activation function and the activation quantization are applied, reducing the increased bit width again and preparing the inputs for the next layer.

The backward pass of the l-th layer is much more complicated than the forward pass, covering error propagation, the weight gradient, the BN gradients, the Momentum optimizer, and the weight update. In the process of error propagation, the intermediate quantities are defined in Equation (2), and there are two locations that need quantization: one reduces the bit width of the error between convolution and BN, and the other constrains the error after the derivative of the activation function. In the phase of calculating the gradients of the weights and BN parameters, the gradient quantization functions are leveraged to reduce the increased bit width caused by the multiplication operations.

All parameters of the Momentum optimizer in the t-th training step are quantized. Different from the conventional FP learning rate, WAGEUBN requires a discrete learning rate so that the bit width of the weight update can be controlled. The updates of γ and β in the BN layers are similar to the weight update and are omitted in Figure 5 for simplicity.

III-E Overall Algorithm

Algorithm 1 Forward pass of the l-th layer
  Convolution:
    quantize the stored weights to the convolution bit width (Equation (9))
    compute the convolution output with bit-wise operations
  BN:
    compute the mini-batch mean and standard deviation and quantize them (Equation (12))
    normalize the convolution output and quantize the result (Equation (11))
    quantize the scale γ and offset β, then apply Scale & Offset (Equation (11))
  Activation:
    apply the activation function and quantize the activations (Equation (13))

Algorithm 2 Backward pass of the l-th layer
  Error propagation:
    quantize the error after the activation derivative (Equation (14))
    propagate the error backward through Scale & Offset and Normalization
    quantize the error between Conv and BN (Equation (15) or (16))
    propagate the error to the previous layer with the transposed weights
  Gradient computation:
    compute the gradients of W, γ, and β from the quantized errors and activations (Equation (3))
    quantize the gradients with the constant-quantization function (Equations (4) and (17))
  Momentum optimizer (t-th step):
    update the quantized accumulation with the quantized gradients (Equations (19)-(21))
  Weight updates (t-th step):
    update W, γ, and β with the fixed-point learning rate (Equations (22) and (23))

Given the WAGEUBN framework, we summarize the entire quantization process and present the pseudo code for the forward and backward passes in Algorithms 1 and 2, respectively.

IV Results

IV-A Experimental Setup

To verify the effectiveness of the proposed quantization framework, we apply WAGEUBN to ResNet18/34/50 on the ImageNet dataset. We provide two versions of WAGEUBN: a full 8-bit INT version, in which the bit widths of W, A, G, E, and BN are all equal to 8, and a version in which the error between convolution and BN is kept at 16 bits. The only difference between the two versions lies in the quantization function for this error (Equation (15) for the 16-bit version and Equation (16) for the full 8-bit version). Three of the remaining bit widths are set to 15; four others are set to 3, 13, 10, and 24, respectively, so as to satisfy Equations (21) and (23); and three more are set to 16. Since W, A, G, and E occupy the majority of the memory and compute costs, their bit widths are reduced as much as possible. Other parameters occupying far fewer resources can take larger bit widths to maintain accuracy, e.g., the mean, standard deviation, scale, and offset in the BN layers.

The first and last layers are believed to differ from the rest because they interface with the network inputs and outputs. Quantizing these two layers causes significant accuracy degradation compared to the hidden layers, while they consume little overhead due to the small number of neurons. Therefore, following previous work [23, 2], we do not quantize the first and last layers.

IV-B Training Curve

Fig. 6: Training curves under the WAGEUBN framework: (a) ResNet18; (b) ResNet34; (c) ResNet50.

Figure 6 illustrates the accuracy comparison between vanilla DNNs (FP32), DNNs with the 8-bit version of WAGEUBN, and DNNs with the 16-bit version of WAGEUBN. The training curves show that there is little difference between vanilla DNNs and the ones under the WAGEUBN framework when the training epoch is less than 60, which reflects the effectiveness of our approach. As training proceeds, the accuracy gap begins to grow because the learning rate in vanilla DNNs becomes much lower than the fixed-point learning rate in WAGEUBN, so the update of vanilla DNNs is more precise than that under the WAGEUBN framework. We can further improve the accuracy by reducing the learning rate, but the bit widths of the learning rate and the update would need to increase accordingly at the expense of more overhead.

Network    Configuration (bit widths)               Top-1/Top-5 (%)
ResNet18   Vanilla FP32                             68.70/88.37
           WAGEUBN, 16-bit error (8 8 8 8 16 24)    66.92/87.42
           WAGEUBN, full 8-bit (8 8 8 8 8 24)       63.62/84.80
ResNet34   Vanilla FP32                             71.99/90.56
           WAGEUBN, 16-bit error (8 8 8 8 16 24)    68.50/87.96
           WAGEUBN, full 8-bit (8 8 8 8 8 24)       67.63/87.70
ResNet50   Vanilla FP32                             74.66/92.13
           WAGEUBN, 16-bit error (8 8 8 8 16 24)    69.07/88.45
           WAGEUBN, full 8-bit (8 8 8 8 8 24)       67.95/88.01
TABLE I: Accuracy of vanilla DNNs and WAGEUBN DNNs on the ImageNet dataset.

Table I quantitatively presents the accuracy comparison between vanilla DNNs and WAGEUBN DNNs on the ImageNet dataset. We have achieved state-of-the-art accuracy on large-scale DNNs with full 8-bit INT quantization. The 16-bit WAGEUBN only loses 3.62% mean accuracy compared with the vanilla DNNs. Because the bit width of most data stays the same between the full 8-bit WAGEUBN and the 16-bit WAGEUBN, the overhead difference between them is negligible. Compared with the vanilla DNNs, a large memory shrink, much faster processing speed, and much less energy and circuit area can be achieved under the proposed WAGEUBN framework.

IV-C Quantization Strategies for W, A, G, E, and BN

In our WAGEUBN framework, we use different quantization strategies for W, A, G, E, and BN: the direct-quantization function for W, A, and BN; the constant-quantization function for G; and the shift-quantization (or Flag) function for E. The different strategies are based on the data distribution, data sensitivity, and hardware friendliness. Figure 7 compares the distributions of W, BN (the BN output defined in Equation (1)), A, G (the weight gradient), and E (the errors defined in Equation (2)) before and after quantization.

Fig. 7: Data distribution comparison between W, BN, A, G, and E before and after quantization.

According to its definition, the direct-quantization function has a fixed resolution when the bit width equals 8 and places no limitation on the data range. Because W, BN, and A in the inference stage directly affect the loss function and further influence the backpropagation, their quantization should be as precise as possible to avoid loss fluctuation. This is guaranteed because the resolution of the direct-quantization function is sufficient for W, BN, and A, which indicates that the direct-quantization function barely changes their data distributions.

The constant-quantization function has a fixed resolution, and the data range after quantization shrinks as the training epoch goes on because the shift parameter changes during training. Figure 7 reveals that the constant-quantization function changes the data distribution of G greatly, while the network accuracy barely declines as a result. The reason behind this phenomenon is that it is the orientation rather than the magnitude of the gradients that guides DNNs to converge. In the meantime, the bit width of the update can be kept fixed when the scaling constant is fixed, which is more hardware-friendly since the bit width of the weights stored in memory can also be fixed during training.

The shift-quantization retains the magnitude order but omits the values whose absolute value falls below its smallest quantization step. The 8-bit shift-quantization works well for the quantization of the error after activation (defined in Equation (2)). However, we find that shift-quantization is not enough for the quantization of the error between Conv and BN (also defined in Equation (2)). Therefore, the newly designed quantization function in Equation (16), named the 8-bit Flag version, is utilized. Figure 7 shows that the distribution of this error is almost the same before and after quantization, revealing the validity of the 8-bit Flag quantization function.

IV-D Accuracy Sensitivity Analysis

To compare the influence of quantizing W, A, G, E, and BN individually, we quantize each of them to 8-bit INT separately while keeping an FP32 update. Taking W as an example, we quantize only W to 8-bit INT and leave the others (A, BN, G, E, and U) in FP32. The quantization function for each kind of data is the same as described in Section III-C (Equation (16) is used for the quantization of the error between Conv and BN).

Bit-width configuration: a single component quantized to 8-bit INT at a time (others FP32)
Accuracy Top-1/Top-5 (%): 67.98/88.02  68.01/87.96  67.74/87.89  67.88/87.89  67.88/87.92  67.08/87.44
TABLE II: Accuracy sensitivity under WAGEUBN with single data quantization on ResNet18.

The results of ResNet18 under the WAGEUBN framework with single data quantization are shown in Table II. The accuracy under single data quantization reflects how difficult it is to quantize W, A, G, E, and BN separately. From the table, we can see that the quantization of E, especially the error between Conv and BN defined in Equation (2), has the greatest impact on accuracy. In addition, we find that the accuracy fluctuates heavily during training when this error is constrained to 8-bit INT, which does not happen when quantizing the other data. To sum up, the E data, especially the error between Conv and BN, demands the highest precision and is the most sensitive component under our WAGEUBN framework.

IV-E Analysis of the Error Quantization between Conv and BN

Error backpropagation is the foundation of DNN training. If the deviation of E caused by quantization is too large, the convergence of DNNs will be degraded. In particular, because the error between convolution and BN is directly related to the weight update of the same layer, the impact of its quantization on the model accuracy is critical.

To further analyze why the 8-bit quantization defined in Equation (15) causes the non-convergence of DNNs, and to compare the error distributions under 8-bit quantization, 8-bit Flag quantization (defined in Equation (16)), and full precision, the data distribution of the error between Conv and BN in the first quantized layer of ResNet18 is shown in Figure 8. From the figure, we can see that the distributions under 8-bit quantization and full precision differ a lot, while those of 8-bit Flag quantization and full precision are almost the same. The major difference between the 8-bit quantization and full precision lies in the interval of small values below the layer-wise scaling factor, where the 8-bit quantization forces the data to zero.

Fig. 8: Data distributions of the error between Conv and BN: (a) 8-bit quantization, (b) 8-bit Flag quantization, (c) full precision; (d) data distribution comparison.

The only difference between the 8-bit and 8-bit Flag quantization functions lies in the data range they cover; theoretically, the 8-bit Flag version covers a much wider range. Because the distribution of the error is not uniform, the fraction of data covered by the two quantization methods varies a lot. The data ratios covered by the 8-bit and 8-bit Flag quantization functions in each layer of ResNet18 are illustrated in Figure 9. Although the larger values have a greater impact on model accuracy in the process of error propagation, the smaller values also contain useful information and make up the majority of the data. Compared with the 8-bit Flag version, the data ratio covered by the plain 8-bit version is too small because of its narrower data range. That is to say, although the most important information contained in the larger values is retained, the information contained in the smaller values is ignored, resulting in the non-convergence of DNNs. In addition, there is a rough trend that the covered data ratio decreases as the layer becomes shallower, for both the 8-bit and 8-bit Flag quantization.

Fig. 9: Data ratios covered by the 8-bit and 8-bit Flag quantization methods in each layer of ResNet18.

IV-F Cost Discussion

Although it is recognized that DNN quantization can greatly reduce memory and compute costs, resulting in lower energy consumption, quantitative analysis is rarely seen in recent research. In order to compare the full INT8 quantization with other precision solutions (FP32, INT32, FP16, INT16, and FP8) more clearly, we have simulated the processing speed, power consumption, and circuit area of single multiplication and accumulation operations on an FPGA platform. Figure 10 shows the results. With FP32 as the baseline, and taking the multiplication operation as an example, INT8 is about 3x faster in speed, 10x lower in power, and 9x smaller in circuit area. Similarly, compared with FP32, INT8 accumulation is about 9x faster, and its energy consumption and circuit area are reduced by about 30x. In addition, the INT8 multiplication and accumulation operations are more advantageous than the other data types, whether FP8, INT16, FP16, or INT32. In conclusion, the proposed full INT8 quantization has great advantages in hardware overhead, in terms of processing speed, power consumption, and circuit area.

Fig. 10: Comparison of time, power, and area of single multiplication and accumulation operations under different quantization precisions: (a) multiplication, (b) accumulation.

V Conclusions

We propose a unified framework termed "WAGEUBN" to achieve a complete quantization of large-scale DNNs in both training and inference with competitive accuracy. We are the first to quantize DNNs over all data paths and to push DNN quantization to the full INT8 level. In this way, all the operations can be replaced with bit-wise operations, bringing significant improvements in memory overhead, processing speed, circuit area, and energy consumption. Extensive experiments evidence the effectiveness and efficiency of WAGEUBN. This work provides a feasible solution for the online training acceleration of large-scale, high-performance DNNs and shows great potential for applications in future efficient portable devices with online learning ability. Future work could turn to the design of computing architectures, memory hierarchies, interconnection infrastructure, and mapping tools to enable specialized machine learning chips.

Acknowledgment

The work was supported partially by the National Natural Science Foundation of China (No. 61603209, 61876215).

References

  • [1] R. Banner, I. Hubara, E. Hoffer, and D. Soudry (2018) Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pp. 5145–5153. Cited by: §I, §II.
  • [2] Y. Choi, M. El-Khamy, and J. Lee (2018) Learning low precision deep neural networks through regularization. arXiv preprint arXiv:1809.00095. Cited by: §IV-A.
  • [3] R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pp. 160–167. Cited by: §I.
  • [4] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §I, §II.
  • [5] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830. Cited by: §II.
  • [6] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, et al. (2018) Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930. Cited by: §I, §II.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §I.
  • [8] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li (2018) GXNOR-net: training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. Neural Networks 100, pp. 49–58. Cited by: §I.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §III-C.
  • [10] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §I.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §I.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §I.
  • [13] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §I.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §I.
  • [15] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin (2018) Extremely low bit neural network: squeeze the last bit out with admm. Cited by: §I.
  • [16] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740. Cited by: §I, §II.
  • [17] N. Qian (1999) On the momentum term in gradient descent learning algorithms. Neural networks 12 (1), pp. 145–151. Cited by: §I.
  • [18] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §I, §II.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §I.
  • [20] C. Sakr and N. Shanbhag (2018) Per-tensor fixed-point quantization of the back-propagation algorithm. arXiv preprint arXiv:1812.11732. Cited by: §I.
  • [21] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §I.
  • [22] W. Sung, S. Shin, and K. Hwang (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488. Cited by: §I.
  • [23] D. Wan, F. Shen, L. Liu, F. Zhu, J. Qin, L. Shao, and H. Tao Shen (2018) TBN: convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 315–332. Cited by: §IV-A.
  • [24] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. In Advances in neural information processing systems, pp. 7675–7684. Cited by: §I, §II.
  • [25] S. Wu, G. Li, F. Chen, and L. Shi (2018) Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680. Cited by: §I, §II, §III-C.
  • [26]