1 Introduction
Deploying DNNs on edge devices, such as Internet-of-Things devices or mobile phones, is challenging because of the limited computing resources, storage, and energy budgets of these devices. For instance, completing a single image recognition with VGG16 simonyan2014very takes 16 seconds on a mobile phone, which is intolerable for most applications mao2017modnn. Therefore, it is essential to compress DNNs to lower their storage requirements and simplify their arithmetic operations.
Among various compression techniques, quantization maps weight values from the infinite space of real numbers to a finite space of discrete numbers. Existing quantization methods, however, encounter several issues. Uniform quantization, such as INT8 jacob2018quantization, uses an affine function to map real-valued weights to uniformly distributed integers. This results in a constant distance between two adjacent quantized numbers, which we denote as the Representing Resolution. The Representing Resolution becomes coarse as the bit-width decreases, which degrades the model accuracy. Non-uniform quantization, such as power-of-two zhou2017incremental, maps the weight values into an exponential space and achieves an extremely large Representing Range of real numbers with low bit-width exponents. However, neither of them can balance the Representing Resolution against the Representing Range at a low bit-width. For example, the Representing Resolution of the non-uniform distribution is too coarse when the exponent is large, which results in a significant quantization error.
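To make this trade-off concrete, the short sketch below (our own illustration, not code from the paper) prints the value grids of a 4-bit uniform quantizer and a 4-bit power-of-two quantizer: the uniform grid has a constant step but a narrow range, while the power-of-two grid spans a wide range with gaps that grow toward its boundary.

```python
import numpy as np

# Illustrative comparison of the representable values at the same bit-width.
bits = 4
levels = 2 ** (bits - 1)            # one bit reserved for the sign

# Uniform grid on [-1, 1]: constant Representing Resolution, narrow Representing Range.
uniform_grid = np.linspace(-1.0, 1.0, 2 * levels)
print("uniform step:", uniform_grid[1] - uniform_grid[0])

# Power-of-two grid: the Representing Range grows exponentially with the exponent,
# but the gap between adjacent levels doubles at every step (coarse resolution).
pow2_grid = np.array([2.0 ** (-e) for e in range(levels)])
pow2_grid = np.sort(np.concatenate([-pow2_grid, pow2_grid]))
print("largest gap between adjacent power-of-two levels:", np.max(np.diff(pow2_grid)))
```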
We also find that some low bit-width quantization methods have poor generality: they cannot be applied to tasks such as object detection or to sequential networks. For example, LSQ esser2019learned only quantizes the weights representing important image features in the convolution layers, so it only works for CNN-based classification tasks.
Most quantization methods presume that a lower bit-width yields a higher speedup or smaller hardware consumption. The introduced decoder, however, may cause significant hardware and energy overhead. Moreover, different accelerator architectures may require different quantization bit-widths. For instance, Digital Signal Processors (DSPs) and CPUs with AVX prefer 4-bit and 8-bit quantization because their fundamental multiply-and-accumulate (MAC) units process data at these widths. It is therefore desirable to build a relationship between the quantization bit-width and the real hardware/energy consumption; such a relationship can guide an efficient quantization for a specific accelerator architecture.
In this work, we propose AUSN, which is not simply an algorithm but a framework comprising the quantization algorithm, a coding scheme, a rounding scheme, and an adaptive adjustment scheme. We follow the principle of minimum accuracy drop at an extremely small bit-width, further salvaging redundant bits to gain more quantization choices. The contributions of this paper are as follows:
-
We combine the two quantization methods and propose a coding scheme that needs no decoder. The bit-width is divided into two parts: the sign and the data. The data part is further divided into a “basic part” and a “subdivision part” for superposition, which determine the Representing Range and the Representing Resolution, respectively. We then use the superposition of multiple non-uniformly quantized numbers with extremely small bit-widths to represent the weights and activations.
-
We explore the root causes of the quantization error in existing quantization methods: the Clipping Error and the Rounding Error. Based on these errors, we propose an adaptive adjustment scheme that reallocates the bit-width according to the weight distribution, reducing the quantization error.
-
We design a rounding scheme to solve the problems of extra computation and hardware overhead caused by bit-width overflow and re-quantization.
-
We verify CNNs, sequential networks, and YOLOv3-tiny quantized by AUSN on various datasets without retraining and achieve results that exceed state-of-the-art accuracy; we also demonstrate the savings in hardware resources and energy consumption brought by AUSN on FPGA (see Appendix B).
-
We measure the effect of quantization on the model through information loss from information theory. We also discuss the trade-off between quantization methods and hardware performance, which can guide the design of accelerators (see Appendix A).
2 Background
2.1 Uniform quantization
Uniform quantization defines a Representing Range and quantizes the weights in that range by first applying an affine function and then rounding to the closest integer. For example, in INT8 quantization jacob2018quantization, all floating-point weights are quantized to 8-bit integers.
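As an illustration, the following sketch implements a generic affine (uniform) quantizer in this style; the scale/zero-point formulation is a common one and not necessarily the exact recipe of jacob2018quantization.

```python
import numpy as np

# Minimal sketch of affine (uniform) quantization: map the weight range onto an
# integer grid, round, and clip values that fall outside the Representing Range.
def uniform_quantize(w, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)          # Representing Resolution
    zero_point = np.round(qmin - w.min() / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax)   # rounding + clipping
    return q.astype(np.int32), scale, zero_point

def uniform_dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

w = np.random.randn(1000).astype(np.float32)
q, s, z = uniform_quantize(w, bits=4)      # at 4 bits the grid becomes coarse
print("max abs error:", np.abs(w - uniform_dequantize(q, s, z)).max())
```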
If the original weights of the DNN exceed the Representing Range, they are quantized to the boundary values; if these clipped weights are important, the DNN model may suffer substantial accuracy loss. Uniform quantization jacob2018quantization; migacz20178 is widely used in both academia and industry because of its high accuracy. For example, many open-source frameworks (e.g., PyTorch, TensorFlow) support INT8 quantization. Although the Representing Range of INT8 is broad enough for most DNN models, uniform quantization suffers a drastic accuracy drop at lower bit-widths: the fewer quantized numbers available to cover the weights, the farther apart they are from each other, which leads to a significant quantization error.
2.2 Non-uniform quantization
Non-uniform quantization methods ye2018unified; mcdanel2019full, similar to clustering, aim at finding “centers” (i.e., average values) that represent the majority of the weight values in the original weight distribution. Therefore, these quantization methods need to store the “centers”, and each quantized number stores the index of a “center” (as in Deep Compression han2016deep), which causes an obvious resource cost.
Power-of-two quantization johnson2018rethinking; mcdanel2019full quantizes weights to the form of exponents of powers of two. The quantized number stores the exponent; for example, a 3-bit quantized value “111” means that the exponent is 7. It has a good Representing Range since the range of quantized weights grows exponentially as the quantization bit-width increases. Besides, the power-of-two scheme is hardware-friendly: the multiplication operation can be converted into a shift operation. For example, multiplying a number by 2^2 can be substituted by shifting its bit sequence two bits to the left. Such conversion is of great significance for DNN hardware accelerators implemented in ASIC or FPGA ding2019req, because a tremendous number of power- and resource-consuming DSPs in the multipliers can be substituted by simple and effective Look-Up Tables (LUTs). Take Xilinx Artix-7 as an example: this FPGA contains hundreds of times more LUT resources than DSP slices (https://china.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf), so LUTs are a promising alternative for the implementation of multiplication. Such conversion can typically bring up to 50% resource reductions on Xilinx FPGAs faraone2019addnet.
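The snippet below illustrates the two points above with our own toy example: multiplying by a power of two equals a bit shift, and power-of-two quantization simply rounds the log2 magnitude of a weight.

```python
import math

# Multiplying by a power of two is equivalent to a left shift, which is why
# power-of-two quantization replaces multipliers with shifters on FPGA/ASIC.
x = 13                      # an integer activation
exponent = 2                # weight quantized to 2**2
assert x * (2 ** exponent) == (x << exponent)   # 13 * 4 == 13 << 2 == 52

# Power-of-two quantization of a real-valued weight: round the log2 magnitude.
def pow2_quantize(w):
    if w == 0.0:
        return 0.0
    e = round(math.log2(abs(w)))        # nearest exponent
    return math.copysign(2.0 ** e, w)

print(pow2_quantize(0.3))   # -> 0.25, i.e. 2**-2
```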
3 Key idea of AUSN quantization
How close the weights in a quantized model can approach their original values inherently determines the quantization error. We define two metrics to evaluate the quantization error.
-
Rounding Error: The error caused by rounding a weight to the nearest quantized value.
-
Clipping Error: The error caused by clipping a weight outside the Representing Range to the extremum value.
Uniform quantization pursues a finer Representing Resolution, which results in a larger Clipping Error under a limited bit-width: the finer the Representing Resolution, the narrower the Representing Range, and vice versa. Thus, the accuracy of the DNN model inevitably decreases as the bit-width is reduced. As shown in Fig. 1(a), the values quantized to 2 bits by uniform quantization share the same Representing Resolution and can only cover a narrow range; a weight outside this range has to be quantized to the boundary, resulting in a significant Clipping Error.
In contrast, non-uniform quantization focuses on the Representing Range, which causes a higher Rounding Error. Quantized numbers with 2 bits using power-of-two quantization can represent values over a wide Representing Range in Fig. 1(a). However, the Representing Resolution becomes coarse-grained when the exponent is large: in this example, a weight falling between two large power-of-two levels has to be rounded to one of them, rendering a significant Rounding Error. Also, power-of-two quantization requires one extra bit to represent the sign of the power.
For a given bit-width, there is always a trade-off between Representing Range and Representing Resolution. Our method is driven by the idea of eliminating these errors. The fundamental objective is to improve power-of-two quantization so that it has a finer Representing Resolution while keeping its good property of a wide Representing Range at low bit-width. As shown in Fig. 2(b), most weights of a CNN are concentrated near zero and only a few have a large absolute value. This phenomenon is commonly known as the “long-tail” effect and is observed in most CNNs. Therefore, it is not worthwhile to spend a large number of bits representing all the weights evenly. This gives us an opportunity to reduce the bit-width of the quantized numbers without accuracy loss.
We leverage the superposition of multiple power-of-two quantized numbers to represent the data, rather than rounding the data to the closest single power of two. As the example in Fig. 1(b) shows, a weight value that falls between two power-of-two levels can be represented by the superposition of two or three quantized numbers, which eliminates most of the rounding error.
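The following toy sketch (our own example, not the paper's numbers) shows how the approximation error shrinks as more power-of-two terms are superposed.

```python
import math

def superpose(w, max_terms):
    """Greedily pick up to `max_terms` power-of-two terms whose sum approximates |w|."""
    terms, residual = [], abs(w)
    for _ in range(max_terms):
        if residual <= 0:
            break
        e = math.floor(math.log2(residual))
        terms.append(e)
        residual -= 2.0 ** e
    approx = math.copysign(sum(2.0 ** e for e in terms), w)
    return terms, approx

w = 0.71
for k in (1, 2, 3):
    exps, approx = superpose(w, k)
    print(k, "terms:", exps, "approx =", approx, "error =", abs(w - approx))
# 1 term  -> 2**-1                     = 0.5     (error 0.21)
# 2 terms -> 2**-1 + 2**-3             = 0.625   (error 0.085)
# 3 terms -> 2**-1 + 2**-3 + 2**-4     = 0.6875  (error 0.0225)
```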
However, the weight distribution differs across the layers of the model wang2019haq. Thus, it is critical to choose the proper number of superpositions to make the quantized value “close” to the original value. More importantly, our method AUSN (as shown in Fig. 3) can adaptively allocate the composition of the bit-width according to the weight distribution of each layer. To combine the high average accuracy and finer Representing Resolution of uniform quantization with the broad Representing Range and hardware acceleration of the power-of-two method, we subdivide the limited bit-width and find the balance between Clipping Error and Rounding Error.
4 AUSN Quantization
4.1 Decoder-free coding scheme for superposition
We divide the bit-width into three parts: the “sign”, the “basic part”, and the “subdivision part” (the following discussion focuses on the data part, i.e., the unsigned number). In Fig. 2(a), taking 6 bits as an example, the bit-width of general uniform quantization consists of one sign bit and five data bits, while the bit-width of power-of-two quantization is composed of two sign bits and four data bits. AUSN separates the original five data bits into a three-bit “basic” part and a two-bit “subdivision” part. By sharing the same sign bit of the value, the superposition of the “basic” part and the “subdivision” part saves one bit. Given the total bit-width, the proposed AUSN coding scheme allocates the bit-width to the “basic” and “subdivision” parts to ensure a good trade-off between the resolution and the range of the weight distribution, according to the adaptive adjustment described later.
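A hypothetical packing of such a 6-bit code word is sketched below; the field layout follows our reading of Fig. 2(a) and is not a specification of the AUSN bit format.

```python
# Pack/unpack a 6-bit AUSN-style code word: 1 shared sign bit, a "basic" field
# holding the leading exponent, and a "subdivision" field that refines it.
def pack(sign, basic, subdiv, basic_bits=3, subdiv_bits=2):
    assert sign in (0, 1)
    assert 0 <= basic < (1 << basic_bits)
    assert 0 <= subdiv < (1 << subdiv_bits)
    return (sign << (basic_bits + subdiv_bits)) | (basic << subdiv_bits) | subdiv

def unpack(code, basic_bits=3, subdiv_bits=2):
    subdiv = code & ((1 << subdiv_bits) - 1)
    basic = (code >> subdiv_bits) & ((1 << basic_bits) - 1)
    sign = code >> (basic_bits + subdiv_bits)
    return sign, basic, subdiv

code = pack(sign=1, basic=0b101, subdiv=0b10)   # 6-bit word: 1 | 101 | 10
print(bin(code), unpack(code))                  # 0b110110 (1, 5, 2)
```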
4.2 Quantization scheme with superposition
Note that the value stored in the quantized bits is not the quantized number itself, but its power. If a given number of bits is used for the basic part, we have an intuitive power basis whose powers are all negative; thus, we can store them as unsigned numbers and save one “sign” bit. In practice, however, the range of this basis may mismatch that of the weights. We therefore introduce a scaling operation called “PreConvert” to bring the Representing Range of the quantized model closer to the weight distribution of the original model and reduce the Clipping Error. Since the values stored in the bit-width are powers, the scaling operation on the original values is converted into a shift operation on the power.
The specific process of AUSN quantization is performed as follows.
First, we calculate the power corresponding to the largest weight of the layer:
(1)
Then, we perform the “PreConvert” operation on this power:
(2)
We denote the power of the basic part after the “PreConvert” operation accordingly; it satisfies Equation (3), which means we use it to approximate the range of the weights.
(3)
Furthermore, each superposition term has its own bit allocation and a corresponding superposition power. Then, we use Algorithm 1 to derive the superposition for each quantized number.
Input: weight, maximum number of superpositions, power basis of each superposition term
Output: the quantized number of the weight
In the procedure of AUSN quantization with superposition (as shown in Fig. 2(b)), an approximate representation of the weight can be expressed as
(4)
Our AUSN algorithm can naturally round the weight to the quantized number and decide whether to superpose. We can also apply Algorithm 1 to the activations.
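Since Algorithm 1 itself is not reproduced in the text, the sketch below is our assumed reading of the overall flow of Section 4.2: scale the layer so that its largest magnitude aligns with the power basis (“PreConvert”), then greedily superpose power-of-two terms for each weight.

```python
import numpy as np

# Assumed AUSN-style weight quantization flow (a sketch, not the paper's code).
def ausn_quantize_layer(w, basic_bits=3, max_terms=2):
    w = np.asarray(w, dtype=np.float64)
    # PreConvert: a shift of the power so the largest |w| maps near 2**0.
    p_max = int(np.floor(np.log2(np.abs(w).max())))
    scale = 2.0 ** p_max
    w_scaled = w / scale
    n_levels = 2 ** basic_bits                 # exponents 0 .. -(n_levels - 1)

    def quantize_one(x):
        sign, residual, approx = np.sign(x), abs(x), 0.0
        for _ in range(max_terms):
            if residual < 2.0 ** (-(n_levels - 1)):   # below the finest level
                break
            e = min(0, int(np.floor(np.log2(residual))))
            approx += 2.0 ** e
            residual -= 2.0 ** e
        return sign * approx

    return np.vectorize(quantize_one)(w_scaled) * scale   # undo PreConvert for reference

w = np.random.randn(8) * 0.1
print(np.round(w, 4))
print(np.round(ausn_quantize_layer(w), 4))
```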
4.3 Adaptively adjusting the coding scheme
Besides rounding the weights to quantized numbers and deciding whether to superpose, AUSN can adaptively adjust the internal allocation of the bit-width between the basic part and the subdivision part according to the Clipping Error and the Rounding Error. For example, under 5-bit AUSN quantization, the weight distribution of the CONV1 layer may call for 4 bits for the basic part and 1 bit for the subdivision part, while the CONV2 layer uses 3 bits for the basic part and 2 bits for the subdivision part. Suppose that the initial distribution of the weights in a layer is normal; without loss of generality, we assume the weights follow a standard normal distribution clipped at a threshold. Since the quantized values are usually distributed symmetrically about zero, we only consider the positive part. Given the Representing Range and the set of quantized numbers of AUSN quantization, we can use the following equation to express the Clipping Error and the Rounding Error:
(5)
First, AUSN initializes the coding scheme according to the given bit-width and quantizes the weights. Then, the Clipping Error and the Rounding Error between the quantized weights and the original weights are calculated according to Equation (5). Last, based on these two errors, AUSN adjusts the boundary of the Representing Range and the set of quantized values (that is, the allocation of the bit-width). This process is iterated until an optimal allocation of the bit-width is found.
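A minimal sketch of this adjustment loop is given below; the empirical clipping/rounding errors stand in for Equation (5), whose exact form is not reproduced here, and the grid construction is a simplified reading of the basic/subdivision coding.

```python
import numpy as np

def errors(w, grid):
    """Empirical Clipping Error and Rounding Error of a symmetric quantization grid."""
    w = np.abs(w)                                   # the grid is symmetric about 0
    hi = grid.max()
    clipped = w > hi
    e_clip = float(np.mean((w[clipped] - hi) ** 2)) if clipped.any() else 0.0
    inside = w[~clipped]
    if inside.size:
        nearest = grid[np.abs(inside[:, None] - grid[None, :]).argmin(axis=1)]
        e_round = float(np.mean((inside - nearest) ** 2))
    else:
        e_round = 0.0
    return e_clip, e_round

def ausn_grid(basic_bits, subdiv_bits):
    """Positive values representable by one basic term plus one optional subdivision term."""
    basic = 2.0 ** -np.arange(2 ** basic_bits)
    subdiv = np.concatenate([[0.0], 2.0 ** -np.arange(1, 2 ** subdiv_bits + 1)])
    return np.unique(basic[:, None] + subdiv[None, :])

w = np.random.randn(10000) * 0.3
total_bits = 5                                      # data bits, excluding the sign
candidates = []
for b in range(1, total_bits):                      # b basic bits, the rest for subdivision
    e_clip, e_round = errors(w, ausn_grid(b, total_bits - b))
    candidates.append((e_clip + e_round, b, e_clip, e_round))
best = min(candidates)
print("chosen basic bits:", best[1], " clip error:", best[2], " round error:", best[3])
```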
4.4 AUSN rounding scheme
Although AUSN quantization effectively solves the accuracy problem of power-of-two quantization, two critical issues remain for almost all quantization methods:
-
Overflow of Bit-width: The output of the convolution operation (multiply and accumulate) occupies a larger bit-width than the quantized numbers in order to maintain precision. For example, the convolution of two matrices consisting of 8-bit numbers first results in four 16-bit products, which are then accumulated into a 32-bit sum.
-
Overhead of Re-quantization: In deep neural networks, the output of one layer is the input of the next. The outputs are likely to exceed the range of the quantized numbers accepted by the next layer. Thus, existing methods have to re-quantize the output values, and such re-quantization incurs significant computation and resource costs.
We formulate the above problems as follows. Suppose the next layer only accepts an activation represented by a superposition of two power-of-two terms due to the bit-width limit; the multiplication in AUSN, however, increases the number of superposed terms (e.g., the product of two two-term values has up to four terms). Thus, we have to eliminate some power-of-two terms and round the activation value up or down with a minimized rounding error. Consequently, after the multiplication operation, we can substitute the accumulation operation with the rounding scheme, which eliminates the overflow and re-quantization issues.
The basic principle of the rounding operation is simple: minimize the rounding error. Given data in the form of a polynomial of powers of two, we need to determine the powers to keep after rounding, i.e., merge or drop terms so that the result remains a valid superposition while staying as close as possible to the original value.
The rounding scheme can be categorized into the following four scenarios:
Scenario 1: , wherein the bit-width of the subdivision part indicates the number of superpositions.
Scenario 2: When the condition is satisfied, it is obvious that .
Scenario 3: If , then .
Scenario 4: If , then or .
Based on the above scenarios, we round the data using the following steps.
-
Find the maximal set of terms in the data to which Scenario 1 applies.
-
Find pairs of terms to which Scenario 2 applies.
-
Apply Scenario 3 if the condition is met.
-
If Equation (6) holds, carry out Scenario 4.
(6)
Following the above steps, we can reduce the number of power-of-two terms to satisfy the bit-width constraint while minimizing the rounding error. Take a datum whose target is a single superposition of powers of two as an example. Step 1: merge the terms covered by Scenario 1. Step 2: merge the pairs of terms covered by Scenario 2. Then, if the condition of Step 3 is not met, Step 4 is performed and the remaining terms are merged. Finally, the result satisfies the requirement of quantization.
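Because the four scenarios are only partially specified above, the greedy reducer below is an assumption that captures the spirit of the rounding scheme: merge power-of-two terms exactly where possible, then drop the smallest terms until the superposition limit is met.

```python
from collections import Counter

def reduce_terms(exponents, max_terms):
    """exponents: exponents of the power-of-two terms whose sum is the value to round."""
    counts = Counter(exponents)
    changed = True
    while changed:                                   # exact merges: 2**e + 2**e == 2**(e+1)
        changed = False
        for e in sorted(counts):
            if counts[e] >= 2:
                counts[e] -= 2
                counts[e + 1] += 1
                changed = True
    terms = sorted(counts.elements(), reverse=True)
    while len(terms) > max_terms:                    # lossy step on the two smallest terms
        a, b = terms.pop(), terms.pop()              # a <= b (a is the smallest exponent)
        terms.append(b + 1 if a == b else b)         # equal: exact merge; else drop the smaller
        terms.sort(reverse=True)
    return terms

print(reduce_terms([-1, -3, -3, -5], max_terms=2))
# [-1, -2]: the two 2**-3 terms merge exactly into 2**-2, then 2**-5 is dropped,
# so 0.78125 is rounded to 0.75 within the 2-term budget.
```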
5 Experiments
To evaluate the performance of our algorithm, we quantize several models and verify their accuracy on different datasets. Using Cifar10 krizhevsky2009learning and ImageNet deng2009imagenet, we compare AUSN with several state-of-the-art quantization methods on CNNs. We implement AUSN quantization using the CNN model structures and pre-trained models of the Pytorch library (https://github.com/pytorch/vision/tree/master/torchvision/models). In addition, we verify sequential networks and a network used for the object detection task on the Google Speech Commands, VoxCeleb, and COCO datasets, and compare the results with the full-precision baseline models to demonstrate the broad applicability of our approach. Remarkably, we quantize the pre-trained models without fine-tuning or retraining.
5.1 Result of quantized networks on Cifar10
From the data in Table 1, we can see that both the self-adapting shifting and the better resolution brought by the subdivision part improve our results. For the 2-bit results, both schemes quantize weights into powers of two, but AUSN finds a better covering range and thus achieves a better result. For the 3-, 4-, and 5-bit quantization, our AUSN method keeps the basic part at 3 bits and sets the subdivision part to 0, 1, and 2 bits, respectively. Compared with INQ, the accuracy gain of AUSN for each added bit is much larger. This means that once the Representing Range is guaranteed (by the bit-width of the basic part in AUSN or the whole bit-width in INQ), it is the resolution (the bit-width of the subdivision part in AUSN) that helps the accuracy grow.
In particular, we also quantize the pruned ResNet-18 model to show the harmonious cooperation between AUSN and pruning.
Table 1: Top-1 accuracy (%) on Cifar10 for AUSN (Ours) and power-of-two (INQ zhou2017incremental) quantization.
| Model | Full precision | Ours 5bit | Ours 4bit | Ours 3bit | Ours 2bit | INQ 5bit | INQ 4bit | INQ 3bit | INQ 2bit |
| AlexNet | 85.13 | 85.21 | 85.00 | 84.80 | 83.21 | 81.90 | 81.89 | 80.29 | 80.27 |
| GoogleNet | 90.89 | 90.85 | 90.01 | 89.37 | 89.11 | 88.28 | 88.28 | 88.27 | 88.21 |
| VGG16 | 93.92 | 93.95 | 93.58 | 93.68 | 92.75 | 91.88 | 91.84 | 91.27 | 91.26 |
| ResNet18 | 92.22 | 92.25 | 91.83 | 91.80 | 91.46 | 90.47 | 90.31 | 90.14 | 90.10 |
| ResNet18 (pruned) | 94.17 | 94.11 | 94.09 | 93.52 | 92.65 | 92.87 | 92.85 | 92.83 | 88.15 |
5.2 Result of quantized networks on ImageNet
In the experiments on ImageNet, we choose the most widely used models, ResNet18 and ResNet50, and compare our results with other quantization schemes. Table 3 shows that the Top-1 accuracy loss of our 5-bit AUSN is less than 0.1%, far better than that of other schemes.
Method | SR+DR | INT | RQ ST | PACT | QIL | Ours |
Bit-width | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
Acc. loss | -10.5 | -2.5 | -1.6 | -0.3 | -0.8 | -0.3 |
Meanwhile, we quantize the weights of the pre-trained models without fine-tuning and obtain the results in Table 3, which means the time spent on quantization is significantly reduced. In Table 2, we also quantize both the weights and activations to optimize the inference process. With the advantages mentioned above, our AUSN method reaches state-of-the-art results among existing quantization schemes.
5.3 Result of quantizing sequential models and Yolo
We explore the effect of AUSN quantization on sequential networks such as GRU, CRNN zhang2017hello, and TDNN snyder2018x using the Google Speech Commands and VoxCeleb datasets, and on the object detection task with YOLOv3-tiny on the COCO dataset, as shown in Table 3. There is little decrease in accuracy (i.e., mAP for YOLOv3-tiny on COCO and Top-1 accuracy for the sequential models) when quantizing weights from 32 bits to 4 or 5 bits.
ImageNet Dataset | |||||||
Method | bit-width | Top-1(%) | Drop in Top-1(%) | Method | bit-width | Top-1(%) | Drop in Top-1(%) |
ResNet50* (baseline: 76.13%) | ResNet18* (baseline: 69.76%) | ||||||
Dual choukroun2019low | 4bit | 70.20 | -5.93 | BWN rastegari2016xnor | 2bit | 60.8 | -8.5 |
INQ zhou2017incremental | 5bit | 74.81 | -1.59 | Dual choukroun2019low | 4bit | 66.63 | -3.13 |
Focused compression zhao2019focused | 5bit | 74.86 | -1.54 | LAPQ nahshan2019loss | 4bit | 62.6 | -7.16 |
ADMM Quantization ye2019progressive | 6bit | 75.93 | -0.2 | Focused compression zhao2019focused | 5bit | 68.36 | -1.40 |
SYMM nayak2019bit | 6bit | 72.58 | -3.55 | UNIQ baskin2018uniq | 5bit | 68.00 | -1.76 |
Biscaled-FxP jain2019biscaled | 6bit | 70.46 | -5.67 | DFQ nagel2019data | 6bit | 66.3 | -3.4 |
V-Q park2018value | 7bit | 75.89 | -0.24 | INT8 jacob2018quantization | 8bit | 67.3 | -2.4 |
INT8 jacob2018quantization | 8bit | 74.9 | -1.5 | RQ louizos2018relaxed | 6bit | 68.6 | -1.16 |
Ours | 4bit | 75.37 | -0.76 | Ours | 4bit | 68.84 | -0.92 |
Ours | 5bit | 76.09 | -0.04 | Ours | 5bit | 69.67 | -0.09 |
Speech Command Dataset | |||||||
Method | bit-width | Top-1(%) | Drop in Top-1(%) | Method | bit-width | Top-1(%) | Drop in Top-1(%) |
GRU (Network cfg : S = 40, N = 154) | GRU (Network cfg : S = 20, N = 400) | ||||||
baseline | 32bit | 93.62 | - | baseline | 32bit | 94.62 | - |
Ours | 4bit | 93.15 | -0.47 | Ours | 4bit | 94.34 | -0.28 |
Ours | 5bit | 93.47 | -0.15 | Ours | 5bit | 94.57 | -0.05 |
Speech Command Dataset | |||||||
Method | bit-width | Top-1(%) | Drop in Top-1(%) | Method | bit-width | Top-1(%) | Drop in Top-1(%) |
CRNN | CRNN | ||||||
baseline | 32bit | 93.50 | - | baseline | 32bit | 94.56 | - |
Ours | 4bit | 93.17 | -0.33 | Ours | 4bit | 94.28 | -0.28 |
Ours | 5bit | 93.38 | -0.12 | Ours | 5bit | 94.48 | -0.08 |
VoxCeleb Dataset | COCO datasets | ||||||
Method | bit-width | Top-1(%) | Drop in Top-1(%) | Method | bit-width | mAP | Drop in Acc. |
TDNN | YOLOv3tiny | ||||||
baseline | 32bit | 80.38 | - | baseline | 32bit | 33.1 | - |
Ours | 4bit | 79.98 | -0.40 | Ours | 4bit | 31.6 | -1.5 |
Ours | 5bit | 80.25 | -0.13 | Ours | 5bit | 32.2 | -0.9 |
-
S: Frame Stride, N: Cells of GRU
-
The Network cfg of the first CRNN is: C(48,10,4,2,2)-N(60)-N(60)-F(84); the Network cfg of the second CRNN is: C(100,10,4,2,1)-N(136)-N(136)-F(188), where C: Shape of Convolutional layer, N: Cells of GRU, F: Cells of fully connected layer
5.4 Result of quantized networks deployed on FPGA
Table 4: FPGA resource and energy consumption; each implementation is reported for three input/weight bit-width configurations (A/B, see the note below), and the last column is the saving factor of Ours.
| Resources | Multiplier | | | Shifter & Adder (need decoder) | | | Shifter & Adder (Ours) | | | Performance (×) |
| LUT | 212388 | 187262 | 181248 | 225280 | 212942 | 203712 | 133120 | 112071 | 108544 | 2.0 |
| FF | 192293 | 143142 | 108729 | 86317 | 512731 | 45729 | 54313 | 45127 | 44032 | 4.4 |
| Energy Consumption | 4.21W | 3.75W | 3.67W | 4.51W | 4.26W | 4.07W | 2.65W | 2.24W | 2.17W | 1.9 |
-
A/B refers to A-bit input multiplied by B-bit weights
AUSN does not need a decoder: a decoder is necessary only when control information is encoded into the bit-word. AUSN directly “multiplies” the input with the exponents composing a weight in parallel via shift operations, and the results are added up (superposition) to form the output. AUSN is not only as easy to compute as uniform quantization (like INT8) but also hardware-friendly, because the expensive multiplication operation is replaced by simple bit shift and add operations. Compared with the traditional 8-bit multiply-accumulator, the shifter can be realized with logic units such as LUTs alone. As shown in Table 4, the model quantized by AUSN not only requires no DSPs but also saves 2.0× LUTs, 4.4× FFs, and 1.9× power, as measured with the Xilinx Vivado HLS suite on the Xilinx ZCU104 evaluation platform, which dramatically reduces the hardware cost.
6 Conclusion
In this paper, we introduced a novel quantization method called AUSN, which combines the broad Representing Range of power-of-two quantization with the constant Representing Resolution of uniform quantization. The method quantizes both the activations and weights of a pre-trained model to low bit-widths (less than 8 bits) without accuracy loss. AUSN shows excellent generality because it adaptively adjusts the number of superpositions and the coding scheme for a given fixed bit-width. We further design a rounding scheme in AUSN to eliminate the overhead of re-quantization. Thorough experiments show the superiority of AUSN quantization over state-of-the-art methods. Unlike other quantization methods, AUSN quantizes the pre-trained model without retraining. Notably, AUSN quantization is hardware-friendly: synthesis results on FPGA (see Appendix B for details) prove that AUSN effectively reduces resource and energy consumption. We also measure the effectiveness of quantization methods by information loss from a theoretical perspective and analyze the relationship between quantization methods and hardware performance (see Appendix A).
Broader Impact
The high accuracy of DNNs is achieved by consuming substantial computing and storage resources, which significantly hinders the application of DNNs on edge devices. Quantization is proposed and widely used to reduce storage and computation. AUSN quantization lowers the precision demands of operations and operands, so fewer bits represent them, which reduces the cost of the datapath and memory. Furthermore, it reduces the energy consumption and hardware resources and expands the accelerator design space of many terminal devices with limited hardware resources. It is worth noting that AUSN quantization is a general quantization method that can be used in many tasks, such as image recognition, speech recognition, and object detection.
Appendix A Theoretical analysis
We explore the choice of quantization in the context of information theory, which enables us to reinterpret quantization as a Minimum Description Length (MDL) problem hinton1993keeping; ullrich2017soft. Both MDL and quantization can be regarded as a search problem, aiming at finding the optimal way to compress data while balancing the accuracy and the complexity (including bit-width, quantization method, etc.) of the model. We define a loss function to measure the loss of information between the distribution of the original weights and the distribution of the quantized weights, consisting mainly of an error term and a complexity term. Specifically:
(7)
where the first term indicates the error caused by the misfit between the original model and the quantized model on the dataset they are trained on; since the dataset consists of inputs and expected outputs, this term can be expressed as the probability of achieving the correct output given the input and the model. The second term indicates the complexity of the quantization method. According to information theory, the error term measures the average number of extra bits required to encode one distribution with the other, i.e., the distance between them: treating the quantized distribution as an approximation of the original one, it is the KL divergence, which measures the similarity between the two distributions. For the complexity term, we first divide the original weight distribution according to the set of points of the quantized distribution, and then calculate the probability of the differences between the original and the quantized distributions:
(8)
where the first quantity is the distribution of the weights over the real numbers, the second is the approximate count of weights at each quantized point, and the third is the set of quantized numbers.
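As an empirical stand-in for Equations (7)-(8), whose exact forms are not reproduced above, one can estimate the information loss as a histogram-based KL divergence between the original and quantized weights, as sketched below.

```python
import numpy as np

# Histogram-based KL divergence between original and quantized weights; this is our
# approximation of the information-loss idea, not the paper's exact formula.
def information_loss(w, w_q, bins=256, eps=1e-12):
    lo, hi = min(w.min(), w_q.min()), max(w.max(), w_q.max())
    p, edges = np.histogram(w, bins=bins, range=(lo, hi))
    q, _ = np.histogram(w_q, bins=edges)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log2(p / q)))        # extra bits to encode p using q

w = np.random.randn(100000) * 0.1
w_q = np.sign(w) * 2.0 ** np.round(np.log2(np.abs(w) + 1e-12))   # power-of-two baseline
print("KL(original || quantized) ≈", round(information_loss(w, w_q), 3), "bits")
```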
Table 5: Information loss of AUSN (Ours) and INQ at different bit-widths; the last column is the total information loss of Equation (7).
| Bit-width | Quant | Error | Complexity | Information loss |
| 5bit | Ours | 0.014 | -0.001 | 0.013 |
| 5bit | INQ | 0.004 | 3.23 | 3.234 |
| 4bit | Ours | 0.062 | 0.13 | 0.192 |
| 4bit | INQ | 0.004 | 3.24 | 3.244 |
| 3bit | Ours | 0.261 | 0.15 | 0.411 |
| 3bit | INQ | 0.004 | 4.84 | 4.844 |
The smaller the information loss, the smaller the accuracy loss induced by quantization; in Table 5, the quantization with the lowest total information loss achieves the best effect.
Appendix B Efficiency Analysis
B.1 Analysis on the Evaluation
Most existing quantization methods focus on reducing the bit-width and improving accuracy. However, these approaches ignore the limitations of the hardware. As the bit-width of the quantized model decreases, the reductions in resource and energy consumption manifest themselves differently in various accelerator architectures and arithmetic logics. Fig. 4 describes the evaluation method of the proposed AUSN quantization. Generally speaking, multiplication is more expensive than addition and shift operations in terms of digital-circuit complexity, the number of transistors, and power consumption.
For example, on FPGA, the shift operation implemented with Look-Up Tables (LUTs) is considerably more efficient in energy consumption (https://china.xilinx.com/support/documentation/sw_manuals/xilinx14_7/ug440-xilinx-power-estimator.pdf).
Power-of-two quantization transforms the multiplication operation into a shift operation, making the quantized network hardware-friendly and balancing the cost of hardware resources. For example, when a quantized DNN (the weights quantized to powers of two, the inputs quantized to fixed-point numbers) is deployed on an FPGA, the multiplication operation can be converted into a shift operation on the fixed-point number. Although the shift operation can alleviate the shortage of DSP resources to a certain extent, the accumulation operation still depends on DSPs and requires additional hardware overhead for the re-quantization. LUT resources may also become a new bottleneck because of the bit-width of the inputs. Moreover, due to the significant rounding error caused by power-of-two quantization, the accuracy inevitably decreases, so power-of-two quantization cannot be pushed further.
To further reduce the bit-width and the computing resources, both the weights and activations can be quantized by power-of-two quantization, which further converts the shift operation into an addition operation. However, this leads to a more significant accuracy decrease. AUSN quantization, in contrast, not only reduces the accuracy loss but also makes reasonable use of LUT resources.
We quantize the weights and activations using the proposed AUSN quantization and transform the re-quantization into AUSN rounding. The intermediate values (i.e., the partial sums generated in the calculation of convolutional layers) stored in the Block RAMs (BRAMs) can then be eliminated and handled by logic gates, while the final accuracy is preserved and the LUT resources are fully used. In theory, we can even deploy a DNN quantized by AUSN on an FPGA without DSP resources, or design a reasonable accelerator that realizes parallel computation of the convolution layers with both LUTs and DSPs.
B.2 Analysis on LUT Resource
The LUT resources consumed by AUSN quantization are far fewer than those consumed by the shift operation on fixed-point numbers as the bit-width of the inputs is further reduced. The specific analysis is as follows (taking 6 bits for both weights and activations as an example).
To make good use of the resources, we implement the addition and shift operations with 6-input LUTs (available in modern Xilinx FPGAs, such as the whole 6 and 7 series). If the inputs and weights are quantized to fixed-point numbers and powers of two, respectively, then the multiplication of an input and a weight is implemented by a shift operation, and the result of the multiplication needs 12 bits (6 bits × 6 bits → 12 bits).
In contrast, we quantize both the inputs and weights by AUSN quantization, which converts the shift operation on the values into an addition operation on the powers. We divide the addition into three steps, each addition consuming 4 LUTs. As a result, the output only needs 7 bits (6 bits + 6 bits → 7 bits). Fig. 5 describes the consumption of LUT resources.
We find that the number of 6-input LUTs consumed by the 6-bit multiplication implemented as a shift operation on the value (Fig. 5(a)) is much larger than the number consumed by the addition operation on the powers (Fig. 5(b)). AUSN quantization therefore consumes fewer LUT resources than the shift operation on fixed-point numbers. Remarkably, there is almost no accuracy loss in the model quantized by AUSN.
B.3 Analysis on the implementation
We refer to the classic performance evaluation model, the RoofLine model williams2009roofline, to analyze the quantization methods and the performance gains of the quantized model on a given machine. It can be concluded that quantization reduces the precision of operations and operands, so fewer bits represent these operands and the corresponding operations, which significantly increases the Computation-to-Communication Ratio (CCR) defined in Equation (9).
(9)
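The body of Equation (9) is not reproduced above; a common form of the computation-to-communication ratio, which we assume the paper's definition follows, relates the number of operations to the amount of data moved, so lowering the operand bit-width directly raises the CCR:

$$
\mathrm{CCR} \;=\; \frac{\text{total number of operations}}{\text{total amount of data moved between memory and compute}}
\;\propto\; \frac{\#\text{MACs}}{\text{operand bit-width} \times \#\text{operands}}
$$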
The RoofLine model gives the upper bound of the theoretical performance that a model and algorithm can achieve under the limits of the computing power and bandwidth of the accelerator. In Fig. 6, the x-axis represents the CCR, or operational intensity: the higher it is, the higher the memory efficiency. The y-axis represents the computing performance of the accelerator. The bandwidth of the machine limits the area to the left of the boundary, and the computing power of the machine limits the area to the right. The upper limits of bandwidth and computation, determined by the accelerator, are called the bandwidth roof and the computational roof. The optimization space provided by the machine is the area to the right of the bandwidth roof; if a point lies above the bandwidth roof or the computational roof, the actual performance can only be its projection onto that roof.
As the model is quantized, the CCR becomes larger and the position of the model moves to the right, changing from being limited by the bandwidth of the accelerator to being limited by its computing power (the transfer from one point to the other in Fig. 6). Even if the bit-width of the weights is further reduced, the performance cannot be further improved due to the limitation of the computational roof. For FPGA, the computing cores are DSPs. The model quantized by AUSN converts the multiplication operation into shift or even addition operations, which can be realized by LUTs. This is equivalent to adding a computing core to the FPGA, increasing the theoretical peak computation of the FPGA (the computational roof moves up in Fig. 6) and providing more design space for further optimization.
Compared with AUSN, other quantization methods with a low bit-width (such as less than 5 bits) can also achieve a high compression ratio, but they may need an additional decoding process. Decoding consumes a lot of hardware resources and energy, which is hardware-unfriendly. As shown in Table 4, the 4-bit shift operation with indexes requires additional decoders to be deployed on the hardware, resulting in logic resources and power that even exceed the overhead of 8-bit multiplication. Therefore, lowering the bit-width does not necessarily increase the actual benefits.