AUSN: Approximately Uniform Quantization by Adaptively Superimposing Non-uniform Distribution for Deep Neural Networks

07/08/2020 ∙ by Liu Fangxin, et al. ∙ Shanghai Jiao Tong University

Quantization is essential to simplify DNN inference in edge applications. Existing uniform and non-uniform quantization methods, however, exhibit an inherent conflict between the representing range and the representing resolution, and thereby result in either underutilized bit-width or a significant accuracy drop. Moreover, these methods encounter three drawbacks: i) the absence of a quantitative metric for in-depth analysis of the source of the quantization errors; ii) the limited focus on image classification tasks based on CNNs; iii) the unawareness of the real hardware and energy consumption reduced by lowering the bit-width. In this paper, we first define two quantitative metrics, i.e., the Clipping Error and the Rounding Error, to analyze the quantization error distribution. We observe that the clipping and rounding errors vary significantly across layers, models, and tasks. Consequently, we propose a novel quantization method to quantize the weights and activations. The key idea is to Approximate the Uniform quantization by Adaptively Superposing multiple Non-uniform quantized values, namely AUSN. AUSN consists of a decoder-free coding scheme that exploits the bit-width to its extreme, a superposition quantization algorithm that can adapt the coding scheme to different DNN layers, models, and tasks without extra hardware design effort, and a rounding scheme that eliminates the well-known bit-width overflow and re-quantization issues. Theoretical analysis (see Appendix A) and accuracy evaluation on various DNN models of different tasks show the effectiveness and generality of AUSN. The synthesis results on FPGA (see Appendix B) show a 2× reduction in energy consumption and a 2× to 4× reduction in hardware resources.


1 Introduction

Deploying DNNs on edge devices, such as Internet-of-Things devices or mobile phones, is challenging on account of the limited computing and storage resources and energy budget provided by these devices. For instance, it takes 16 seconds on a mobile phone to complete an image recognition task using VGG16 simonyan2014very, which is intolerable for most applications mao2017modnn. Therefore, it is essential to compress the DNN to lower its storage requirements and simplify its arithmetic operations.

Among various compression techniques, quantization maps the weight values distributed in the infinite space of real numbers to a finite space of discrete numbers. Existing quantization methods, however, encounter several issues. Uniform quantization, such as INT8 jacob2018quantization, uses an affine function to map real-valued weights to uniformly distributed integers. This method results in a constant distance between two adjacent quantized numbers, which we denote as the Representing Resolution. The representing resolution becomes coarse as the bit-width decreases, which degrades the model accuracy. Non-uniform quantization, such as power-of-two zhou2017incremental, maps the weight values to an exponential space. Non-uniform quantization covers an extremely large Representing Range of real numbers with low bit-width exponents. However, neither method can balance Representing Resolution against Representing Range at a low bit-width. For example, the Representing Resolution of the non-uniform distribution is too coarse when the exponent is large, which results in a significant quantization error.

Moreover, we find that some low bit-width quantization methods have poor generality: they cannot be applied to tasks such as object detection or to sequential networks. For example, LSQ esser2019learned only quantizes the weights representing important image features in the convolutional layers, so it only works for CNN-based classification tasks.

Most quantization methods presume that a lower bit-width automatically brings a higher speedup or lower hardware consumption. The decoder they introduce, however, may cause significant hardware and energy overhead. Moreover, different accelerator architectures may need different quantization bit-widths. For instance, Digital Signal Processors (DSPs) and CPUs with AVX prefer 4-bit and 8-bit quantization because of the data widths their fundamental multiply-and-accumulate (MAC) units process. It is therefore desirable to establish a relationship between the quantization bit-width and the real hardware/energy consumption; such a relationship can guide efficient quantization for a specific accelerator architecture.

In this work, what we propose is not simply an algorithm but a framework, AUSN, which includes the quantization algorithm, the coding scheme, the rounding scheme, and an adaptive adjustment scheme. Following the principle of minimum accuracy drop at extremely small bit-width, we further salvage redundant bit-width to gain more quantization choices. The contributions of this paper are as follows:

  • We combine the two quantization paradigms and propose a coding scheme without a decoder. The bit-width is divided into two parts, the sign and the data. The data part is further divided into a “basic part” and a “subdivision part” for the superposition, which determine the Representing Range and the Representing Resolution, respectively. We then use the superposition of multiple non-uniformly quantized values with extremely small bit-width to represent the weights and activations.

  • We explore the in-depth sources of the quantization error in existing quantization methods: the Clipping Error and the Rounding Error. Based on these errors, we propose an adaptive adjustment scheme that reallocates the bit-width according to the weight distribution, reducing the quantization error.

  • We design a rounding scheme that removes the extra computation and hardware overhead caused by bit-width overflow and re-quantization.

  • We apply AUSN to CNNs, sequential networks, and YOLOv3-tiny on various datasets without retraining and achieve accuracy that exceeds the state of the art, and we demonstrate the savings brought by AUSN in hardware resources and energy consumption on FPGA (see Appendix B).

  • We measure the effect of quantization on the model through information loss, from an information-theoretic perspective. We also discuss the trade-off between quantization methods and hardware performance, which can guide the design of accelerators (see Appendix A).

2 Background

2.1 Uniform quantization

Uniform quantization defines a Representing Range and quantizes the weights within that range by first applying an affine function to them and then rounding to the closest integer. For example, all floating-point weights are quantized to 8-bit integers in INT8 quantization jacob2018quantization.

If the original weights of the DNN exceed the Representing Range, they are quantized to the boundary values; if these clipped weights are of significant importance, the DNN model may suffer a substantial accuracy loss. Uniform quantization jacob2018quantization; migacz20178 is widely used both in academia and industry because of its guarantee of high accuracy; for example, many open-source frameworks (e.g., PyTorch, TensorFlow) support INT8 quantization. Although the Representing Range of INT8 is broad enough for most DNN models, uniform quantization suffers a drastic accuracy drop at lower bit-widths: the fewer quantized numbers available to cover all the weights, the farther apart they are from each other, which leads to a significant quantization error.
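To make the affine mapping concrete, the following is a minimal NumPy sketch of symmetric uniform quantization (our own simplification; real INT8 schemes such as jacob2018quantization also handle zero-points and per-channel scales):

import numpy as np

def uniform_quantize(w, num_bits=8):
    """Symmetric uniform quantization: map weights onto the integer grid
    [-(2^(b-1)-1), 2^(b-1)-1], round, then map back (dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax                  # Representing Range set by the largest weight
    q = np.clip(np.round(w / scale), -qmax, qmax)   # rounding plus clipping
    return q * scale                                # dequantized values lie on a uniform grid

w = np.random.randn(1000).astype(np.float32)
for b in (8, 4, 2):
    mse = float(np.mean((w - uniform_quantize(w, b)) ** 2))
    print(f"{b}-bit uniform quantization: MSE = {mse:.6f}")   # error grows sharply as bit-width drops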

2.2 Non-uniform quantization

Non-uniform quantization methods ye2018unified; mcdanel2019full, similar to clustering, aim at finding some “centers” (i.e., average values) that can represent the majority of the weight values in the original weight distribution. Therefore, these quantization methods need to store the “centers”, and each quantized number stores the index of a “center” (as in Deep Compression han2016deep), which causes an obvious resource cost.

Power-of-two quantization johnson2018rethinking; mcdanel2019full quantizes weights to the form of exponents (of powers of two). The quantized number stores the exponent; for example, a 3-bit quantized value “111” means that the exponent is 7 and the original value is two raised to that exponent. It has a good Representing Range since the range of quantized weights grows exponentially as the quantization bit-width increases. Besides, the power-of-two scheme is hardware-friendly: a multiplication can be converted into a shift operation. For example, multiplication by a power of two with exponent 2 can be substituted by shifting the bit sequence two bits to the left. Such conversion is of great significance for DNN hardware accelerators implemented in ASIC or FPGA ding2019req, because a tremendous number of power- and resource-consuming DSPs in the multipliers can be substituted by simple and effective Look-Up Tables (LUTs). Taking the Xilinx Artix-7 as an example, the LUT resources of this FPGA are hundreds of times its DSP slices, so LUTs are a promising alternative for implementing multiplication (https://china.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf). Such conversion can typically bring up to 50% resource reductions on Xilinx FPGAs faraone2019addnet.
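As a concrete illustration (a minimal sketch of our own, not the paper's hardware kernel), multiplying an integer activation by a power-of-two weight reduces to a bit shift:

def pow2_multiply(activation_int, exponent):
    """Multiply an integer activation by a weight equal to 2**exponent.
    A non-negative exponent is a left shift; a negative exponent is a
    (truncating) right shift."""
    if exponent >= 0:
        return activation_int << exponent
    return activation_int >> (-exponent)

# 13 * 2^2 = 52: the multiplier is replaced by a two-bit left shift.
assert pow2_multiply(13, 2) == 52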

3 Key idea of AUSN quantization

How closely the weights in a quantized model can approach their original values inherently determines the quantization error. We define two metrics to evaluate the quantization error; a minimal sketch for measuring them on a weight tensor follows the definitions.

  • Rounding Error: The error caused by rounding a weight to the nearest quantized value.

  • Clipping Error: The error caused by clipping a weight outside the Representing Range to the boundary value.
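As a rough sketch (our own measurement code, not the paper's), the two errors can be estimated for any quantizer with a representing range [-R, R]:

import numpy as np

def clipping_and_rounding_error(w, quantize, rep_range):
    """Split the total squared quantization error into a clipping part
    (weights outside [-rep_range, rep_range]) and a rounding part (the rest)."""
    w_q = quantize(w)
    sq_err = (w - w_q) ** 2
    clipped = np.abs(w) > rep_range
    return float(sq_err[clipped].sum()), float(sq_err[~clipped].sum())

# Example with a toy 2-bit power-of-two quantizer covering {±1/8, ±1/4, ±1/2, ±1}.
def pow2_quantize(w):
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(w), 1e-12))), -3, 0)
    return np.sign(w) * 2.0 ** exp

w = np.random.randn(10000) * 0.5
print(clipping_and_rounding_error(w, pow2_quantize, rep_range=1.0))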

(a) An example showing the shortcoming of existing quantization
(b) An example of a double superposition of power-of-two values in the AUSN quantization method
Figure 1: The advantages of AUSN over other quantizations: (a) shows that INT8 quantization rounds the weights up/down to evenly distributed integers, while the quantized numbers in power-of-two quantization are distributed exponentially; these cause Clipping Errors and Rounding Errors, respectively. (b) shows how AUSN eliminates the above errors.

Uniform quantization pursues a finer Representing Resolution, which results in a larger Clipping Error under a limited bit-width: the finer the Representing Resolution, the narrower the Representing Range, and vice versa. Thus, the accuracy of the DNN model inevitably decreases as the bit-width is reduced. As shown in Fig. 1(a), the values produced by 2-bit uniform quantization with a fixed Representing Resolution can only cover a narrow range, so a value outside this range has to be quantized to the boundary, resulting in a significant Clipping Error.

In contrast, non-uniform quantization focuses on the Representing Range, which causes a higher Rounding Error. In Fig. 1(a), 2-bit power-of-two quantization can represent values over a much wider Representing Range; however, the Representing Resolution becomes coarse when the exponent is large. In this example, a weight value falling between two large adjacent quantized numbers has to be rounded to one of them, rendering a significant Rounding Error. Also, power-of-two quantization requires one extra bit to represent the sign of the power.

For a given bit-width, there is always a trade-off between Representing Range and Representing Resolution. Our method is also driven by the idea of eliminating these errors. The fundamental objective is to improve power-of-two quantization to achieve a finer Representing Resolution while keeping its good property of a wide Representing Range at low bit-width. Most of the weights of a CNN are concentrated near zero and few of them have a large absolute value, as shown in Fig. 2(b). This well-known long-tail phenomenon is commonly observed in most CNNs. Therefore, it is not worthwhile to spend a large number of bits representing all the weights evenly, and we find an opportunity to reduce the bit-width of the quantized numbers without accuracy loss.

We leverage the superposition of multiple power-of-two quantized numbers to represent the data, rather than rounding the data to the closest power of two. As in the example in Fig. 1(b), a weight value that is not itself a power of two can be represented by the superposition of two or three power-of-two terms, which eliminates the rounding error.
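As a hypothetical numeric illustration (our own numbers, not those in the figure): the weight 0.375 is not a power of two, so plain power-of-two quantization must round it to 0.25 or 0.5 with an error of 0.125, whereas the double superposition

0.375 = 2^-2 + 2^-3

represents it exactly, with zero rounding error.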

However, the weight distribution differs across the layers of a model wang2019haq. Thus, it is critical to choose the proper number of superpositions to make the quantized value “close” to the original value. More importantly, our method AUSN (as shown in Fig. 3) can also adaptively allocate the composition of the bit-width according to the weight distribution of each layer. To combine the high accuracy and finer Representing Resolution of uniform quantization with the broad Representing Range and hardware acceleration of the power-of-two method, we subdivide the limited bit-width and find the balance between Clipping Error and Rounding Error.

4 AUSN Quantization

(a) An example of the division of the bit-width
(b) The weight distribution of the AlexNet model
Figure 2: The data format and quantized model of AUSN quantization: (a) shows the data format of AUSN quantization. (b) shows the weight distribution of the model quantized by AUSN together with the original weight distribution of the model (the blue line). The black, red, and green points represent values that were not superposed, superposed once, and superposed twice by powers of two, respectively.

4.1 Decoder-free coding scheme for superposition

We divide the bit-width into three parts: the “sign”, the “basic part”, and the “subdivision part” (the following discussion concerns the data part, i.e., unsigned numbers). In Fig. 2(a), taking 6 bits as an example, the bit-width of general quantization consists of one sign bit and five data bits, while the bit-width of power-of-two quantization is composed of two sign bits and four data bits. AUSN separates the original five data bits into a three-bit “basic” part and a two-bit “subdivision” part. By sharing the same sign bit of the value, the superposition of the “basic” part and the “subdivision” part saves one bit. Given the total bit-width, the proposed AUSN coding scheme allocates the bit-width between the “basic” and “subdivision” parts to ensure a good trade-off between the resolution and the range of the weight distribution, via the adaptive adjustment described later.
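A minimal sketch of how such a code word could be packed and unpacked (the field widths follow the 6-bit example above: 1 sign bit, 3 basic bits, 2 subdivision bits; the exact field layout is our assumption):

def pack_ausn(sign, basic, subdivision, basic_bits=3, sub_bits=2):
    """Pack the sign / basic-part / subdivision-part fields into one small integer.
    Assumed layout: [sign | basic | subdivision], most significant field first."""
    assert sign in (0, 1)
    assert 0 <= basic < (1 << basic_bits) and 0 <= subdivision < (1 << sub_bits)
    return (sign << (basic_bits + sub_bits)) | (basic << sub_bits) | subdivision

def unpack_ausn(code, basic_bits=3, sub_bits=2):
    sub = code & ((1 << sub_bits) - 1)
    basic = (code >> sub_bits) & ((1 << basic_bits) - 1)
    sign = code >> (basic_bits + sub_bits)
    return sign, basic, sub

code = pack_ausn(sign=1, basic=5, subdivision=2)
assert unpack_ausn(code) == (1, 5, 2)   # only bit masks are needed, no decoder table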

Figure 3: The execution process of AUSN Quantizer

4.2 Quantization scheme with superposition

Note that the value stored in the quantized bits is not the quantized number itself but its power. If a certain number of bits is used for the basic part, we have an intuitive power basis whose powers are all negative; thus, we can store them as unsigned numbers and save one “sign” bit. In practice, however, the range of this power basis may mismatch the range of the weights. We therefore introduce a scaling operation called “PreConvert” to bring the Representing Range of the quantized model closer to the weight distribution of the original model and to reduce the Clipping Error. Since the values stored in the bit-width are powers, a scaling operation on the original values becomes a shift operation on the power.
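The value-scaling/power-shift equivalence is just the identity 2^p · 2^s = 2^(p+s); a one-line check (our own illustration):

p, s = -3, 2    # a stored power and a hypothetical "PreConvert" shift amount
assert (2.0 ** p) * (2.0 ** s) == 2.0 ** (p + s)   # scaling the value == adding s to the power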

The specific process of AUSN quantization is performed as follows:

First, we calculate the power corresponding to the largest weight of the layer:

(1)

Then, we perform the “PreConvert” operation:

(2)

We denote the power of the basic part after the “PreConvert” operation as the converted power; it satisfies Equation (3), which means we use it to approximate the range of the weights.

(3)

Furthermore, for each additional superposition, which has its own bit allocation, we also have a corresponding superposition power. Then, we use Algorithm 1 to derive the superposition for each quantized number.

Input: Weight value, maximum number of superpositions, power basis for each superposition
Output: The quantized representation of the weight (the selected powers)

1:  Let the residual equal the weight and the current superposition number equal 0
2:  while the residual is not negligible and the superposition number is below the maximum do
3:     round the residual to the nearest power of two available in the current power basis
4:     subtract the selected power of two from the residual
5:     increase the superposition number by 1
6:  end while
7:  return the selected powers
Algorithm 1 The procedure for AUSN quantization with superposition

With the AUSN superposition procedure thus defined (as shown in Fig. 2(b)), an approximate representation of a weight can be expressed as

\tilde{W} \approx \mathrm{sign}(W)\sum_{i} 2^{p_i} \qquad (4)

Our AUSN algorithm can naturally round the weight to the quantized number and decide whether to superpose. We can also apply Algorithm 1 to the activations.
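The following Python sketch gives one concrete reading of Algorithm 1: repeatedly round the residual to a nearby power of two and subtract it, stopping when the residual is negligible or the superposition budget is spent. The variable names, the power range, and the stopping threshold are our assumptions rather than the paper's exact specification.

import math

def ausn_quantize(w, max_superpositions=2, p_min=-7, p_max=0):
    """Greedy superposition of power-of-two terms: round the residual to a nearby
    power of two (nearest in log scale), subtract it, and repeat until the residual
    is negligible or the superposition budget is spent."""
    sign = 1.0 if w >= 0 else -1.0
    residual, exponents = abs(w), []
    for _ in range(max_superpositions):
        if residual < 2.0 ** p_min / 2:          # residual too small to matter
            break
        p = int(round(math.log2(residual)))      # nearest power of two in log scale
        p = max(p_min, min(p_max, p))            # clip into the representable power range
        exponents.append(p)
        residual -= 2.0 ** p
        if residual <= 0:                        # greedy overshoot: stop here
            break
    return exponents, sign * sum(2.0 ** p for p in exponents)

# 0.55 ~ 2^-1 + 2^-4 = 0.5625 with a double superposition (error 0.0125,
# versus 0.05 for the single nearest power of two, 0.5).
print(ausn_quantize(0.55, max_superpositions=2))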

4.3 Adaptively adjusting the coding scheme

In addition to rounding each weight to its quantized number and deciding whether to superpose, AUSN adaptively adjusts the internal allocation of the bit-width between the basic part and the subdivision part according to the Clipping Error and the Rounding Error. For example, under 5-bit AUSN quantization, the weight distribution of the CONV1 layer may call for 4 bits for the basic part and 1 bit for the subdivision part, while the CONV2 layer may use 3 bits for the basic part and 2 bits for the subdivision part. Suppose that the initial distribution of the weights in a layer is normal. Without loss of generality, we assume the weights follow a standard normal distribution clipped at a threshold. Since the quantized distribution is usually symmetric about 0, we only consider the positive part. Given the Representing Range and the set of quantized numbers of AUSN quantization, we can use the following equation to express the Clipping Error and the Rounding Error:

(5)

First, AUSN initializes the coding scheme according to the given bit-width and quantizes the weights. Then, the Clipping Error and Rounding Error between the quantized weights and the original weights are calculated according to Equation (5). Last, based on these two errors, AUSN adjusts the boundary of the Representing Range and the set of quantized values (that is, the allocation of the bit-width). This process is iterated until an optimal allocation of the bit-width is found.
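A hedged sketch of this adjustment loop (the concrete update rule is not spelled out above, so the brute-force search over basic/subdivision splits and the toy layer quantizer below are our own illustration):

import numpy as np

def toy_ausn(w, basic_bits, sub_bits, p_max=0):
    """Stand-in layer quantizer: the basic part sets how many exponent levels
    exist (range), the subdivision part sets how many extra power-of-two terms
    may be superposed (resolution). A toy model of the real coding scheme."""
    n_levels, max_terms = 2 ** basic_bits, sub_bits + 1
    out = np.zeros_like(w)
    for i, x in enumerate(w):
        residual, acc = abs(x), 0.0
        for _ in range(max_terms):
            if residual <= 0:
                break
            p = int(np.clip(np.round(np.log2(max(residual, 1e-12))),
                            p_max - n_levels + 1, p_max))
            acc += 2.0 ** p
            residual -= 2.0 ** p
        out[i] = np.sign(x) * acc
    return out

def best_bit_allocation(w, total_bits=5):
    """Try every split of the data bits into (basic, subdivision) and keep the
    split with the smallest total (clipping + rounding) quantization error."""
    data_bits, best = total_bits - 1, None          # one bit is reserved for the sign
    for basic_bits in range(1, data_bits + 1):
        sub_bits = data_bits - basic_bits
        err = float(np.mean((w - toy_ausn(w, basic_bits, sub_bits)) ** 2))
        if best is None or err < best[0]:
            best = (err, basic_bits, sub_bits)
    return best

w = np.random.randn(2000) * 0.2
print(best_bit_allocation(w, total_bits=5))         # (error, basic_bits, sub_bits)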

4.4 AUSN rounding scheme

Although AUSN quantization effectively addresses the accuracy degradation of power-of-two quantization, two critical issues remain for almost all quantization methods:

  1. Overflow of Bit-width: The output of the convolution operation (multiply and accumulate) occupies a larger bit-width than the quantized numbers in order to maintain precision. For example, convolving two matrices consisting of 8-bit numbers first produces 16-bit dot-products, which are then accumulated into a 32-bit sum (a quick arithmetic check follows this list).

  2. Overhead of Re-quantization: In deep neural networks, the output of one layer is the input of the next layer. The outputs are very likely to exceed the range of the quantized numbers accepted as inputs by the next layer. Thus, existing methods have to re-quantize the output values, and such re-quantization results in a significant computation and resource cost.
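As a quick check of the bit growth (our own arithmetic, with an illustrative accumulation size): an 8-bit activation times an 8-bit weight gives a 16-bit product, and accumulating N such products needs roughly log2(N) extra bits.

import math

n_products = 64 * 64                  # hypothetical: one output accumulated over 4096 MACs
product_bits = 8 + 8                  # 8-bit activation x 8-bit weight -> 16-bit product
acc_bits = product_bits + math.ceil(math.log2(n_products))
print(acc_bits)                       # 28 bits, far wider than the 8-bit input format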

We formulate the above problems as follows. Suppose the next layer only accepts an activation value with a double superposition due to the bit-width limit, but in the process of multiplication with AUSN the number of superposed terms increases (e.g., from two to three). Thus, we have to eliminate some power-of-two terms (one term in this example) and round the activation value up/down with a minimized rounding error. Consequently, after the multiplication operation, we can substitute the accumulation operation with the rounding scheme, which eliminates the overflow and re-quantization issues.

The basic principle of the rounding operation is simple: given data in the form of a polynomial of powers of two, determine the powers that remain after rounding so that the rounding error is minimized.

The rounding scheme can be categorized into the following four scenarios:

Scenario 1: wherein the bit-width of the subdivision part indicates the number of superpositions.

Scenario 2: when the corresponding condition is satisfied, the merged result follows directly.

Scenario 3: if the corresponding condition holds, the terms are merged accordingly.

Scenario 4: if the corresponding condition holds, the value is rounded up or down.

Based on the above scenarios, we round the data using the following steps:

  • Find the maximal set of terms in the data to which Scenario 1 applies.

  • Find the pairs of terms to which Scenario 2 applies.

  • Apply Scenario 3 if its condition is met.

  • If Equation (6) holds, carry out Scenario 4.

(6)

Following the above steps, we can reduce the number of power-of-two terms to satisfy the bit-width constraint while minimizing the rounding error. Take a partial sum whose target quantized format allows only a single superposition of powers of two as an example: Step 1 merges a pair of terms; Step 2 merges another pair; the condition of Step 3 is not met, so Step 4 is performed and the remaining terms are merged by rounding; finally, the result satisfies the requirement of the quantization format.
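A rough sketch of the spirit of this rounding scheme (our own reading; the scenario priorities and the brute-force fallback below are assumptions): merge equal exponents exactly (2^a + 2^a = 2^(a+1)) and, if the term budget is still exceeded, re-approximate the value with at most the allowed number of power-of-two terms.

import math
from collections import Counter

def merge_equal_exponents(exponents):
    """2^a + 2^a = 2^(a+1): merge duplicated exponents exactly (no error)."""
    counts = Counter(exponents)
    e = min(counts)
    while e <= max(counts):
        if counts[e] >= 2:
            pairs = counts[e] // 2
            counts[e] -= 2 * pairs
            counts[e + 1] += pairs
        e += 1
    return sorted((exp for exp, c in counts.items() if c), reverse=True)

def round_to_budget(exponents, max_terms=2):
    """If merging still leaves too many terms, greedily re-approximate the value
    with at most max_terms powers of two (a stand-in for the remaining scenarios)."""
    residual, kept = sum(2.0 ** e for e in exponents), []
    for _ in range(max_terms):
        if residual <= 0:
            break
        p = round(math.log2(residual))
        kept.append(p)
        residual -= 2.0 ** p
    return kept

terms = merge_equal_exponents([3, 3, 1, 1, 0])     # 8+8+2+2+1 -> exponents [4, 2, 0] (= 21)
print(terms, round_to_budget(terms, max_terms=2))  # kept terms [4, 2] (= 20), rounding error 1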

5 Experiments

To evaluate the performance of our algorithm, we quantize several models and verify their accuracy on different datasets. Using Cifar10 krizhevsky2009learning and ImageNet deng2009imagenet, we compare AUSN with several state-of-the-art quantization methods on CNNs. We implement AUSN quantization using the CNN model structures and pre-trained models of the PyTorch library (https://github.com/pytorch/vision/tree/master/torchvision/models).

In addition, we verify a sequential network and the network used in the object detection task on the Google Speech Commands, VoxCeleb, and COCO datasets, and compare our results with the full-precision baseline models to demonstrate the broad applicability of our approach.

Remarkably, we quantize the pre-trained models without fine-tuning or retraining.

5.1 Result of quantized networks on Cifar10

The data in Table 1 show that both the self-adapting shifting and the better resolution brought by the subdivision part enhance our results. For the 2-bit results, both schemes quantize weights into powers of two, but AUSN finds a better covering range and thus achieves a better result. For the 3-, 4-, and 5-bit results, our AUSN method keeps the basic part at 3 bits and sets the subdivision part to 0, 1, and 2 bits, respectively. Compared with the INQ results, the accuracy gain of AUSN for each added bit is much larger than that of the INQ counterpart. This means that when the representing range is guaranteed (by the bit-width of the basic part in AUSN, or by the whole bit-width in INQ), it is the resolution (provided by the bit-width of the subdivision part in AUSN) that helps the accuracy grow.

In particular, we quantize the pruned ResNet-18 model to show that AUSN cooperates well with pruning.

Model | Baseline (32bit) | AUSN quantization (Ours): 5bit / 4bit / 3bit / 2bit Top-1 (%) | power-of-two (INQ zhou2017incremental): 5bit / 4bit / 3bit / 2bit Top-1 (%)
AlexNet | 85.13 | 85.21 / 85.00 / 84.80 / 83.21 | 81.90 / 81.89 / 80.29 / 80.27
GoogleNet | 90.89 | 90.85 / 90.01 / 89.37 / 89.11 | 88.28 / 88.28 / 88.27 / 88.21
VGG16 | 93.92 | 93.95 / 93.58 / 93.68 / 92.75 | 91.88 / 91.84 / 91.27 / 91.26
ResNet18 | 92.22 | 92.25 / 91.83 / 91.80 / 91.46 | 90.47 / 90.31 / 90.14 / 90.10
ResNet18 (pruned) | 94.17 | 94.11 / 94.09 / 93.52 / 92.65 | 92.87 / 92.85 / 92.83 / 88.15
Table 1: Accuracy on Cifar10 for typical quantized networks at different bit-widths. The last row is the pruned ResNet18 model at 50% sparsity. The compared quantization is INQ.

5.2 Result of quantized networks on ImageNet

In the experiments on ImageNet, we choose the widely used ResNet18 and ResNet50 models and compare our results with other quantization schemes. Table 3 shows that the Top-1 accuracy loss of our 5-bit AUSN is less than 0.1%, far better than other schemes.

Method | SR+DR | INT | RQ ST | PACT | QIL | Ours
Bit-width (W/A) | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5
Acc. loss (%) | -10.5 | -2.5 | -1.6 | -0.3 | -0.8 | -0.3
Table 2: ResNet18 on ImageNet. Top-1 accuracy loss (%) with quantized weights and activations by various quantization methods, including SR+DR gysel2018ristretto, INT jacob2018quantization, RQ ST louizos2018relaxed, PACT choi2018pact, QIL jung2019learning.

Meanwhile, we quantize the weights of the pre-trained models without fine-tuning and obtain the results in Table 3, which means the time spent on quantization is significantly reduced. In Table 2, we quantize both weights and activations to optimize the inference process. Owing to the advantages mentioned above, our AUSN method reaches state-of-the-art results among existing quantization schemes.

5.3 Result of quantizing sequential models and Yolo

We explore the effect of AUSN quantization on sequential networks such as GRU, CRNN zhang2017hello, and TDNN snyder2018x using the Google Speech Commands and VoxCeleb datasets, and its effect on the object detection task using YOLOv3-tiny on the COCO dataset, as shown in Table 3. There is little decrease in accuracy (i.e., mAP for YOLOv3-tiny on COCO and Top-1 accuracy for the sequential models) when quantizing the weights from 32 bits to 4 or 5 bits.

ImageNet Dataset

ResNet50* (baseline: 76.13%):
Method | bit-width | Top-1 (%) | Drop in Top-1 (%)
Dual choukroun2019low | 4bit | 70.20 | -5.93
INQ zhou2017incremental | 5bit | 74.81 | -1.59
Focused compression zhao2019focused | 5bit | 74.86 | -1.54
ADMM Quantization ye2019progressive | 6bit | 75.93 | -0.2
SYMM nayak2019bit | 6bit | 72.58 | -3.55
Biscaled-FxP jain2019biscaled | 6bit | 70.46 | -5.67
V-Q park2018value | 7bit | 75.89 | -0.24
INT8 jacob2018quantization | 8bit | 74.9 | -1.5
Ours | 4bit | 75.37 | -0.76
Ours | 5bit | 76.09 | -0.04

ResNet18* (baseline: 69.76%):
Method | bit-width | Top-1 (%) | Drop in Top-1 (%)
BWN rastegari2016xnor | 2bit | 60.8 | -8.5
Dual choukroun2019low | 4bit | 66.63 | -3.13
LAPQ nahshan2019loss | 4bit | 62.6 | -7.16
Focused compression zhao2019focused | 5bit | 68.36 | -1.40
UNIQ baskin2018uniq | 5bit | 68.00 | -1.76
DFQ nagel2019data | 6bit | 66.3 | -3.4
INT8 jacob2018quantization | 8bit | 67.3 | -2.4
RQ louizos2018relaxed | 6bit | 68.6 | -1.16
Ours | 4bit | 68.84 | -0.92
Ours | 5bit | 69.67 | -0.09

Speech Command Dataset

GRU (Network cfg: S = 40, N = 154):
baseline | 32bit | 93.62 | -
Ours | 4bit | 93.15 | -0.47
Ours | 5bit | 93.47 | -0.15

GRU (Network cfg: S = 20, N = 400):
baseline | 32bit | 94.62 | -
Ours | 4bit | 94.34 | -0.28
Ours | 5bit | 94.57 | -0.05

CRNN (first cfg):
baseline | 32bit | 93.50 | -
Ours | 4bit | 93.17 | -0.33
Ours | 5bit | 93.38 | -0.12

CRNN (second cfg):
baseline | 32bit | 94.56 | -
Ours | 4bit | 94.28 | -0.28
Ours | 5bit | 94.48 | -0.08

VoxCeleb Dataset, TDNN (Top-1 %):
baseline | 32bit | 80.38 | -
Ours | 4bit | 79.98 | -0.40
Ours | 5bit | 80.25 | -0.13

COCO Dataset, YOLOv3-tiny (mAP):
baseline | 32bit | 33.1 | -
Ours | 4bit | 31.6 | -1.5
Ours | 5bit | 32.2 | -0.9

  • S: Frame Stride, N: Cells of GRU

  • The network cfg of the first CRNN is C(48,10,4,2,2)-N(60)-N(60)-F(84); that of the second CRNN is C(100,10,4,2,1)-N(136)-N(136)-F(188), where C: shape of the convolutional layer, N: cells of the GRU, F: cells of the fully connected layer

Table 3: Accuracy comparison of DNNs quantized by existing methods and by AUSN on image recognition, object detection, and speech recognition datasets. The models quantized by AUSN are obtained without fine-tuning or any augmentation tricks.

5.4 Result of quantized networks deployed on FPGA

Resources | Multiplier & Accumulator mcdanel2019full | Shifter & Adder (needs decoder) | Ours | Performance (saving of Ours, ×)
LUT | 212388 / 187262 / 181248 | 225280 / 212942 / 203712 | 133120 / 112071 / 108544 | 2.0
FF | 192293 / 143142 / 108729 | 86317 / 512731 / 45729 | 54313 / 45127 / 44032 | 4.4
Energy Consumption | 4.21W / 3.75W / 3.67W | 4.51W / 4.26W / 4.07W | 2.65W / 2.24W / 2.17W | 1.9
  • A/B refers to an A-bit input multiplied by B-bit weights; the three values in each group correspond to different A/B settings
Table 4: The comparison of FPGA resources and energy consumption for a 64 × 64 MAC array

AUSN does not need a decoder: a decoder is necessary only when control information is encoded into the bit word. AUSN directly “multiplies” the input by the exponents composing the weight in parallel through shift operations, whose results are added up (superposed) into the output. AUSN is not only as easy to compute as uniform quantization (like INT8) but also hardware-friendly, because the expensive multiplication is replaced by simple bit-shift and add operations. Compared with a traditional 8-bit multiply-accumulator, the shifter can be realized using only logic units such as LUTs. As shown in Table 4, synthesized with the Xilinx Vivado HLS suite on a Xilinx ZCU104 platform, the model quantized by AUSN not only requires no DSPs but also reduces LUTs by 2.0×, FFs by 4.4×, and power by 1.9×, which dramatically reduces the hardware cost.

6 Conclusion

In this paper, we introduced a novel quantization method called AUSN, which combines the advantages of the broad representing range of power-of-two quantization and the constant representing resolution of uniform quantization. This method can quantize both the activations and weights of a pre-trained model to low bit-width (less than 8 bits) without accuracy loss. AUSN shows excellent generality because it can adaptively adjust the number of superpositions and the coding scheme given a fixed bit-width. We further design a rounding scheme in AUSN to eliminate the overhead of re-quantization. Thorough experiments show the superiority of AUSN quantization over state-of-the-art methods. Unlike other quantization methods, AUSN quantizes the pre-trained model without retraining. Notably, AUSN quantization is hardware-friendly: the synthesis results on FPGA (see Appendix B for details) prove that AUSN can effectively reduce resource and energy consumption. We also measure the effectiveness of quantization methods by information loss from a theoretical perspective and analyze the relationship between quantization methods and hardware performance (see Appendix A).

Broader Impact

The high accuracy of DNNs is achieved by consuming a lot of computing and storage resources, which significantly hinders the application of DNNs on edge devices. Quantization is proposed and widely used to reduce storage and computation. AUSN quantization lowers the precision demands of operations and operands, so fewer bits represent them, which reduces the cost of the datapath and memory. Furthermore, it can also reduce the energy consumption and hardware resources and expand the accelerator design space of many terminal devices with limited hardware resources. It is worth noting that AUSN is a general quantization method, which can be used in many tasks such as image recognition, speech recognition, and object detection.

Appendix A Theoretical analysis

We explore the choice of quantization in the context of information theory, which enables us to reinterpret quantization as a Minimum Description Length (MDL) problem hinton1993keeping; ullrich2017soft. Both MDL and quantization can be regarded as a search problem, aiming at finding the optimal way to compress data while balancing the accuracy and the complexity (including the bit-width, quantization method, etc.) of the model. We define a loss function to measure the loss of information between the distribution of the original weights and the distribution of the quantized weights, consisting mainly of an error term and a complexity term. Specifically:

(7)

where the error term measures the misfit between the original model and the quantized model on the dataset they are trained on: the dataset consists of inputs and expected outputs, and the error term can be expressed through the probability of producing the correct output given the input and the quantized model. The complexity term measures the complexity of the quantization method.

According to information theory, the KL divergence measures the average number of extra bits required to encode one distribution using another, i.e., the distance between them. We consider the quantized weight distribution as an approximation of the original weight distribution, so the error term takes the form of a KL divergence, which measures the similarity between the two distributions. To compute it, we first divide the original weight distribution according to the set of points of the quantized distribution, and then calculate the difference in probability between the two:

(8)

where the first factor is the weight distribution over the real numbers, the second is the approximate count of weights falling into the bin of each quantized value, and the bins are determined by the set of quantized numbers.
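As a rough sketch of how this information loss can be measured in practice (our own code, with our own binning convention), one can histogram the original and quantized weights on bins centered at the quantized values and compute the KL divergence:

import numpy as np

def kl_information_loss(w, w_q, eps=1e-12):
    """Approximate KL(p || q): p is the original weight distribution and q the
    quantized one, both histogrammed on bins centered at the quantized values."""
    centers = np.unique(w_q)
    mids = (centers[1:] + centers[:-1]) / 2
    lo = min(w.min(), centers[0]) - 1.0
    hi = max(w.max(), centers[-1]) + 1.0
    edges = np.concatenate(([lo], mids, [hi]))
    p = np.histogram(w, bins=edges)[0] + eps
    q = np.histogram(w_q, bins=edges)[0] + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage: compare a weight tensor with a (hypothetical) power-of-two quantized copy.
w = np.random.randn(10000) * 0.1
w_q = np.sign(w) * 2.0 ** np.clip(np.round(np.log2(np.abs(w) + 1e-12)), -6, -1)
print(kl_information_loss(w, w_q))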

Bit-Width | Quant | KL divergence | Accuracy loss | Information loss
5bit | Ours | 0.014 | -0.001 | 0.013
5bit | INQ | 0.004 | 3.23 | 3.234
4bit | Ours | 0.062 | 0.13 | 0.192
4bit | INQ | 0.004 | 3.24 | 3.244
3bit | Ours | 0.261 | 0.15 | 0.411
3bit | INQ | 0.004 | 4.84 | 4.844
Table 5: The information loss of AlexNet quantized by AUSN on Cifar10

The smaller the information loss, the smaller the accuracy loss induced by quantization; in Table 5, the quantization with the lowest information loss at each bit-width gives the best quantization effect.

Appendix B Efficiency Analysis

b.1 Analysis on the Evaluation

Most existing quantization methods focus on reducing the bit-width and improving accuracy. However, these approaches ignore the limitations of the hardware. As the bit-width of the quantized model decreases, the reductions in resource and energy consumption manifest themselves differently in different accelerator architectures and arithmetic logics. Fig. 4 describes the evaluation method of the proposed AUSN quantization. Generally speaking, multiplication is more expensive than addition and shift operations in terms of the complexity of the digital circuits, the number of transistors, and the power consumption. For example, on FPGA, a shift operation implemented with Look-Up Tables (LUTs) is far more energy-efficient (https://china.xilinx.com/support/documentation/sw_manuals/xilinx14_7/ug440-xilinx-power-estimator.pdf) than the multiplication and accumulation operations conducted with Digital Signal Processors (DSPs). The DSPs on an FPGA are also limited, and the difference between DSP resources and LUT resources is more than two orders of magnitude.

Power-of-two quantization transforms multiplication into shift operations, making the quantized network hardware-friendly and balancing the cost of hardware resources. For example, when a quantized DNN (with weights quantized to powers of two and inputs quantized to fixed-point numbers) is deployed on an FPGA, the multiplications can be converted into shift operations on fixed-point numbers. Although shift operations can alleviate the shortage of DSP resources to a certain extent, the accumulation still depends on DSPs and requires additional hardware overhead for re-quantization. LUT resources may then become a new bottleneck because of the bit-width of the inputs. Moreover, due to the significant rounding error caused by power-of-two quantization, the accuracy inevitably decreases, so power-of-two quantization cannot be pushed further.

To further reduce the bit-width and computing resources, both the weights and activations can be quantized by power-of-two quantization, which further converts the shift operation into an addition on the exponents; however, this leads to a more significant accuracy decrease. In contrast, AUSN quantization not only reduces the accuracy loss but also keeps the use of LUT resources reasonable.

Figure 4: The evaluation of AUSN quantization.

We can quantize weights and activations using the proposed AUSN quantization and transform re-quantization into AUSN rounding. This eliminates the intermediate values (i.e., the partial sums generated in the computation of convolutional layers) that would otherwise be stored in Block RAMs (BRAMs), while preserving the final accuracy and making full use of LUT resources. In theory, we can even deploy a DNN quantized by AUSN on an FPGA without using DSP resources, or design an accelerator that realizes parallel computation of convolution layers with both LUTs and DSPs.

b.2 Analysis on LUT Resource

The LUT resources consumed by AUSN quantization are far less than those consumed by shift operations on fixed-point numbers, especially as the bit-width of the inputs is further reduced. The specific analysis is as follows (taking 6 bits for both weights and activations as an example).

(a) LUT consumed by the shift operation

(b) LUT consumed by the AUSN quantization
Figure 5: The consumption of LUT resources.

To make good use of the resources, we implement the addition and shift operations with 6-input LUTs (available on modern Xilinx FPGAs, such as the whole 6 and 7 series). If the inputs and weights are quantized to fixed-point numbers and powers of two, respectively, then the multiplication of an input and a weight is implemented by a shift operation, and the bit-width of the multiplication result is 12 bits (6 bits × 6 bits → 12 bits).

In contrast, when we quantize both the inputs and weights by AUSN quantization, the shift operation on the value is converted into an addition on the powers. We split the addition into 3 steps, each consuming 4 LUTs. As a result, the output only needs 7 bits (6 bits + 6 bits → 7 bits). Fig. 5 describes the consumption of LUT resources.

We find that the number of 6-input LUTs consumed by a 6-bit multiplication implemented as a shift on the value (Fig. 5(a)) is considerably larger than the number of 6-input LUTs consumed by the addition on the powers (Fig. 5(b)). Hence, AUSN quantization consumes fewer LUT resources than the shift operation on fixed-point numbers. Remarkably, there is almost no accuracy loss for the model quantized by AUSN.

b.3 The analysis on implementation

Figure 6: Optimization space provided by hardware.

We refer to the classic performance evaluation model, the RoofLine model williams2009roofline, to analyze quantization methods and the performance gains of the quantized model on a given machine. It can be concluded that quantization reduces the precision of operations and operands, so that fewer bits represent these operands and the corresponding operations, which significantly increases the Computation to Communication Ratio (CCR) defined in Equation (9).

\mathrm{CCR} = \frac{\text{total amount of computation (operations)}}{\text{total amount of communication (bytes of memory traffic)}} \qquad (9)

The RoofLine model gives the upper bound of the theoretical performance that a model and algorithm can achieve under the limits of the computing power and bandwidth of the accelerator. In Fig. 6, the x-axis represents the CCR, or operational intensity: the higher it is, the higher the memory efficiency. The y-axis represents the computing performance of the accelerator, and the ridge point is the boundary between the two regimes. The bandwidth of the machine limits the area to the left of the ridge point, and the computing power of the machine limits the area to the right. The upper limits of bandwidth and computation are determined by the accelerator and are called the bandwidth roof and the computational roof. The optimization space provided by the machine is the area to the right of the bandwidth roof. If a point lies above the bandwidth roof or the computational roof, the actual performance can only reach its projection onto that roof.

As the model is quantized, the CCR becomes larger and the model moves toward the region to the right of the ridge point, changing from being limited by the accelerator's bandwidth to being limited by its computing power (the transfer between the two points marked in Fig. 6). Although the bit-width of the weights is further reduced, the performance cannot be improved further due to the limitation of the computational roof. For an FPGA, the computing core is the DSPs. The model quantized by AUSN converts multiplication into shift or even addition operations; these operations can be realized by LUTs, which is equivalent to adding another computing core to the FPGA, raising the theoretical peak performance of the FPGA (the computational roof moves up in Fig. 6) and providing more design space for further optimization.
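A back-of-the-envelope sketch of this argument (the peak compute, bandwidth, and layer sizes below are purely illustrative, not measured values):

def roofline_attainable(peak_gops, bandwidth_gb_s, ops, bytes_moved):
    """Attainable performance under the RoofLine model: min(compute roof, CCR x bandwidth),
    where CCR is the number of operations per byte moved to and from memory."""
    ccr = ops / bytes_moved
    return min(peak_gops, ccr * bandwidth_gb_s), ccr

ops = 1e9                                          # a hypothetical layer with 1 GOP of MACs
for bits in (32, 8, 5):
    bytes_moved = 10e6 * bits / 8                  # 10M parameters streamed once at this bit-width
    perf, ccr = roofline_attainable(peak_gops=500, bandwidth_gb_s=10,
                                    ops=ops, bytes_moved=bytes_moved)
    print(f"{bits:>2}-bit: CCR = {ccr:6.1f} ops/byte, attainable = {perf:.0f} GOPS")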

Compared with AUSN, other low bit-width quantization methods (e.g., below 5 bits) can also achieve a high compression ratio, but they may require a decoding step. Decoding consumes considerable hardware resources and energy, which is hardware-unfriendly. As shown in Table 4, the 4-bit shift operation with indexes requires additional decoders on the hardware, resulting in logic resources and power that even exceed the overhead of 8-bit multiplication. Therefore, lowering the bit-width does not necessarily increase the actual benefit.