1 Introduction
Deploying DNNs on edge devices, such as Internet-of-Things devices or mobile phones, is challenging because of the limited computing resources, storage, and energy budget these devices provide. For instance, it takes 16 seconds on a mobile phone to complete one image recognition using VGG16 simonyan2014very, which is intolerable for most applications mao2017modnn. Therefore, it is essential to compress the DNN for lower storage requirements and simpler arithmetic operations.
Among various compression techniques, quantization maps weight values distributed in the infinite space of real numbers to a finite space of discrete numbers. Existing quantization methods, however, encounter several issues. Uniform quantization, such as INT8 jacob2018quantization, uses an affine function to map real-valued weights to uniformly distributed integers. This results in a constant distance between two adjacent quantized numbers, which we denote as the Representing Resolution. The Representing Resolution becomes coarse as the bitwidth decreases, which degrades the model accuracy. Nonuniform quantization, such as power-of-two zhou2017incremental, maps the weight values to an exponential space. It covers an extremely large Representing Range of real numbers with low-bitwidth exponents. However, neither of them can balance Representing Resolution and Representing Range at a low bitwidth. For example, the Representing Resolution of nonuniform quantization is too coarse when the exponent is large, which results in a significant quantization error.
Moreover, we find that some low-bitwidth quantization methods have poor generality: they cannot be applied to tasks such as object detection or to sequential networks. For example, LSQ esser2019learned only quantizes the weights representing important image features in the convolution layers. Thus, this method only works for CNN-based classification tasks.
Most quantization methods presume that a lower bitwidth achieves a higher speedup or a smaller hardware consumption. The decoder they introduce, however, may cause significant hardware and energy overhead. Moreover, different accelerator architectures may need different quantization bitwidths. For instance, Digital Signal Processors (DSPs) and CPUs with AVX prefer 4-bit and 8-bit quantization because the fundamental Multiply-and-Accumulate (MAC) unit can process 4-bit data. It is therefore desirable to build a relationship between the quantization bitwidth and the real hardware/energy consumption. Such a relationship can guide efficient quantization for a specific accelerator architecture.
In this work, we propose AUSN. First, AUSN is not simply an algorithm but a framework that comprises the quantization algorithm, the coding scheme, the rounding scheme, and the adaptive adjusting scheme. Second, we follow the principle of minimum accuracy drop at an extremely small bitwidth, and further salvage redundant bitwidth to gain more quantization choices. The contributions of this paper are as follows:

We are the first to combine these two quantization methods, and we propose a coding scheme that needs no decoder. The bitwidth is divided into two parts: the sign and the data. The data part is further divided into a “basic part” and a “subdivision part” for the superposition, which determine the Representing Range and the Representing Resolution, respectively. We then use the superposition of multiple numbers, each nonuniformly quantized with an extremely small bitwidth, to represent the weights and activations.

We explore the in-depth reasons for the quantization error in existing quantization methods: the Clipping Error and the Rounding Error. Based on these errors, we propose an adaptive adjusting scheme that reallocates the bitwidth according to the weight distribution, reducing the quantization error.

We design a rounding scheme to avoid the extra computation effort and hardware overhead caused by overflow of bitwidth and requantization.

We verify CNNs, sequential networks, and YOLOv3-tiny quantized by AUSN on various datasets without retraining and achieve results that exceed state-of-the-art accuracy. We also demonstrate the savings that AUSN brings in hardware resources and energy consumption on FPGA (see Appendix B).

We measure the effect of quantization on the model through information loss from information theory. We also discuss the trade-off between quantization methods and hardware performance, which can guide accelerator design (see Appendix A).
2 Background
2.1 Uniform quantization
Uniform quantization defines a Representing Range and quantizes the weights in this range by first applying an affine function to them and then rounding to the closest integer. For example, all floating-point weights are quantized to 8-bit integers in INT8 quantization jacob2018quantization.
If the original weights of the DNN exceed the Representing Range, they are quantized to the boundary values; if these weights are of significant importance, the DNN model may suffer substantial accuracy loss. Uniform quantization jacob2018quantization; migacz20178 is widely used in both academia and industry because it guarantees high accuracy. For example, many open-source frameworks (e.g., PyTorch and TensorFlow) support INT8 quantization. Although the Representing Range of INT8 is broad enough for most DNN models, uniform quantization suffers a drastic accuracy drop at lower bitwidths: the fewer quantized numbers are available to cover all the weights, the farther apart they lie from each other, which leads to a significant quantization error.
2.2 Nonuniform quantization
The nonuniform quantization methods ye2018unified; mcdanel2019full, similar to clustering, aim at finding some “centers” (i.e., average values) that can represent the majority of the weight values in the original weight distribution. These quantization methods therefore need to store the “centers”, while each quantized number stores the index of a “center” (as in Deep Compression han2016deep), which causes an obvious resource cost.
Power-of-two quantization johnson2018rethinking; mcdanel2019full quantizes weights to the form of exponents of powers of two. The quantized number encodes the exponent; for example, the 3-bit quantized value “111” means that the exponent is 7 and the original value is the corresponding power of two. This scheme has a good Representing Range, since the range of the quantized weights grows exponentially as the quantization bitwidth increases. Besides, the power-of-two scheme is hardware-friendly: a multiplication can be converted to a shift operation. For example, a multiplication by 4 can be substituted by shifting the bit sequence two bits to the left. Such a conversion is of great significance for DNN hardware accelerators implemented in ASIC or FPGA ding2019req, because the power- and resource-consuming DSPs used as multipliers can be substituted by simple and effective Look-Up Tables (LUTs). Take the Xilinx Artix-7 as an example: this FPGA contains both LUTs and DSP slices, and the LUT resources are hundreds of times more abundant than the DSPs, so LUTs are a promising alternative for implementing multiplication (footnote 1: https://china.xilinx.com/support/documentation/data_sheets/ds190Zynq7000Overview.pdf). Such a conversion can typically bring up to 50% resource reduction on Xilinx FPGAs faraone2019addnet.
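The shift-for-multiply substitution described above can be illustrated with a minimal Python sketch. The function names and the integer activation are hypothetical and chosen only for illustration; the sketch assumes non-negative exponents and unsigned codes, and it is not the accelerator implementation.

```python
def decode_exponent(code: str) -> int:
    """Interpret a bit string such as "111" as an unsigned exponent."""
    return int(code, 2)


def multiply_by_power_of_two(x: int, exponent: int) -> int:
    """Multiply the integer x by 2**exponent with a left shift (exponent >= 0)."""
    if exponent < 0:
        raise ValueError("this sketch only handles non-negative exponents")
    return x << exponent


exponent = decode_exponent("111")   # the 3-bit code "111" stores exponent 7
activation = 5                      # hypothetical integer activation

# Shifting two bits to the left is the same as multiplying by 4.
assert multiply_by_power_of_two(activation, 2) == activation * 4
# A weight stored as exponent 7 multiplies the activation by 2**7.
assert multiply_by_power_of_two(activation, exponent) == activation * 2 ** 7
```

Negative exponents would correspond to right shifts on a fixed-point representation, which this sketch deliberately leaves out.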
3 Key idea of AUSN quantization
How closely the weights in a quantized model approach their original values inherently determines the quantization error. We define two metrics to evaluate the quantization error.

Rounding Error: The error caused by rounding a weight to the nearest quantized value.

Clipping Error: The error caused by clipping a weight that lies outside the Representing Range to the extreme value of that range.
With a limited bitwidth, uniform quantization pursues a finer Representing Resolution, which results in a larger Clipping Error. The finer the Representing Resolution is, the narrower the Representing Range becomes, and vice versa. Thus, the accuracy of the DNN model inevitably decreases as the bitwidth is reduced. As shown in Fig. 1(a), the values quantized to 2 bits by uniform quantization with the same Representing Resolution can only represent a narrow range; a value outside this range has to be quantized to the boundary, resulting in a significant Clipping Error.
In contrast, nonuniform quantization focuses on the Representing Range, which causes a higher Rounding Error. As shown in Fig. 1(a), 2-bit power-of-two quantization can represent values over a wide Representing Range. However, the Representing Resolution becomes coarse when the exponent is large. In this example, a weight value that lies between two adjacent powers of two has to be rounded to one of them, incurring a significant Rounding Error. Also, power-of-two quantization requires one extra bit to represent the sign of the power.
For a given bitwidth, there is always a trade-off between Representing Range and Representing Resolution. Our method is driven by the idea of reducing both errors. The fundamental objective is to improve power-of-two quantization so that it has a finer Representing Resolution while keeping its good property of a wide Representing Range at a low bitwidth. Most of the weights of a CNN are concentrated near zero, and few of them have a large absolute value, as shown in Fig. 2(b). This phenomenon is well known as the “long-tail effect” and is commonly observed in most CNNs. Therefore, it is not worthwhile to spend a large number of bits on representing all the weights evenly, which gives us an opportunity to reduce the bitwidth of the quantized numbers without accuracy loss.
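The trade-off between the two errors can be made concrete with a small numerical sketch. The two 4-level grids below are illustrative assumptions (they are not the values of Fig. 1), the weights are synthetic half-normal magnitudes, and mean squared error is used as one possible way to measure each error.

```python
import numpy as np


def clipping_and_rounding_error(magnitudes, grid):
    """Return (clipping MSE, rounding MSE) of non-negative values on a grid."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    grid = np.sort(np.asarray(grid, dtype=float))
    # Clipping: values above the largest grid point are pulled down to it.
    clipped = np.minimum(magnitudes, grid[-1])
    # Rounding: each clipped value moves to its nearest grid point.
    nearest = grid[np.abs(clipped[:, None] - grid[None, :]).argmin(axis=1)]
    clipping = float(np.mean((magnitudes - clipped) ** 2))
    rounding = float(np.mean((clipped - nearest) ** 2))
    return clipping, rounding


rng = np.random.default_rng(0)
magnitudes = np.abs(rng.normal(0.0, 1.0, size=10_000))  # synthetic weights

uniform_grid = [0.0, 0.25, 0.5, 0.75]       # fine resolution, narrow range
power_of_two_grid = [0.5, 1.0, 2.0, 4.0]    # wide range, coarse resolution

uniform_clip, uniform_round = clipping_and_rounding_error(magnitudes, uniform_grid)
pow2_clip, pow2_round = clipping_and_rounding_error(magnitudes, power_of_two_grid)
print(f"uniform:      clipping={uniform_clip:.4f} rounding={uniform_round:.4f}")
print(f"power-of-two: clipping={pow2_clip:.4f} rounding={pow2_round:.4f}")
```

On this synthetic data the uniform grid is dominated by clipping and the power-of-two grid by rounding, which mirrors the qualitative argument of this section.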
We leverage the superposition of multiple power-of-two quantized numbers to represent the data, rather than rounding the data to the closest power of two. As in the example of Fig. 1(b), a weight value that is not itself a power of two can be represented by the superposition of two or three power-of-two quantized numbers, which eliminates the Rounding Error.
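A minimal sketch of the superposition idea follows. It uses a plain greedy decomposition, which is an assumption made for illustration and is not the paper's Algorithm 1; the function name, `max_terms`, and `min_exp` are hypothetical.

```python
import math


def superpose_powers_of_two(value: float, max_terms: int, min_exp: int = -8):
    """Greedily approximate a positive value by a sum of distinct powers of two.

    Returns the chosen exponents (largest first) and the approximated value.
    """
    exponents = []
    residual = value
    for _ in range(max_terms):
        if residual < 2.0 ** min_exp:
            break  # what is left is below the finest representable term
        exp = math.floor(math.log2(residual))
        if 2.0 ** exp > residual:  # guard against floating-point round-up
            exp -= 1
        exponents.append(exp)
        residual -= 2.0 ** exp
    return exponents, sum(2.0 ** e for e in exponents)


# One term truncates 0.75 to 0.5; two superposed terms represent it exactly.
print(superpose_powers_of_two(0.75, max_terms=1))   # ([-1], 0.5)
print(superpose_powers_of_two(0.75, max_terms=2))   # ([-1, -2], 0.75)
```

Because the greedy choice always takes the largest power of two not exceeding the residual, a single term rounds down rather than to the nearest power of two; an actual quantizer would also consider rounding up.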
However, the weight distribution differs across the layers of a model wang2019haq. Thus, it is critical to choose the proper number of superpositions to make the quantized value “close” to the original value. More importantly, AUSN (as shown in Fig. 3) can adaptively allocate the composition of the bitwidth according to the weight distribution of each layer. To combine the high average accuracy and finer Representing Resolution of uniform quantization with the broad Representing Range and hardware acceleration of the power-of-two method, we subdivide the limited bitwidth and find the balance between Clipping Error and Rounding Error.
4 AUSN Quantization
4.1 Decoder-free coding scheme for superposition
We divide the bitwidth into three parts: the “sign”, the “basic part”, and the “subdivision part” (footnote 2: the following solutions discuss the data part, i.e., the unsigned number). In Fig. 2(a), taking 6 bits as an example, the bitwidth of general quantization consists of one sign bit and five data bits, while the bitwidth of power-of-two quantization is composed of two sign bits and four data bits. AUSN separates the original five data bits into a three-bit “basic” part and a two-bit “subdivision” part. By sharing the same sign bit of the value, the superposition of the “basic” part and the “subdivision” part saves one bit. Given the total bitwidth, the proposed AUSN coding scheme allocates the bits to the “basic” and “subdivision” parts to ensure a good trade-off between the resolution and the range of the weight distribution, according to the adaptive adjustment described later.
4.2 Quantization scheme with superposition
Note that the value stored in the quantized bits is not the quantized number itself, but its power. If a given number of bits is used to indicate the basic part, we obtain an intuitive power basis whose powers are all negative. Thus, we can save them as unsigned numbers and save one “sign” bit. In practice, however, the range of this power basis may not match that of the weights. Thus, we introduce a scaling operation called “PreConvert” to bring the Representing Range of the quantized model closer to the weight distribution of the original model and to reduce the Clipping Error. Since the values stored in the bits are powers, the scaling operation on the original values becomes a shift operation on the powers.
The specific process of AUSN quantization is performed as follows:
First, we calculate the power indicating the largest weight of the layer :
(1) 
Then, we perform the “PreConvert" operation on :
(2) 
We denote the power of basic part after “PreConvert” operation , as , and it satisfies Equation (3), which means we use to approximate the range of .
(3) 
Furthermore, for the th superposition that have bits respectively, we also have the corresponding superposition power . Then, we use Algorithm 1 to derive the superposition of each quantized number .
In the procedure of AUSN quantization with superposition defined (as shown in Fig. 2(b)), a proximate representation of can be expressed as
(4) 
Our AUSN algorithm can naturally round the weight to the quantized number and decide whether to superpose. We can also apply Algorithm 1 to the activations.
4.3 Adaptively adjusting the coding scheme
Beyond rounding the weights and deciding whether to superpose, AUSN can also adaptively adjust the internal allocation of the bitwidth between the basic part and the subdivision part according to the Clipping Error and the Rounding Error. For example, in 5-bit AUSN quantization, the weight distribution of the CONV1 layer leads to 4 bits for the basic part and 1 bit for the subdivision part, while the CONV2 layer uses 3 bits for the basic part and 2 bits for the subdivision part. Suppose that the initial distribution of the weights in a layer is normal. Without loss of generality, we suppose that the weights follow the standard normal distribution and are clipped at a fixed bound. Since the quantized values are usually distributed symmetrically about 0, we only consider the positive part. Suppose that AUSN quantization has a given Representing Range and a given set of quantized numbers; we can then use the following equation to express the Clipping Error and the Rounding Error:
(5) 
First, AUSN initializes the coding scheme according to the given bitwidth and quantizes the weights. Then, the Clipping Error and the Rounding Error between the quantized weights and the original weights are calculated according to Equation (5). Last, based on these two errors, AUSN adjusts the boundary of the Representing Range and the set of quantized values (that is, the allocation of the bitwidth). This process is iterated until an optimal allocation of the bitwidth is found.
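The adjustment loop can be sketched as a search over bit allocations. Everything below is an illustrative assumption rather than the paper's procedure: the grid model (each leading power of two refined by `2**sub_bits` evenly spaced fractions), the exhaustive search in place of the iterative adjustment, the mean-squared error measure in place of Equation (5), and all function and parameter names.

```python
import numpy as np


def build_grid(basic_bits, sub_bits, top_exp):
    """Assumed grid: 2**basic_bits leading powers, each refined by sub_bits."""
    leading = 2.0 ** (top_exp - np.arange(2 ** basic_bits))
    fraction = 1.0 + np.arange(2 ** sub_bits) / 2 ** sub_bits
    return np.sort(np.outer(leading, fraction).ravel())


def total_error(magnitudes, grid):
    """Sum of clipping MSE and rounding MSE for non-negative magnitudes."""
    clipped = np.minimum(magnitudes, grid[-1])
    nearest = grid[np.abs(clipped[:, None] - grid[None, :]).argmin(axis=1)]
    clipping = np.mean((magnitudes - clipped) ** 2)
    rounding = np.mean((clipped - nearest) ** 2)
    return float(clipping + rounding)


def best_allocation(weights, data_bits, exp_search=3):
    """Try every basic/subdivision split and a few range boundaries."""
    magnitudes = np.abs(np.asarray(weights, dtype=float))
    max_exp = int(np.ceil(np.log2(magnitudes.max())))  # assumes a nonzero layer
    best = None
    for basic_bits in range(1, data_bits + 1):
        sub_bits = data_bits - basic_bits
        # Lowering top_exp narrows the range (more clipping, finer small values).
        for top_exp in range(max_exp - exp_search, max_exp + 1):
            err = total_error(magnitudes, build_grid(basic_bits, sub_bits, top_exp))
            if best is None or err < best[0]:
                best = (err, basic_bits, sub_bits, top_exp)
    return best


rng = np.random.default_rng(0)
layer_weights = rng.normal(0.0, 0.05, size=2_000)   # synthetic layer
error, basic_bits, sub_bits, top_exp = best_allocation(layer_weights, data_bits=4)
print(f"basic={basic_bits} subdivision={sub_bits} top_exp={top_exp} error={error:.2e}")
```

The point of the sketch is only that different weight distributions can favor different splits of the same total bitwidth, as in the CONV1 and CONV2 example above.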
4.4 AUSN rounding scheme
Although AUSN quantization effectively solves the accuracy reduction of power-of-two quantization, two critical issues remain for almost all quantization methods:

Overflow of Bitwidth: To maintain precision, the output of the convolution operation (multiply and accumulate) occupies a larger bitwidth than the quantized numbers. For example, the convolution of two matrices consisting entirely of 8-bit numbers first produces four 16-bit products, which are then accumulated into a 32-bit sum.

Overhead of Requantization: In deep neural networks, the output of the previous layer is the input of the next layer. The outputs are highly likely to exceed the range of the quantized numbers accepted by the next layer. Thus, existing methods have to requantize the output values, and such requantization incurs a significant cost in computation and resources.
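The bitwidth growth behind the overflow issue can be checked with plain integer arithmetic. The sketch below assumes signed two's-complement INT8 operands; the helper name is hypothetical.

```python
def signed_bits_needed(value: int) -> int:
    """Smallest two's-complement width that can hold the given integer."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


INT8_MIN, INT8_MAX = -128, 127
assert signed_bits_needed(INT8_MIN) == 8 and signed_bits_needed(INT8_MAX) == 8

# The largest-magnitude product of two INT8 values no longer fits in 8 bits.
largest_product = INT8_MIN * INT8_MIN
print(signed_bits_needed(largest_product))        # 16

# Accumulating only four such products already exceeds 16 bits, which is
# why accumulators are typically provisioned with 32 bits.
print(signed_bits_needed(4 * largest_product))    # 18
```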
We formulate the above problems as follows. Suppose that, because of the bitwidth limit, the next layer only accepts activation values with a double superposition, whereas multiplication with AUSN increases the number of superposed terms. We then have to eliminate some power-of-two terms and round the activation value up or down with a minimized rounding error. Consequently, after the multiplication operation, we can replace the accumulation operation with the rounding scheme, which eliminates the overflow and requantization issues.
The basic principle of rounding operation is simple enough and to minimize the rounding error. Given the data in the form of a polynomial of poweroftwo, we need to determine the power after rounding. Principle: , , so that or .
The rounding scheme can be categorized into the following four scenarios:
Scenario 1:, wherein is the bitwidth of the subdivision part indicating the number of superposition.
Scenario 2:When the condition is satisfied, it’s obvious that .
Scenario 3:If , then .
Scenario 4:If , then or .
Based on the above scenarios, we round the data using the following steps.

Find the maximum terms in the data to which Scenario 1 applies.

Find pairs of terms to which Scenario 2 applies.

Apply Scenario 3 if its condition is met.

If Equation (6) holds, carry out Scenario 4.
(6) 
Following the above steps, we can eliminate the number of poweroftwo to satisfy and minimize the rounding error. Take the data as example, , that is the quantized number is single superposition of poweroftwo. Step1: merge the and get the ; Step 2:merge the and get the ; Then, the condition of step 3 is not met, and thus step 4 is performed: merge the and get the ; Finally, we find the result satisfy the requirement of quantization.
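The general idea of reducing a polynomial of powers of two to a bounded number of terms can be sketched as follows. This is a simplified stand-in and not the four-scenario procedure above: it merges equal exponents exactly (using the identity 2^a + 2^a = 2^(a+1)) and then picks the nearer of two candidates, one rounded down and one rounded up. All names are hypothetical.

```python
from collections import Counter


def merge_equal_exponents(exponents):
    """Exactly merge 2**a + 2**a into 2**(a + 1) until all exponents differ."""
    pending = Counter(exponents)
    distinct = []
    while pending:
        exp = min(pending)
        count = pending.pop(exp)
        if count % 2:
            distinct.append(exp)               # one copy of 2**exp remains
        if count // 2:
            pending[exp + 1] += count // 2     # pairs carry to the next power
    return sorted(distinct, reverse=True)


def value_of(exponents):
    return sum(2.0 ** e for e in exponents)


def round_to_terms(exponents, max_terms):
    """Keep at most max_terms powers of two, rounding to the nearer candidate."""
    terms = merge_equal_exponents(exponents)
    if len(terms) <= max_terms:
        return terms
    kept = terms[:max_terms]
    down = kept                                        # drop the small terms
    up = merge_equal_exponents(kept + [kept[-1]])      # add one unit of the last kept term
    target = value_of(terms)
    if target - value_of(down) < value_of(up) - target:
        return down
    return up


print(merge_equal_exponents([-2, -2, -1]))        # [0]: 0.25 + 0.25 + 0.5 = 1
print(round_to_terms([0, -1, -2], max_terms=1))   # [1]: 1.75 rounds up to 2
print(round_to_terms([0, -3], max_terms=1))       # [0]: 1.125 rounds down to 1
```

Because the result is again a list of exponents, it can be fed to the next layer without a separate requantization pass, which is the property the rounding scheme aims for.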
5 Experiments
To evaluate the performance of our algorithm, we quantize several models and verify their accuracy on different datasets. Using CIFAR-10 krizhevsky2009learning and ImageNet deng2009imagenet, we compare AUSN with several state-of-the-art quantization methods on CNNs. We implement AUSN quantization using the CNN model structures and pretrained models in the PyTorch library (footnote 3: https://github.com/pytorch/vision/tree/master/torchvision/models). In addition, we verify sequential networks and a network used for the object detection task on the Google Speech Commands dataset, the VoxCeleb dataset, and the COCO dataset. We compare our results with full-precision baseline models to demonstrate the broad applicability of our approach.
Remarkably, we quantize the pretrained models without finetuning or retraining.
5.1 Result of quantized networks on Cifar10
From the data in Table 1, we can see that both the self-adapting shifting and the better resolution brought by the subdivision part improve our results. For the 2-bit result, both schemes quantize weights into powers of two, but AUSN finds a better covering range and thus obtains a better result. For 3-, 4-, and 5-bit quantization, AUSN keeps the basic part at 3 bits and sets the subdivision part to 0, 1, and 2 bits, respectively. Compared with INQ, the accuracy gain of AUSN for each added bit is much larger. This means that once the Representing Range is guaranteed (by the bitwidth of the basic part in AUSN, or by the whole bitwidth in INQ), it is the resolution (provided by the bitwidth of the subdivision part in AUSN) that helps the accuracy grow.
In particular, we also quantize a pruned ResNet18 model to show that AUSN cooperates well with pruning.
Top1 accuracy (%). AUSN: AUSN quantization (Ours); INQ: power-of-two (INQ zhou2017incremental).
Model  Baseline  AUSN 5bit  AUSN 4bit  AUSN 3bit  AUSN 2bit  INQ 5bit  INQ 4bit  INQ 3bit  INQ 2bit  
AlexNet  85.13  85.21  85.00  84.80  83.21  81.90  81.89  80.29  80.27  
GoogleNet  90.89  90.85  90.01  89.37  89.11  88.28  88.28  88.27  88.21  
VGG16  93.92  93.95  93.58  93.68  92.75  91.88  91.84  91.27  91.26  
ResNet18  92.22  92.25  91.83  91.80  91.46  90.47  90.31  90.14  90.10  
ResNet18(pruned)  94.17  94.11  94.09  93.52  92.65  92.87  92.85  92.83  88.15 
5.2 Result of quantized networks on ImageNet
In the experiments on ImageNet, we choose the widely used models ResNet18 and ResNet50 and compare our results with other quantization schemes. Table 3 shows that the Top-1 accuracy loss of our 5-bit AUSN is less than 0.1%, far lower than that of the other schemes.
Method  SR+DR  INT  RQ ST  PACT  QIL  Ours 
Bitwidth  5/5  5/5  5/5  5/5  5/5  5/5 
Acc. loss  10.5  2.5  1.6  0.3  0.8  0.3 
Moreover, we quantize the weights of pretrained models without fine-tuning to obtain the results in Table 3, which significantly reduces the time spent on quantization. In Table 2, we quantize both weights and activations to optimize the inference process. With these advantages, AUSN achieves state-of-the-art results among existing quantization schemes.
5.3 Result of quantizing sequential models and Yolo
We explore the effect of AUSN quantization on sequential networks such as GRU, CRNN zhang2017hello, and TDNN snyder2018x on the Google Speech Commands dataset and the VoxCeleb dataset, and on the object detection task with YOLOv3-tiny on the COCO dataset, as shown in Table 3. Quantizing the weights from 32 bits to 4 or 5 bits causes only a small decrease in accuracy (i.e., mAP for YOLOv3-tiny on COCO and Top-1 accuracy for the sequential models).
ImageNet Dataset  
Method  bitwidth  Top1(%)  Drop in Top1(%)  Method  bitwidth  Top1(%)  Drop in Top1(%) 
ResNet50* (baseline: 76.13%)  ResNet18* (baseline: 69.76%)  
Dual choukroun2019low  4bit  70.20  5.93  BWN rastegari2016xnor  2bit  60.8  8.5 
INQ zhou2017incremental  5bit  74.81  1.59  Dual choukroun2019low  4bit  66.63  3.13 
Focused compression zhao2019focused  5bit  74.86  1.54  LAPQ nahshan2019loss  4bit  62.6  7.16 
ADMM Quantization ye2019progressive  6bit  75.93  0.2  Focused compression zhao2019focused  5bit  68.36  1.40 
SYMM nayak2019bit  6bit  72.58  3.55  UNIQ baskin2018uniq  5bit  68.00  1.76 
BiscaledFxP jain2019biscaled  6bit  70.46  5.67  DFQ nagel2019data  6bit  66.3  3.4 
VQ park2018value  7bit  75.89  0.24  INT8 jacob2018quantization  8bit  67.3  2.4 
INT8 jacob2018quantization  8bit  74.9  1.5  RQ louizos2018relaxed  6bit  68.6  1.16 
Ours  4bit  75.37  0.76  Ours  4bit  68.84  0.92 
Ours  5bit  76.09  0.04  Ours  5bit  69.67  0.09 
Speech Command Dataset  
Method  bitwidth  Top1(%)  Drop in Top1(%)  Method  bitwidth  Top1(%)  Drop in Top1(%) 
GRU (Network cfg†: S = 40, N = 154)  GRU (Network cfg†: S = 20, N = 400)  
baseline  32bit  93.62    baseline  32bit  94.62   
Ours  4bit  93.15  0.47  Ours  4bit  94.34  0.28 
Ours  5bit  93.47  0.15  Ours  5bit  94.57  0.05 
Speech Command Dataset  
Method  bitwidth  Top1(%)  Drop in Top1(%)  Method  bitwidth  Top1(%)  Drop in Top1(%) 
CRNN‡  CRNN‡  
baseline  32bit  93.50    baseline  32bit  94.56   
Ours  4bit  93.17  0.33  Ours  4bit  94.28  0.28 
Ours  5bit  93.38  0.12  Ours  5bit  94.48  0.08 
VoxCeleb Dataset  COCO datasets  
Method  bitwidth  Top1(%)  Drop in Top1(%)  Method  bitwidth  mAP  Drop in Acc. 
TDNN  YOLOv3tiny  
baseline  32bit  80.38    baseline  32bit  33.1   
Ours  4bit  79.98  0.40  Ours  4bit  31.6  1.5 
Ours  5bit  80.25  0.13  Ours  5bit  32.2  0.9 

† S: Frame Stride, N: Cells of GRU

‡ The Network cfg of the first CRNN is: C(48,10,4,2,2)N(60)N(60)F(84); the Network cfg of the second CRNN is: C(100,10,4,2,1)N(136)N(136)F(188), where C: Shape of Convolutional layer, N: Cells of GRU, F: Cells of fully connected layer
5.4 Result of quantized networks deployed on FPGA
Resources 

Shifter&Adder  Performance  
need decoder  Ours  
^{1}  
LUT  212388  187262  181248  225280  212942  203712  133120  112071  108544  2.0×  
FF  192293  143142  108729  86317  512731  45729  54313  45127  44032  4.4×  
Energy Consumption  4.21W  3.75W  3.67W  4.51W  4.26W  4.07W  2.65W  2.24W  2.17W  1.9× 

A/B refers to an A-bit input multiplied by B-bit weights.
AUSN does not need a decoder; a decoder is necessary only when control information is encoded into the bit word. AUSN directly “multiplies” the input by the two exponents composing the weight in parallel using shift operations, and the results are added up (superposition) to form the output. AUSN is thus not only as easy to compute as uniform quantization (like INT8), but also hardware-friendly, because the expensive multiplication is replaced by simple bit-shift and add operations. Compared with the traditional 8-bit multiply-accumulator, the shifter can be realized using only logic units such as LUTs. As shown in Table 4, synthesis with the Xilinx Vivado HLS suite shows that a model quantized by AUSN not only requires no DSPs, but also reduces LUTs by 2.0×, FFs by 4.4×, and power by 1.9×, which dramatically reduces the hardware cost (the evaluation platform is the Xilinx ZCU104).
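The shift-and-add datapath described above can be mimicked in software. The sketch assumes an integer input and non-negative exponents so that plain left shifts suffice; a fixed-point design would use right shifts for negative exponents. The function name and operand values are hypothetical.

```python
def shift_add_multiply(x: int, exponents) -> int:
    """Compute x * sum(2**e for e in exponents) using only shifts and adds."""
    return sum(x << e for e in exponents)


# A weight stored as the superposition 2**3 + 2**1 = 10.
weight_exponents = [3, 1]
x = 5

# Two shifts (which hardware could run in parallel) and one addition
# replace the multiplication 5 * 10.
assert shift_add_multiply(x, weight_exponents) == x * 10
```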
6 Conclusion
In this paper, we introduced a novel quantization method called AUSN, which combines the broad Representing Range of power-of-two quantization with the constant Representing Resolution of uniform quantization. This method can quantize both the activations and the weights of a pretrained model to a low bitwidth (less than 8 bits) with almost no accuracy loss. AUSN shows excellent generality because it can adaptively adjust the number of superpositions and the coding scheme for a fixed bitwidth. We further design a rounding scheme in AUSN to eliminate the overhead of requantization. Thorough experiments show the superiority of AUSN quantization over state-of-the-art methods. Unlike other quantization methods, AUSN quantizes the pretrained model without retraining. Notably, AUSN quantization is hardware-friendly: the synthesis results on FPGA (see Appendix B for details) show that AUSN can effectively reduce resource and energy consumption. We also measure the effectiveness of quantization methods by information loss from a theoretical perspective and analyze the relationship between quantization methods and hardware performance (as in Appendix A).
Broader Impact
The high accuracy of DNNs is achieved by consuming a lot of computing and storage resources, which significantly hinders the application of DNNs on edge devices. Quantization has been proposed and is widely used to reduce storage and computation. AUSN quantization lowers the precision demands of operations and operands, so fewer bits are needed to represent them, which reduces the cost of the datapath and memory. Furthermore, it can also reduce the energy consumption and hardware resources and expand the accelerator design space of many terminal devices with limited hardware resources. It is worth noting that AUSN quantization is a general quantization method that can be used in many tasks, such as image recognition, speech recognition, and object detection.
Appendix A Theoretical analysis
We explore the choice of quantization in the context of information theory, which enables us to reinterpret quantization as a Minimum Description Length (MDL) problem hinton1993keeping; ullrich2017soft. Both MDL and quantization can be regarded as search problems that aim to find the optimal way to compress data while balancing the accuracy and the complexity (including bitwidth, quantization method, etc.) of the model. We define a loss function to measure the loss of information between the distribution of the original weights and the distribution of the quantized weights; it mainly comprises an error term and a complexity term. Specifically:
(7) 
Here, the model and the quantized model are trained on the same dataset. The error term indicates the error caused by the misfit between the model and the quantized model on this dataset. Since the dataset consists of inputs and expected outputs, the error term can be further expressed through the probability of achieving the correct output given the input and the model. The complexity term indicates the complexity of the quantization method.
According to information theory, the complexity term measures the average number of extra bits required to encode the original weights with the quantized distribution, i.e., the distance between the two distributions. We then consider the quantized distribution as an approximation of the distribution of the original weights. Thus, the complexity term represents the KL divergence, which measures the similarity between the two distributions. To compute it, we first divide the original distribution according to the set of points of the quantized distribution, and then calculate the probability of differences between them:
(8)  
where is the distribution of weight in real number field. is the approximate count of weights according to at . is the set of quantized numbers.
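For readers unfamiliar with the KL divergence used here, the following sketch computes it in bits for two discrete distributions. It illustrates the standard definition only and is not a reimplementation of Equation (8); the smoothing constant `eps` and the example distributions are assumptions for illustration.

```python
import numpy as np


def kl_divergence_bits(p, q, eps=1e-12):
    """KL(P || Q) in bits: average extra bits when coding P with a code built for Q."""
    p = np.asarray(p, dtype=float) + eps   # eps avoids log(0) for empty bins
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log2(p / q)))


reference = [0.5, 0.5]        # hypothetical mass of the original weights per bin
mismatched = [0.25, 0.75]     # hypothetical mass implied by a poor quantizer

print(kl_divergence_bits(reference, reference))    # 0.0: identical distributions
print(kl_divergence_bits(reference, mismatched))   # about 0.21 bits of overhead
```

A smaller divergence means the quantized distribution is a closer approximation of the original one, which is the sense in which the complexity term above is interpreted.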

Quant 




5bit  Ours  0.014  0.001  0.013  
INQ  0.004  3.23  3.234  
4bit  Ours  0.062  0.13  0.192  
INQ  0.004  3.24  3.244  
3bit  Ours  0.261  0.15  0.411  
INQ  0.004  4.84  4.844  
The smaller the information loss, the smaller the accuracy loss induced by quantization. In Table 5, the figure in brackets gives the rank in ascending order of information loss, and the red label marks the best quantization effect.
Appendix B Efficiency Analysis
B.1 Analysis on the Evolution
Most existing quantization methods focus on reducing the bitwidth and improving accuracy. However, these approaches ignore the limitations of hardware. As the bitwidth of the quantized model decreases, the reductions in resource and energy consumption manifest themselves differently across accelerator architectures and arithmetic logic. Fig. 4 describes the evaluation method of the proposed AUSN quantization. Generally speaking, multiplication is more expensive than addition and shift operations in terms of the complexity of the digital circuit, the number of transistors, and the power consumption. For example, on an FPGA, the shift operation implemented with Look-Up Tables (LUTs) is more energy-efficient (footnote 4: https://china.xilinx.com/support/documentation/sw_manuals/xilinx14_7/ug440xilinxpowerestimator.pdf) than the multiplication and accumulation operations conducted with Digital Signal Processors (DSPs). The DSPs on an FPGA are also limited, and the difference between DSP resources and LUT resources is more than two orders of magnitude.
Power-of-two quantization transforms the multiplication operation into a shift operation, which makes the quantized network hardware-friendly and balances the cost of hardware resources. For example, when a quantized DNN (with weights quantized by power-of-two quantization and inputs quantized to fixed-point numbers) is deployed on an FPGA, the multiplication can be converted to a shift operation on the fixed-point number. Although the shift operation alleviates the shortage of DSP resources to a certain extent, the accumulation operation still depends on DSPs and requires additional hardware overhead for requantization. LUT resources may become a new bottleneck because of the bitwidth of the input. Moreover, because of the significant rounding error caused by power-of-two quantization, the accuracy inevitably decreases, so power-of-two quantization cannot be pushed further.
To further reduce the bitwidth and computing resources, both the weights and the activations can be quantized by power-of-two quantization, which further converts the shift operation into an addition operation. However, this leads to an even larger accuracy decrease. In contrast, AUSN quantization not only reduces the accuracy loss but also makes reasonable use of LUT resources.
We can quantize weights and activations using the proposed AUSN quantization and transform requantization into AUSN rounding. We can eliminate the intermediate values (i.e., the partial sums generated in the calculation of convolutional layers) stored in the Block RAMs (BRAMs) by handling them through logic gates. This also assures the final accuracy and makes full use of LUT resources. In theory, we can even deploy a DNN quantized by AUSN on an FPGA without DSP resources, or design an accelerator that realizes parallel computing of the convolution layers with both LUTs and DSPs.
B.2 Analysis on LUT Resource
The LUT resources consumed by AUSN quantization are far fewer than those consumed by the shift operation on fixed-point numbers as the bitwidth of the inputs is further reduced. The specific analysis is as follows (taking 6 bits for both weights and activations as an example).
To make good use of the resources, we implement the addition or shift operation with 6-input LUTs (as in modern Xilinx FPGAs, such as the entire 6 series and 7 series). If the inputs and weights are quantized to fixed-point numbers and powers of two, respectively, then the multiplication of an input and a weight is implemented by a shift operation, and the bitwidth of the result is 12 bits (6 bits × 6 bits → 12 bits).
In contrast, when we quantize the inputs and weights by AUSN quantization, the shift operation on the value is converted into an addition operation on the power. We divide the addition operation into 3 additions, each consuming 4 LUTs. In this way, the result only needs 7 bits (6 bits + 6 bits → 7 bits). Fig. 5 describes the consumption of LUT resources.
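The two result widths quoted above follow directly from the largest unsigned 6-bit operands, as the short check below shows. It also illustrates why multiplying two power-of-two numbers reduces to adding their exponents. The helper name is hypothetical.

```python
def unsigned_result_bits(bits_a: int, bits_b: int, op: str) -> int:
    """Bits needed for the largest result of an unsigned multiply or add."""
    max_a = (1 << bits_a) - 1
    max_b = (1 << bits_b) - 1
    result = max_a * max_b if op == "mul" else max_a + max_b
    return result.bit_length()


print(unsigned_result_bits(6, 6, "mul"))   # 12: 63 * 63 = 3969 needs 12 bits
print(unsigned_result_bits(6, 6, "add"))   # 7:  63 + 63 = 126 needs 7 bits

# Multiplying two powers of two only requires adding their exponents.
a_exp, b_exp = 3, 5
assert (2 ** a_exp) * (2 ** b_exp) == 2 ** (a_exp + b_exp)
```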
As Fig. 5 shows, the 6-bit multiplication implemented by the shift operation on the value (Fig. 5(a)) consumes more 6-input LUTs than the addition operation on the power (Fig. 5(b)); that is, AUSN quantization consumes fewer LUT resources than the shift operation on fixed-point numbers. Remarkably, the model quantized by AUSN suffers almost no accuracy loss.
B.3 Analysis on Implementation
We refer to the classic performance evaluation model, the Roofline model williams2009roofline, to analyze the quantization methods and the performance gains of the quantized model on the machine. Quantization reduces the precision of operations and operands, so fewer bits are needed to represent them, which significantly increases the Computation to Communication Ratio defined in Equation (9).
(9)  
The Roofline model gives the upper bound of the theoretical performance that the model and algorithm can achieve under the limits of the computing power and bandwidth of the accelerator. In Fig. 6, the x-axis represents the Computation to Communication Ratio, or operational intensity: the higher it is, the higher the memory efficiency. The y-axis represents the computing performance of the accelerator. A boundary value of the ratio separates two regions: to its left, performance is limited by the bandwidth of the machine, and to its right, by the computing power of the machine. The upper limits of bandwidth and computation are determined by the accelerator and are called the bandwidth roof and the computational roof. The optimization space provided by the machine is the area on the right side of the bandwidth roof. If a design point lies above the bandwidth roof or the computational roof, its actual performance can only be its projection onto that roof.
As the model is quantized, the Computation to Communication Ratio becomes larger, and the position of the model moves to the area to the right of the boundary, changing from being limited by the bandwidth of the accelerator to being limited by its computing power (the transfer between the two marked points in Fig. 6). Even if the bitwidth of the weights is reduced further, the performance cannot improve further because of the computational roof. For an FPGA, the computing cores are the DSPs. A model quantized by AUSN converts the multiplication operation to shift or even addition operations. These operations can be realized by LUTs, which is equivalent to adding a computing core to the FPGA: it increases the theoretical computation peak of the FPGA (the computational roof moves up in Fig. 6) and provides more design space for further optimization.
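The Roofline bound discussed above has a simple closed form: attainable performance is the smaller of the computational roof and the bandwidth roof at the given operational intensity. The sketch below uses this standard formulation with hypothetical hardware numbers; it does not reproduce Equation (9) or the values of Fig. 6.

```python
def attainable_performance(intensity: float, peak_compute: float, bandwidth: float) -> float:
    """Roofline bound: min(computational roof, bandwidth * operational intensity)."""
    return min(peak_compute, bandwidth * intensity)


# Hypothetical accelerator: 100 GOP/s peak compute, 10 GB/s memory bandwidth.
peak, bw = 100.0, 10.0
ridge = peak / bw    # intensity at which the two roofs meet

before = attainable_performance(2.0, peak, bw)    # bandwidth-bound: 20.0
after = attainable_performance(20.0, peak, bw)    # compute-bound: 100.0
print(ridge, before, after)

# Raising the computational roof (e.g., extra LUT-based compute) only helps
# once the design is compute-bound.
print(attainable_performance(20.0, 2 * peak, bw))   # 200.0
print(attainable_performance(2.0, 2 * peak, bw))    # still 20.0
```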
Other low-bitwidth quantization methods (e.g., below 5 bits) can also achieve a high compression ratio, but, unlike AUSN, they may require an additional decoding process. Decoding consumes a lot of hardware resources and energy, which is hardware-unfriendly. As shown in Table 4, the 4-bit shift operation with indexes requires additional decoders to be deployed on the hardware, resulting in logic resources and power that even exceed the overhead of 8-bit multiplication. Therefore, lowering the bitwidth does not necessarily increase the actual benefits.