RMSMP: A Novel Deep Neural Network Quantization Framework with Row-wise Mixed Schemes and Multiple Precisions

This work proposes a novel Deep Neural Network (DNN) quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach. Specifically, this is the first effort to assign mixed quantization schemes and multiple precisions within layers – among rows of the DNN weight matrix – for simplified operations in hardware inference, while preserving accuracy. Furthermore, this paper makes a different observation from the prior work: the quantization error does not necessarily exhibit layer-wise sensitivity, and can actually be mitigated as long as a certain portion of the weights in every layer are in higher precisions. This observation enables layer-wise uniformality in the hardware implementation towards guaranteed inference acceleration, while still enjoying row-wise flexibility of mixed schemes and multiple precisions to boost accuracy. The candidate schemes and precisions are derived practically and effectively with a highly hardware-informative strategy to reduce the problem search space. With the offline-determined ratio of different quantization schemes and precisions for all the layers, the RMSMP quantization algorithm uses a Hessian- and variance-based method to effectively assign schemes and precisions to each row. The proposed RMSMP is tested on image classification and natural language processing (BERT) applications and achieves the best accuracy performance among state-of-the-art methods under the same equivalent precisions. RMSMP is implemented on FPGA devices, achieving 3.65x speedup in the end-to-end inference time for ResNet-18 on ImageNet, compared with the 4-bit Fixed-point baseline.

1 Introduction

With the tremendous success of deep learning and deep neural networks (DNNs), there is an urgent need to deploy inference models onto edge-computing platforms/devices. As a major type of model compression technique, DNN quantization has become an essential method to reduce the computation, memory, and storage requirements of on-device inference, especially for platforms with the capability of customized architecture design, such as FPGA devices and ASIC chips. Generally speaking, DNN quantization learns DNN models in low bit-width representation with accuracy performance close to that of the full-precision models, while accelerating inference speed.

Figure 1: The proposed DNN quantization framework with row-wise mixed schemes and multiple precisions, which assigns a quantization scheme and precision to each filter of the weight tensor (or each row of the weight matrix). It features (i) layer-wise uniformality to fulfill the requirement of practical hardware implementation, (ii) row-wise flexibility for mixed schemes and multiple precisions, (iii) hardware-informative selection of candidate schemes and precisions (bit-widths) to significantly reduce the algorithm search space, and (iv) superior accuracy performance among the state-of-the-arts.

Various quantization schemes have been investigated, including binary [5, 6, 26, 22], ternary [20, 15, 39], Fixed-point (Fixed) [38, 4, 12, 16, 3, 11], Power-of-Two (PoT) [24, 37, 19, 36], Additive Power-of-Two (APoT) [21], etc. These schemes have diverse accuracy and hardware performance. Binary and ternary quantization significantly reduce computation by eliminating multiplication operations, but in general experience relatively large accuracy loss. Low bit-width Fixed quantization has better accuracy performance. For example, 4-bit Fixed can achieve negligible accuracy loss compared with its 32-bit floating-point counterpart, although it still needs multiplication operations during inference computation.

Different from binary, ternary, and Fixed, PoT is a non-linear quantization scheme: with quantization levels as power-of-two numbers, multiplications can be replaced with bit shifting operations, thereby reducing computation to speed up inference. However, PoT still results in moderate accuracy loss, due to the rigid resolution issue [21]: the levels provide a high resolution around the mean but a low resolution at the tails of the weight distribution, and this cannot be resolved even with a higher bit-width. Therefore, APoT was proposed, representing quantization levels as the sum of multiple power-of-two numbers, overcoming the rigid resolution issue of PoT while still enjoying multiplication-free operations.

While exploring various quantization schemes, people found that the first and the last layers are of importance for preserving accuracy, and therefore in the above-mentioned works, the first and the last layers are either unquantized (in 32-bit floating-point) or quantized with no fewer than 8 bits. Motivated by this, the layer-wise multi-precision quantization has been well investigated in [31, 33, 29, 10, 9, 34, 28], where various methods/algorithms have been used to deal with the large search space of the precision assignment for both weights and activations in each layer.

This work proposes a novel DNN quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach. Figure 1 illustrates the proposed quantization framework on the DNN (a) weight tensor and (b) weight matrix. Specifically, each filter in the weight tensor or each row in the weight matrix is assigned a combination of quantization scheme and precision. The candidate schemes and precisions are selected to facilitate hardware implementation and to speed up inference with the given resources on the target platforms/devices, while preserving accuracy close to that of the unquantized (32-bit floating-point) models. This highly hardware-informative quantization strategy significantly reduces the search space of the DNN quantization problem, making our framework distinctive from existing multi-precision quantization works.

This is the first effort to apply mixed quantization schemes and multiple precisions within layers, targeting simplified operations in hardware inference while preserving accuracy. Specifically, two quantization schemes, i.e., Power-of-Two (PoT) and Fixed-point (Fixed), and two precisions, i.e., 4-bit and 8-bit, are adopted and explored for quantizing weights and activations, to reduce inference computation and preserve accuracy. Furthermore, as demonstrated by previous works that either implicitly use higher precisions for the first and the last layers, or explicitly assign multiple precisions to different layers, it is essential to use higher precisions for at least parts of the model to boost the accuracy of quantized models close to that of the full-precision models. However, different from existing works, this paper makes the observation that this does not necessarily relate to layer-wise sensitivity; instead, as long as a certain portion of the weights in every layer use higher precisions, the quantization error can be mitigated. This observation enables layer-wise uniformality towards practical hardware implementation of quantized models, while still enjoying row-wise flexibility of mixed schemes and multiple precisions. The contributions of our quantization framework are summarized as follows:

  • A novel row-wise mixed-scheme and multiple-precision quantization approach.

  • A highly hardware-informative solution strategy significantly reducing the problem search space.

  • The best accuracy performance under the same equivalent precision as existing works.

  • Significant inference speedup on real devices, compared with network-wise uniform low-bit quantization, i.e., the speed upper bound of the popular layer-wise multi-precision approaches.

2 Related Work

2.1 Quantization Scheme

2.1.1 Linear Quantization Schemes

Linear quantization schemes with uniform quantization levels include binary, ternary, and fixed-point. Both binary and ternary quantization use extremely low precision for DNN models, i.e., binarized (with values of ±1) or ternarized (with values of {−1, 0, +1}) levels. Representative binary quantization methods include BinaryConnect [5], Binarized Neural Network (BNN) [6], XNOR-Net [26], and ABC-Net [22]. With weights constrained to ±1, multiplications can be replaced by additions/subtractions, which can even be eliminated by using XNOR and AND operations if activations are quantized to binary as well. Ternary quantization schemes are implemented in TWN [20], TTQ [39], and [15]. Ternary networks also benefit from multiplication-free operations. Although binary and ternary quantization can significantly reduce computation by simplifying operations, they introduce relatively large accuracy loss: in the above studies, accuracy typically degrades noticeably under binary quantization, and to a lesser extent under ternary.

Fixed-point (Fixed) quantization schemes use more bits than binary and ternary to preserve accuracy, and have been implemented with different methods/algorithms. DoReFa-Net [38] first explored this by introducing a hyperbolic tangent transformation of weights and activations, with scaling factors to minimize quantization error. PACT [4] improved this method by adding a parameterized clipping threshold to activations. DSQ [12] developed differentiable soft quantization, which evolves the training method to gradually approximate the uniform quantizer. QIL [16] parameterized the quantization interval and trained it with the task loss, avoiding access to the original training data. L2Q [3] introduced a data distribution loss during training to minimize quantization error. LSQ [11] proposed a differentiable method to learn the quantizer of each layer jointly with the parameters. Generally, 4-bit fixed-point quantized models have negligible accuracy loss compared with 32-bit floating-point models, although they cannot get rid of expensive multiplications.

2.1.2 Non-Linear Quantization Schemes

Non-linear quantization schemes use non-uniform quantization levels, such as the Power-of-Two (PoT) scheme, where quantization levels are power-of-two numbers. The multiplication of a weight (in PoT) and an input (in Fixed) can be replaced with a bit shifting operation, and therefore PoT can achieve higher speedup than the Fixed scheme for DNN inference. PoT quantization was adopted first in LogQuant [24], and then with model accuracy enhancement techniques in INQ [37], extremely low bit neural networks with ADMM [19], and LQ-Nets [36]. However, PoT suffers from the rigid resolution phenomenon, and cannot attain higher model accuracy even with increased bit-width; in particular, 4-bit PoT quantization results in noticeable accuracy loss.
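To make the rigid resolution issue concrete, the short Python sketch below (an illustrative snippet of our own, not code from the paper) lists the positive 4-bit Fixed and PoT quantization levels; the PoT levels crowd toward zero, leaving the tails of the weight distribution coarsely represented.

```python
# Illustrative sketch (not from the paper): list the positive 4-bit
# quantization levels of the Fixed and PoT schemes to show the
# "rigid resolution" of PoT, whose levels crowd toward zero.

def fixed_levels(bits: int):
    """Uniform (Fixed) positive levels in [0, 1]."""
    n = 2 ** (bits - 1) - 1            # number of positive steps
    return [i / n for i in range(n + 1)]

def pot_levels(bits: int):
    """Power-of-Two positive levels in [0, 1]: 0 plus 2^-e."""
    m = 2 ** (bits - 1)                # number of nonzero positive magnitudes
    return [0.0] + [2.0 ** (-e) for e in range(m - 1, -1, -1)]

if __name__ == "__main__":
    print("4-bit Fixed:", [round(v, 4) for v in fixed_levels(4)])
    print("4-bit PoT:  ", [round(v, 4) for v in pot_levels(4)])
    # Half of the nonzero PoT levels lie at or below 2^-4 = 0.0625,
    # so the tails of the weight distribution are covered only coarsely.
```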

To overcome this limitation, Additive Power-of-Two (APoT) [21] represents quantization levels as the sum of multiple power-of-two numbers to mitigate the accuracy loss. The multiplication of an APoT quantized weight and a Fixed input can then be replaced with bit shifting operations and additions. Furthermore, MSQ [2] leverages a mixture of a variant of APoT and Fixed to maximize the resource utilization on FPGA devices, while in this work, we use mixed schemes of PoT and Fixed, achieving even higher accuracy and further reducing inference computation. All of the above works use single-precision quantization.

2.2 Single-Precision vs Multi-Precision

A common practice in the above-mentioned quantization works is to keep the first and last layers unquantized (in 32-bit floating-point for weights/activations, i.e., W32A32) or quantized with no fewer than 8 bits, with the purpose of preserving accuracy, which was first proposed in [13]. Spurred by that, another line of works [31, 33, 29, 10, 9, 34, 28] explores the layer-wise multi-precision quantization approach to boost the accuracy of quantized models.

To deal with the large search space of layer-wise precision assignment for both weights and activations, different methods/algorithms have been leveraged. HAQ [31] uses reinforcement learning to train an agent that decides the bit-width of each layer. DNAS [33] proposes a multi-precision quantization-aware training technique, where the bit-width search is converted into a network architecture search. Mixed Precision DNNs [29] shows that bit-widths can be determined by learnable parameters, whose gradients are estimated using the Straight Through Estimator (STE). HAWQ [10], HAWQ-V2 [9], and PyHessian [34] use Hessian information to determine the bit-width for each layer; the general idea is to assign more bits to layers that are sensitive to quantization error. Q-BERT [28] also applies Hessian-based multi-precision quantization to the BERT [7] model for the natural language processing application.

There are two frameworks applying multi-precision within layers. RVQuant [25] quantizes a small percentage of weights with a higher precision irregularly within layers, and therefore cannot achieve inference speedup compared with network-wise uniform low-bit quantization. AutoQ [23] trains a reinforcement learning agent to assign precisions down to the kernel level, which is computationally expensive. Neither of them considers mixed schemes.

3 Proposed RMSMP Quantization

This section discusses the proposed RMSMP quantization framework with a highly hardware-informative strategy, by introducing the scheme selection, precision exploration, and the algorithm solution.

3.1 Mixed Schemes for Simplified Operations

The Fixed-point (Fixed) quantization scheme has superior accuracy performance, and Power-of-Two (PoT) is the most computationally efficient quantization scheme (with still acceptable accuracy performance) to speed up inference, since multiplications can be replaced by bit shifting operations. Therefore, this work proposes a novel row-wise mixed-scheme quantization approach with Fixed for preserving accuracy and PoT for reducing inference computation. Specifically, for each row in the weight matrix or each filter in the weight tensor, the weights are quantized into either the Fixed scheme or the PoT scheme. A row-wise scheme assignment instead of a layer-wise scheme assignment is used, because the hardware inference execution is conducted layer by layer on the same pieces of computing resource: a GEMM (general matrix multiply) core for processing Fixed weights and a separate GEMM core for processing PoT weights. (Note that under PoT, activations are still quantized into Fixed to support the bit shifting operation in replacement of multiplication.)
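As an illustration of why PoT rows are cheaper, the following sketch (with assumed integer encodings, not the paper's hardware design) shows that multiplying a Fixed-quantized activation by a PoT weight reduces to a shift, whereas a Fixed weight still requires a true multiplier.

```python
# Illustrative sketch with assumed integer encodings (not the paper's
# hardware design): a PoT weight 2^(-e) turns the multiply into a shift
# of the activation's fixed-point integer code, while a Fixed weight
# still needs a true multiplier (a DSP slice on FPGA).

def fixed_times_pot(act_code: int, pot_exp: int) -> int:
    """Multiply a Fixed-quantized activation code by a PoT weight 2^(-pot_exp)."""
    return act_code >> pot_exp          # arithmetic shift replaces the multiplier

def fixed_times_fixed(act_code: int, w_code: int) -> int:
    """Multiply two fixed-point integer codes; needs a real multiplier."""
    return act_code * w_code

print(fixed_times_pot(96, 3))           # 96 * 2^-3 = 12, just a shift
print(fixed_times_fixed(96, 5))         # 480, requires a multiplication
```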

The ratio of the two schemes is fixed based on the hardware characteristics, and is the same for different layers to keep the layer-wise uniformality in the hardware design, such that inference speedup can be obtained in the layer-by-layer inference execution. In ASIC chips, it is desirable to maximize the portion of PoT quantized rows to the extent that it does not result in accuracy loss. FPGA devices contain heterogeneous computing resources, i.e., DSPs to implement the Fixed GEMM core and LUTs to implement the PoT GEMM core. Therefore, the ratio of the two schemes can be determined offline such that the LUT utilization is high (while keeping 100% DSP utilization) to speed up inference, without hurting accuracy.

In the training algorithm of our RMSMP quantization, the row-wise scheme assignment is based on the weight distribution of the particular row: if the weight distribution has a smaller variance, the PoT scheme is preferred; otherwise, the Fixed scheme is preferred. A threshold on the variance is set according to the offline-determined ratio of the two schemes.
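A minimal PyTorch sketch of this variance-based row-wise split is given below; the function name, the use of the sample variance, and the thresholding details are our own illustrative choices, with the PoT ratio assumed to be determined offline as described above.

```python
# A minimal sketch (illustrative names/choices) of the variance-based
# row-wise scheme assignment: low-variance rows -> PoT, the rest -> Fixed.
import torch

def assign_schemes_by_variance(weight: torch.Tensor, pot_ratio: float) -> torch.Tensor:
    """weight: [out_channels, ...]; each row/filter gets one scheme.
    pot_ratio: offline-determined fraction of rows to quantize with PoT.
    Returns a boolean mask, True for PoT rows, False for Fixed rows."""
    rows = weight.flatten(1)                      # one row per filter
    variances = rows.var(dim=1)
    k = int(pot_ratio * rows.size(0))             # number of PoT rows
    if k == 0:
        return torch.zeros_like(variances, dtype=torch.bool)
    threshold = variances.sort().values[k - 1]    # variance threshold
    return variances <= threshold                 # low variance -> PoT

# Example: a conv layer with 64 filters, 60% PoT / 40% Fixed.
w = torch.randn(64, 32, 3, 3)
pot_mask = assign_schemes_by_variance(w, pot_ratio=0.60)
print(int(pot_mask.sum()), "filters assigned to PoT")
```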

3.1.1 Fixed-Point (Fixed) Quantizer

The Fixed scheme uses uniform quantization levels, with the quantized weights mapped as a scaling factor times the quantization levels. With k-bit Fixed, the quantized weights should be from the following set:

Q_Fixed(α, k) = α × {0, ±1/(2^(k−1)−1), ±2/(2^(k−1)−1), ..., ±1},   (1)

where α denotes the scaling factor. Then the quantizer function from a 32-bit floating-point weight w to a Fixed quantized weight ŵ is

ŵ = α · round( (2^(k−1)−1) · clip(w/α) ) / (2^(k−1)−1),   (2)

where round(·) rounds to the nearest integer, dividing by α shifts a value within [−α, α] into the range of [−1, 1], and clip(·) clips according to

clip(x) = max(−1, min(x, 1)).   (3)
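The following PyTorch snippet is a minimal sketch of the Fixed quantizer in Eqs. (1)–(3); the choice of scaling factor (here simply the maximum absolute weight) is an assumption for illustration.

```python
# A minimal sketch of the k-bit Fixed quantizer of Eqs. (1)-(3); the
# scaling factor alpha is chosen here as the max absolute weight, which
# is only an illustrative assumption.
import torch

def quantize_fixed(w: torch.Tensor, bits: int = 4, alpha: float = None) -> torch.Tensor:
    if alpha is None:
        alpha = w.abs().max().item()               # simple choice of scaling factor
    n = 2 ** (bits - 1) - 1                        # number of positive levels
    w_clipped = torch.clamp(w / alpha, -1.0, 1.0)  # clip(w / alpha), Eq. (3)
    return alpha * torch.round(n * w_clipped) / n  # round to uniform levels, Eq. (2)

print(quantize_fixed(torch.randn(8), bits=4))
```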

3.1.2 Power-of-Two (PoT) Quantizer

The PoT scheme uses power-of-two numbers as quantization levels. With k-bit PoT, the quantized weights should be from the following set:

Q_PoT(α, k) = α × {0, ±2^(−2^(k−1)+1), ±2^(−2^(k−1)+2), ..., ±2^(−1), ±1}.   (4)

Then the quantizer function from a 32-bit floating-point weight w to a PoT quantized weight ŵ is

ŵ = α · Π_{Q_PoT(1, k)}( clip(w/α) ),   (5)

where Π_Q(·) projects its argument onto the nearest element of the level set Q.
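Analogously, a hedged sketch of the PoT quantizer in Eqs. (4)–(5) is shown below, implementing the projection as a nearest-level lookup; the scaling-factor choice is again only illustrative.

```python
# A minimal sketch of the k-bit PoT quantizer of Eqs. (4)-(5): project the
# clipped, scaled weight onto the nearest power-of-two level. The scaling
# factor choice is again only illustrative.
import torch

def quantize_pot(w: torch.Tensor, bits: int = 4, alpha: float = None) -> torch.Tensor:
    if alpha is None:
        alpha = w.abs().max().item()
    # Positive levels of Eq. (4): 0 and 2^-e for e = 2^(bits-1)-1, ..., 1, 0.
    exps = torch.arange(2 ** (bits - 1) - 1, -1, -1, dtype=w.dtype)
    levels = torch.cat([torch.zeros(1, dtype=w.dtype), 2.0 ** (-exps)])
    w_clipped = torch.clamp(w / alpha, -1.0, 1.0)
    # Nearest-level projection on magnitudes; the sign is restored afterwards.
    idx = (w_clipped.abs().unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return alpha * torch.sign(w_clipped) * levels[idx]

print(quantize_pot(torch.randn(8), bits=4))
```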

3.2 Multiple Precisions for Boosting Accuracy

Figure 2: The row-wise mixed-scheme and multiple-precision quantization. The PoT scheme uses only the 4-bit precision for weights/activations i.e., PoT-W4A4, and the Fixed scheme uses two precisions i.e., Fixed-W4A4 and Fixed-W8A4.

Motivated by the previous works that either implicitly use higher precisions for the first and the last layers, or explicitly assign multiple precisions to different layers, it is essential to use higher precisions for some parts of the model to boost accuracy. However, the layer-wise multi-precision approach is not compatible with the layer-by-layer inference execution on the same pieces of hardware. In other words, the layer-wise multi-precision approach incurs large implementation overhead, which may neutralize the expected inference speedup.

Figure 2 demonstrates the proposed multi-precision approach aligned with the row-wise mixed schemes. The majority of the rows use the 4-bit precision for weights/activations, i.e., PoT-W4A4 and Fixed-W4A4, because 2-bit incurs large accuracy loss and 3-bit is not suitable for hardware implementation, which prefers operands in 2-bit, 4-bit, 8-bit, etc. To boost accuracy, a higher precision with 8-bit weights and 4-bit activations is used with the Fixed scheme, i.e., Fixed-W8A4. The higher precision is not applied to the PoT scheme because of its rigid resolution issue. The ratio of PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 can be determined offline, and is the same across different layers, in order to keep the layer-wise uniformality in the hardware implementation, such that little overhead is incurred during the layer-by-layer inference execution and inference speedup can be guaranteed. In general, a larger portion of PoT-W4A4 helps with inference speedup but degrades accuracy. By using a small portion of Fixed-W8A4, the accuracy degradation due to the PoT scheme can be mitigated, as shown in Figure 3. This work fixes the percentage of Fixed-W8A4 at 5%, as it is enough to mitigate the accuracy degradation, and an even smaller percentage would only result in marginal further inference speedup.

Figure 3: The effect of the PoT-W4A4 ratio on the accuracy of ResNet-50, ResNet-18, and MobileNet-V2 models on the CIFAR-10, CIFAR-100, and ImageNet datasets. In the top row, only Fixed-W4A4 is mixed with PoT-W4A4. In the bottom row, 5% of Fixed-W8A4 is used, besides PoT-W4A4 and Fixed-W4A4.
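As a small illustration of how such an offline ratio translates into per-layer assignments, the sketch below (illustrative names and rounding, not the paper's implementation) computes the number of filters per scheme for a 65:30:5 split; the same ratio is applied to every layer to keep the layer-wise uniformality.

```python
# Hedged sketch (illustrative names and rounding): turn the offline ratio of
# PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 into per-layer filter counts. The same
# ratio is applied to every layer to keep the layer-wise uniformality.

def filters_per_scheme(num_filters: int, ratio=(65, 30, 5)):
    total = sum(ratio)
    counts = [num_filters * r // total for r in ratio]
    counts[1] += num_filters - sum(counts)        # give any remainder to Fixed-W4A4
    return {"PoT-W4A4": counts[0], "Fixed-W4A4": counts[1], "Fixed-W8A4": counts[2]}

for n in (64, 128, 512):
    print(n, filters_per_scheme(n))
```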

3.3 RMSMP Quantization Algorithm

input : 32-bit floating-point DNN model with weights W; ratio (r1 : r2 : r3) of PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4
output : Quantized model
foreach layer i do
       foreach filter F_j in layer i do
              // power iteration for the top Hessian eigenvalue
              Initialize random vector v;
              foreach t in {1, ..., T_max} do
                     Normalize v;
                     v ← H_j v ; // Hessian-vector product, Eq. 8
              λ_j ← vᵀ H_j v ;
       Assign Fixed-W8A4 to filters with λ_j in the top 5% of layer i;
       foreach remaining filter do
              Sort by weight variance, get threshold θ from r1 : r2;
              Assign PoT-W4A4 if variance lower than θ;
              Assign Fixed-W4A4 if variance higher than θ;
foreach epoch do
       foreach batch do
              // forward propagation
              foreach layer i do
                     Compute outputs with quantized weights Q(W_i);
              // back propagation
              Update W via STE ; // Eq. 6
       Re-assign schemes and precisions every 10 epochs;
Return the quantized model;
Algorithm 1 RMSMP Quantization
Quantization MobileNet-v2 Accuracy (%) ResNet-18 Accuracy (%) ResNet-50 Accuracy (%)
Method Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
CIFAR-10
Baseline (W32A32) 92.51 - 93.62 - 93.91 -
Fixed-W4A4 91.76 (92.34) - 92.9 (93.43) - 93.16 (93.73)
PoT-W4A4 90.92 (91.34) - 92.14 (92.97) - 92.53 (93.15) -
APoT-W4A4 [21] 91.83 (92.72) - 92.94 (93.47) - 92.87 (93.68) -
PoT-W4A4 + Fixed-W4A4 91.38 (91.77) - 92.66 (93.21) - 93.14 (93.76) -
APoT-W4A4 + Fixed-W4A4 [2] 91.99 (92.55) - 92.98 (93.65) - 93.22 (93.77) -
Fixed-W4A4 + Fixed-W8A4 92.54 - 93.69 - 94.11 -
RMSMP 92.58 - 93.72 - 94.03 -
CIFAR-100
Baseline (W32A32) 71.48 91.98 74.49 92.70 77.77 94.00
Fixed-W4A4 70.22 (71.16) 90.88 (91.63) 73.88 (74.37) 91.72 (92.31) 76.67 (77.41) 93.22 (93.85)
PoT-W4A4 67.11 (68.68) 89.21 (90.06) 72.97 (73.88) 91.65 (92.14) 74.58 (76.83) 92.22 (92.74)
APoT-W4A4 [21] 70.21 (71.13) 90.85 (91.69) 73.97 (74.33) 92.03 (92.49) 76.58 (77.56) 92.94 (93.55)
PoT-W4A4 + Fixed-W4A4 68.94 (70.37) 90.05 (90.89) 73.41 (74.12) 91.68 (92.25) 76.95 (77.28) 92.87 (93.66)
APoT-W4A4 + Fixed-W4A4 [2] 70.25 (71.50) 90.92 (91.82) 74.03 (74.60) 92.05 (92.63) 76.97 (77.31) 92.99 (93.86)
Fixed-W4A4 + Fixed-W8A4 71.52 91.99 74.54 92.61 77.92 94.18
RMSMP 71.51 91.97 74.61 92.69 77.85 94.13
ImageNet
Baseline (W32A32) 71.88 90.29 70.25 89.48 76.51 93.09
Fixed-W4A4 67.23 (69.26) 86.03 (88.18) 68.66 (69.72) 87.54 (88.67) 75.58 (76.22) 92.01 (92.43)
PoT-W4A4 65.88 (67.93) 84.83 (86.63) 67.11 (68.20) 85.93 (87.14) 74.77 (75.03) 91.66 (92.22)
APoT-W4A4 [21] 67.76 (68.97) 86.42 (88.17) 68.48 (70.70) 87.92 (89.60) 75.49 (76.60) 92.93 (93.10)
PoT-W4A4 + Fixed-W4A4 66.20 (68.54) 85.66 (87.65) 67.98 (68.94) 86.75 (88.66) 75.85 (76.19) 91.97 (92.77)
APoT-W4A4 + Fixed-W4A4 [2] 67.31 (68.99) 86.11 (88.04) 69.22 (70.27) 88.33 (89.42) 76.11 (76.22) 92.58 (92.86)
Fixed-W4A4 + Fixed-W8A4 68.98 89.04 70.48 89.52 76.68 93.44
RMSMP 69.02 89.07 70.73 89.62 76.62 93.36
Table 1: Results from different quantization methods for MobileNet-v2, ResNet-18, and ResNet-50 models on CIFAR-10, CIFAR-100, and ImageNet datasets. Baselines are unquantized models with 32-bit floating-point weights/activations, i.e., W32A32. We explore different quantization schemes including Fixed-Point (Fixed), Power-of-Two (PoT), Additive Power-of-Two (APoT), and their combinations (PoT + Fixed with a ratio of 1:1, and APoT + Fixed with a ratio of 3:2). For these schemes, we report accuracy with 4-bit precision for weights/activations, i.e., W4A4; the accuracy in parentheses is obtained by relaxing the precision of the first and last layers to W8A8, while the other layers remain in W4A4. In the last two methods, row-wise multi-precision is introduced by (i) incorporating 4-bit Fixed with 8-bit Fixed (Fixed-W4A4 + Fixed-W8A4) with a ratio of 95:5, and (ii) incorporating PoT-W4A4, Fixed-W4A4 and Fixed-W8A4 (RMSMP), with a ratio of 65:30:5. In all the quantization methods with mixed schemes and multiple precisions, the combination is row-wise within a layer.

The RMSMP quantization algorithm can train a DNN model from scratch or quantize a pre-trained model, such that for each layer, the numbers of filters quantized into PoT-W4A4, Fixed-W4A4, and Fixed-W8A4 follow the predefined ratio of r1 : r2 : r3, where r1 + r2 + r3 = 100%.

For the assignment of quantization schemes and precisions to the filters of each layer, we use the Hessian-based method to determine which filters should use Fixed-W8A4 (the higher precision). For the remaining filters, we determine PoT-W4A4 vs. Fixed-W4A4 based on the variances of the weights in each filter. Once the assignment of quantization scheme and precision (PoT-W4A4, Fixed-W4A4, or Fixed-W8A4) is determined down to the filter level for each layer, the Straight Through Estimator (STE) [1, 35]

∂L/∂w = ∂L/∂ŵ,  where ŵ = Q(w),   (6)

can be used to quantize the model into the specified schemes and precisions, while resolving the unavailable-gradient issue that arises when backpropagating through the quantization from continuous values into discrete values.
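A minimal PyTorch sketch of the STE in Eq. (6) as a custom autograd function is given below; the toy quantizer passed in is only for illustration, and the per-row scheme/precision selection is assumed to happen elsewhere.

```python
# A minimal PyTorch sketch of the STE in Eq. (6): the forward pass applies a
# (non-differentiable) quantizer, the backward pass passes the gradient through
# unchanged. The toy quantizer below is only for illustration.
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, quantizer):
        return quantizer(w)                       # discrete levels in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                  # dL/dw := dL/dw_hat (Eq. 6)

w = torch.randn(8, requires_grad=True)
w_q = STEQuantize.apply(w, lambda x: torch.round(7 * x) / 7)   # toy 4-bit Fixed quantizer
w_q.sum().backward()
print(w.grad)                                     # all ones: gradient passed straight through
```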

In order to determine the filters with the higher precision (Fixed-W8A4), we extend the Hessian-based method in HAWQ [10]: the max eigenvalue of the Hessian matrix of each filter is computed, and the filters with the top 5% eigenvalues are assigned the higher precision. The Hessian is the matrix of second-order derivatives of the loss with respect to the weight parameters:

H = ∂²L/∂W² = ∂(∂L/∂W)/∂W,   (7)

where the first-order gradient can be calculated through the normal backpropagation process.

To compute the top eigenvalue of the Hessian matrix H efficiently, we employ the power iteration method as proposed in [10, 28]. By introducing an auxiliary vector v that is independent of W, we iterate with the Hessian-vector product

H v = ∂( gᵀ v ) / ∂W,  with g = ∂L/∂W,   (8)

so that ∂(gᵀ v)/∂W = (∂gᵀ/∂W) v = H v, where the last equality is proved in [10]. With Eq. 8, the Hessian-vector product H v is obtained by computing the right-hand term through one additional backpropagation, without explicitly forming H; repeatedly normalizing v and applying Eq. 8 yields the top eigenvalue λ = vᵀ H v.
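The sketch below illustrates this power iteration with PyTorch's automatic differentiation, using the Hessian-vector product of Eq. (8); it is a simplified, per-parameter version with illustrative names, not the paper's exact implementation.

```python
# Illustrative per-parameter sketch of the power iteration built on the
# Hessian-vector product of Eq. (8); names and the toy example are ours.
import torch

def top_hessian_eigenvalue(loss: torch.Tensor, param: torch.Tensor, iters: int = 20) -> float:
    g = torch.autograd.grad(loss, param, create_graph=True)[0]   # first-order gradient
    v = torch.randn_like(param)
    eig = torch.tensor(0.0)
    for _ in range(iters):
        v = v / v.norm()                                         # normalize v
        hv = torch.autograd.grad((g * v).sum(), param,           # Hv = d(g^T v)/dW, Eq. (8)
                                 retain_graph=True)[0]
        eig = (v * hv).sum()                                     # Rayleigh quotient v^T H v
        v = hv
    return eig.item()

# Toy check: loss = sum(c_i * w_i^2) has Hessian 2*diag(c); top eigenvalue = 6.
w = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
loss = (torch.tensor([1.0, 2.0, 3.0]) * w ** 2).sum()
print(top_hessian_eigenvalue(loss, w))                           # ~ 6.0
```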

Further, the scheme assignment (PoT-W4A4 vs. Fixed-W4A4) is determined based on the variances of the filters. Finally, after the filters are assigned their quantization schemes and precisions, we train the quantized model through STE. Please note that we update the assignments of the filters every 10 epochs. The detailed steps are shown in Algorithm 1.

4 Evaluation

4.1 Experiment Setup

Method Approach Bit-Width First/Last Layer Top-1 (%) Top-5 (%)
Baseline - W32A32 70.25 89.48
Dorefa [38] Linear W4A4 68.10 88.10
PACT [4] Linear W4A4 69.20 89.00
DSQ [12] Linear W4A4 69.56 N/A
QIL [16] Linear W4A4 70.10 N/A
L2Q [3] Linear W4A4 - 65.92 86.72
APoT [21] Non-Lin. W4A4 8bit 8bit 70.70 89.60
LQ-Nets [36] Non-Lin. W4A4 69.30 88.80
DNAS [33] MP-Lin. Mixed 70.64 N/A
MPDNN [29] MP-Lin. Mixed 70.08 N/A
MSQ [2] MS W4A4 70.27 89.42
RMSMP (Ours) MP-MS W4A4* 70.73 89.62
Table 2: Comparisons with existing quantization works for ResNet-18 on ImageNet, using the (equivalent) 4-bit precision. The covered quantization approaches include linear, non-linear (Non-Lin.), layer-wise multi-precision with linear (MP-Lin.), mixed-scheme (MS), and our row-wise multi-precision, mixed-scheme (MP-MS). The baseline is the unquantized model with 32-bit floating-point weights/activations, i.e., W32A32. For our method, W4A4* indicates that 5% of the weights are in 8-bit (the activations are all in 4-bit). Of particular interest is the "First/Last Layer" column, which provides the detailed precision for the first and last layers in those works: ✓ means the same quantization as in the other layers; ✗ means no quantization (still 32-bit floating-point); and 8bit means 8-bit precision is used for the first/last layers.
Method Approach Bit-Width First/Last Layer Top-1 (%) Top-5 (%)
Baseline - W32A32 76.51 93.09
Dorefa [38] Linear W4A4 71.40 88.10
PACT [4] Linear W4A4 76.50 93.30
APoT [21] Non-Lin. W4A4 8bit 8bit 76.60 93.10
LQ-Nets [36] Non-Lin. W4A4 75.40 92.40
HAQ [31] MP-Lin. Mixed 8bit 76.15 92.89
MSQ [2] MS W4A4 76.22 92.86
RMSMP (Ours) MP-MS W4A4* 76.62 93.36
Table 3: Comparisons with existing quantization work for ResNet-50 on ImageNet, using the (equivalent) 4-bit precision.
Method Approach Bit-Width First/Last Layer Top-1 (%) Top-5 (%)
Baseline - W32A32 71.88 90.29
PACT [4] Linear W4A4 61.40 N/A
DSQ [12] Non-Lin. W4A4 64.80 N/A
HAQ [31] MP-Lin. Mixed 8bit 67.01 87.46
MSQ [2] MS W4A4 68.99 88.04
RMSMP (Ours) MP-MS W4A4* 69.02 89.07
Table 4: Comparisons with existing work for MobileNet-V2 on ImageNet, using the (equivalent) 4-bit precision.
Method Approach Bit-Width Evaluation Metric Result (%)
BERT on SST-2
Baseline - W32A32 Accuracy 93.00
Fixed Linear W4A4 Accuracy 92.78
PoT Non-Lin. W4A4 Accuracy 92.43
PoT + Fixed Mixed W4A4 Accuracy 92.78
QBERT [28] MP-Lin. Mixed Accuracy 92.66
RMSMP (Ours) MP-MS W4A4* Accuracy 92.87
BERT on MNLI
Baseline - W32A32 mm-Acc. 84.90
Fixed Linear W4A4 mm-Acc. 84.71
PoT Non-Lin. W4A4 mm-Acc. 84.79
PoT + Fixed Mixed W4A4 mm-Acc. 84.51
QBERT [28] MP-Lin. Mixed mm-Acc. 84.17
RMSMP (ours) MP-MS W4A4* mm-Acc. 84.83
Table 5: Comparisons with existing work for the BERT model on SST-2 and MNLI datasets, using the (equivalent) 4-bit precision.

The evaluation metric of SST-2 is accuracy, while the evaluation metric of MNLI is mismatched accuracy (mm-Acc.). For both metrics, a higher value is better.

Quantization Method PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 First/Last Layer Quantization Accuracy Results on FPGA XC7Z020 Results on FPGA XC7Z045
Top-1 (%) Top-5 (%) LUT DSP Throughput (GOP/s) Latency (ms) LUT DSP Throughput (GOP/s) Latency (ms)
(1) Fixed 0:100:0 8-bit Fixed 69.72 88.67 26% 100% 29.6 122.6 21% 100% 115.6 31.4
(2) Fixed 0:100:0 68.66 87.54 23% 100% 36.5 99.3 19% 100% 142.7 25.4
(3) PoT 100:0:0 8-bit Fixed 68.20 87.14 41% 100% 62.4 58.1 40% 100% 290.5 12.5
(4) PoT 100:0:0 67.11 85.93 43% 12% 72.2 50.2 43% 3% 352.6 10.3
(5) PoT + Fixed 50:50:0 8-bit Fixed 68.94 88.66 50% 100% 50.3 72.0 48% 100% 196.8 18.4
(6) PoT + Fixed 50:50:0 67.98 86.75 46% 100% 75.8 47.8 45% 100% 296.3 12.2
(7) PoT + Fixed 60:40:0 8-bit Fixed 68.53 88.47 52% 100% 57.0 63.6 - - - -
(8) PoT + Fixed 67:33:0 8-bit Fixed 68.46 88.22 - - - - 63% 100% 245.8 14.8
MSQ-1 [2] 60:40 69.22 88.33 53% 100% 77.0 47.1 - - - -
MSQ-2 [2] 67:33 69.14 88.17 - - - - 66% 100% 359.2 10.1
RMSMP-1 60:35:5 70.66 89.53 57% 100% 89.0 40.7 - - - -
RMSMP-2 65:30:5 70.73 89.62 - - - - 67% 100% 421.1 8.6
Table 6: Implementations of different quantization methods on two FPGA boards for ResNet-18 on ImageNet, using the (equivalent) 4-bit precision. For methods (1)–(8) and the two RMSMP variants, each quantization method corresponds to a ratio of PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 within each layer. The first and last layers are quantized into either 8-bit Fixed, or the same as the other layers (denoted by ✓). The MSQ-1 and MSQ-2 rows are by MSQ [2] with the ratios of APoT-W4A4 and Fixed-W4A4 as 60:40 and 67:33, respectively. Note that MSQ uses APoT instead of PoT, and does not use multi-precision within layers. The proposed RMSMP method is compared with the other methods in terms of accuracy as well as resource utilization, throughput, and latency on the two FPGAs. For each FPGA board, the utilization ratios of look-up tables (LUT) and digital signal processing blocks (DSP) under each quantization method are provided in percentage. The LUT utilization demonstrates the effectiveness of a quantization method for inference speedup. For XC7Z020, the total numbers of LUTs and DSPs are 53.2K and 220, respectively. For XC7Z045, the total numbers of LUTs and DSPs are 218.6K and 900, respectively.

Our RMSMP is evaluated on image classification and natural language processing (NLP) applications. All model training and quantization are conducted on NVIDIA TITAN RTX and GeForce RTX 2080 Ti GPUs, with the CUDA 10.2 and PyTorch 1.5 frameworks on the Ubuntu 18.04 operating system. The quantization algorithm utilizes the same data augmentation techniques as those used for training the baseline (32-bit floating-point) models. Training tricks such as step or cosine learning rate decay and weight regularization are also kept the same between baseline training and the quantization algorithm.

The evaluated models on image classification tasks include ResNet-18, ResNet-50 [14] and MobileNet-v2 [27] on the CIFAR-10, CIFAR-100 [18] and ImageNet ILSVRC-2012 [17] datasets. Baseline models in 32-bit floating-point representation for CIFAR-10 and CIFAR-100 are trained from scratch and then quantized; for ImageNet, pre-trained 32-bit floating-point models are quantized. The initial learning rates are set separately for CIFAR-10, CIFAR-100, and ImageNet. For NLP tasks with BERT [8] models, we evaluate on the Stanford Sentiment Treebank (SST-2) and Multi-Genre Natural Language Inference (MNLI) datasets from the General Language Understanding Evaluation (GLUE) [30] benchmark. The pre-trained BERT models are from the HuggingFace Transformers library [32]. Quantization and finetuning are performed simultaneously for 3 epochs. Besides, for all models, the maximum number of iterations of the power iteration method is set to 20.

In addition to model accuracy, we demonstrate the hardware efficiency (in terms of resource utilization, throughput, and latency) of the proposed RMSMP compared with other quantization methods, by implementing the architectures with heterogeneous GEMM cores on FPGA devices, i.e., Zynq XC7Z020 and XC7Z045. Different ratios of quantization schemes/precisions are realized by adjusting the ratio among the processing element (PE) array sizes in the GEMM cores. Our optimal RMSMP implementation utilizes three heterogeneous GEMM cores, i.e., one for processing PoT-W4A4 filters, one for Fixed-W4A4, and one for Fixed-W8A4. The working frequency is set to 100 MHz for all implementations.

4.2 Accuracy Performance

Table 1 compares the proposed RMSMP method with other quantization methods for three models on three datasets. RMSMP generally achieves higher top-1 accuracy than methods with Fixed, PoT, APoT, PoT + Fixed, and APoT + Fixed using 4-bit precision, and comparable accuracy to the Fixed-W4A4 + Fixed-W8A4 method.

The results in Tables 2, 3, and 4 demonstrate the effectiveness of the proposed RMSMP method for ResNet-18, ResNet-50, and MobileNet-V2, respectively, on ImageNet using mainly 4-bit precision. For ResNet-18, RMSMP even slightly exceeds the top-1 accuracy of the full-precision baseline (70.73% vs. 70.25%) and outperforms all of the listed existing methods. For ResNet-50, RMSMP likewise exceeds the baseline accuracy (76.62% vs. 76.51%) and achieves the highest top-1 accuracy among the compared methods. As for MobileNet-V2, RMSMP achieves the least accuracy loss, with the highest top-1 accuracy among the quantization methods in Table 4. In addition, comparisons on the large language model BERT are provided in Table 5. BERT is a large model with higher redundancy, and therefore it is relatively easy to achieve negligible accuracy loss after quantization. All the quantization methods in Table 5 perform very well, and RMSMP yields slightly better accuracy than the other methods on the SST-2 and MNLI datasets.

4.3 Hardware Efficiency

The proposed RMSMP method is compared with other quantization methods on two FPGA boards with the ResNet-18 model on the ImageNet dataset, as displayed in Table 6. The total amount of resources on a specific FPGA board determines the optimal ratio of the PoT-W4A4 : Fixed-W4A4 : Fixed-W8A4 schemes. Specifically, the optimal ratio on XC7Z020 is 60:35:5 (RMSMP-1), resulting in a top-1 accuracy of 70.66% and a latency of 40.7ms, while the optimal ratio on XC7Z045 is 65:30:5 (RMSMP-2), leading to a top-1 accuracy of 70.73% and a latency of 8.6ms. Compared with the single-scheme, single-precision method (1) Fixed, RMSMP with the optimal ratio of quantization schemes achieves up to about 3.0x speedup on XC7Z020 (122.6ms vs. 40.7ms) and up to 3.65x speedup on XC7Z045 (31.4ms vs. 8.6ms). Methods (1), (3), (5), (7), and (8) employ 8-bit Fixed quantization for the first and last layers to maintain accuracy, whereas this hurts the inference speed. Compared with methods (1), (2), (5), and (6), RMSMP enhances the utilization ratio of LUTs to increase the throughput. Additionally, RMSMP achieves higher accuracy and lower latency with full DSP usage compared with method (4). We also compare with MSQ [2] with the first/last layers quantized into 4-bit; our RMSMP outperforms MSQ in both accuracy and inference speed.

5 Conclusion

This work proposes a novel DNN quantization framework, namely RMSMP, with a Row-wise Mixed-Scheme and Multi-Precision approach, featuring (i) layer-wise uniformality to fulfill the requirement of practical hardware implementation, (ii) row-wise flexibility for mixed schemes and multiple precisions, (iii) hardware-informative selection of candidate schemes and precisions (bit-widths) to significantly reduce the algorithm search space, and (iv) the best accuracy performance among the state-of-the-arts. Evaluated on image classification and natural language processing applications, RMSMP achieves superior accuracy performance; we also implement RMSMP on FPGA devices, demonstrating inference speedup.

Acknowledgment

This work is partly supported by the National Science Foundation CCF-1901378, and CCF-1937500. Any opinions, findings, and conclusions or recommendations in this material are those of the authors and do not necessarily reflect the views of NSF.

References

  • [1] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.3.
  • [2] S. Chang, Y. Li, M. Sun, R. Shi, H. K. So, X. Qian, Y. Wang, and X. Lin (2021) Mix and match: a novel fpga-centric deep neural network quantization framework. In Proceedings of the 2021 IEEE International Symposium on High Performance Computer Architecture, Cited by: §2.1.2, Table 1, §4.3, Table 2, Table 3, Table 4, Table 6.
  • [3] G. Cheng, L. Ye, L. Tao, Z. Xiaofan, H. Cong, C. Deming, and C. Yao (2019) l2q: An ultra-low loss quantization method for dnn. The 2019 International Joint Conference on Neural Networks (IJCNN). Cited by: §1, §2.1.1, Table 2.
  • [4] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §2.1.1, Table 2, Table 3, Table 4.
  • [5] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems (NeurIPS), pp. 3123–3131. Cited by: §1, §2.1.1.
  • [6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830. Cited by: §1, §2.1.1.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §4.1.
  • [9] Z. Dong, Z. Yao, Y. Cai, D. Arfeen, A. Gholami, M. W. Mahoney, and K. Keutzer (2019) HAWQ-v2: hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852. Cited by: §1, §2.2, §2.2.
  • [10] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2019) Hawq: hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 293–302. Cited by: §1, §2.2, §2.2, §3.3, §3.3.
  • [11] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019) Learned step size quantization. International Conference on Learning Representations (ICLR). Cited by: §1, §2.1.1.
  • [12] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4852–4861. Cited by: §1, §2.1.1, Table 2, Table 4.
  • [13] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems (NeurIPS), pp. 1135–1143. Cited by: §2.2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778. Cited by: §4.1.
  • [15] Z. He and D. Fan (2019) Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11438–11446. Cited by: §1, §2.1.1.
  • [16] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4350–4359. Cited by: §1, §2.1.1, Table 2.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.
  • [18] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [19] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin (2018) Extremely low bit neural network: squeeze the last bit out with admm. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). Cited by: §1, §2.1.2.
  • [20] F. Li, B. Zhang, and B. Liu (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §1, §2.1.1.
  • [21] Y. Li, X. Dong, and W. Wang (2020) Additive powers-of-two quantization: an efficient non-uniform discretization for neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §2.1.2, Table 1, Table 2, Table 3.
  • [22] X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353. Cited by: §1, §2.1.1.
  • [23] Q. Lou, F. Guo, M. Kim, L. Liu, and L. Jiang (2019) AutoQ: automated kernel-wise neural network quantization. In International Conference on Learning Representations (ICLR), Cited by: §2.2.
  • [24] D. Miyashita, E. H. Lee, and B. Murmann (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025. Cited by: §1, §2.1.2.
  • [25] E. Park, S. Yoo, and P. Vajda (2018) Value-aware quantization for training and inference of neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 580–595. Cited by: §2.2.
  • [26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision (ECCV), pp. 525–542. Cited by: §1, §2.1.1.
  • [27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4510–4520. Cited by: §4.1.
  • [28] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) Q-bert: hessian based ultra low precision quantization of bert.. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), pp. 8815–8821. Cited by: §1, §2.2, §2.2, §3.3, Table 5.
  • [29] S. Uhlich, L. Mauch, F. Cardinaux, K. Yoshiyama, J. A. Garcia, S. Tiedemann, T. Kemp, and A. Nakamura (2020) Mixed precision dnns: all you need is a good parametrization. International Conference on Learning Representations (ICLR). Cited by: §1, §2.2, §2.2, Table 2.
  • [30] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. Cited by: §4.1.
  • [31] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. International Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.2, §2.2, Table 3, Table 4.
  • [32] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §4.1.
  • [33] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090. Cited by: §1, §2.2, §2.2, Table 2.
  • [34] Z. Yao, A. Gholami, K. Keutzer, and M. Mahoney (2019) PyHessian: neural networks through the lens of the hessian. arXiv preprint arXiv:1912.07145. Cited by: §1, §2.2, §2.2.
  • [35] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin (2018) Understanding straight-through estimator in training activation quantized neural nets. In International Conference on Learning Representations (ICLR), Cited by: §3.3.
  • [36] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 365–382. Cited by: §1, §2.1.2, Table 2, Table 3.
  • [37] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.2.
  • [38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2.1.1, Table 2, Table 3.
  • [39] C. Zhu, S. Han, H. Mao, and W. J. Dally (2017) Trained ternary quantization. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1.1.