Post-training Quantization with Multiple Points: Mixed Precision without Mixed Precision

02/20/2020 · by Xingchao Liu, et al.

We consider the post-training quantization problem, which discretizes the weights of pre-trained deep neural networks without re-training the model. We propose multipoint quantization, a quantization method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers; this is in contrast to typical quantization methods that approximate each weight using a single low-precision number. Computationally, we construct the multipoint quantization with an efficient greedy selection procedure, and adaptively decide the number of low-precision points for each quantized weight vector based on the error of its output. This allows us to achieve higher precision levels for important weights that greatly influence the outputs, yielding an 'effect of mixed precision' without physical mixed precision implementations (which require specialized hardware accelerators). In practice, our method can be implemented with common operands, bringing almost no memory or computation overhead. We show that our method outperforms a range of state-of-the-art methods on ImageNet classification, and that it generalizes to more challenging tasks such as PASCAL VOC object detection.


1 Introduction

The past decade has witnessed the great success of deep neural networks (DNNs) in many fields. Nonetheless, DNNs require expensive computational resources and enormous storage space, making them difficult to deploy on resource-constrained devices, such as Internet of Things (IoT) devices, smartphone processors, and embedded controllers in mobile robots (howard2017mobilenets; xu2018scaling).

Quantization is a promising method for creating more energy-efficient deep learning systems (han2015deep; hubara2017quantized; neta_zmora_2018_1297430; cheng2018recent). By approximating real-valued weights and activations with low-bit numbers, quantized neural networks (QNNs) trained with state-of-the-art algorithms (e.g., courbariaux2015binaryconnect; rastegari2016xnor; louizos2018relaxed; li2019additive) can perform comparably to their full-precision counterparts (e.g., jung2019learning; li2019additive).

This work focuses on the problem of post-training quantization, which aims to generate a QNN from a pretrained full-precision network without accessing the original training data (e.g., sung2015resiliency; krishnamoorthi2018quantizing; zhao2019improving; meller2019same; banner2019post; nagel2019data; choukroun2019low). This scenario arises widely in practice. For example, when a client wants to deploy, in low precision, a full-precision model provided by a machine learning service provider, the client may have no access to the original training data due to privacy policies. In addition, compared with training QNNs from scratch, post-training quantization is much more computationally efficient.

Mixed precision is a recent technique for boosting the performance of QNNs (wang2019haq; banner2019post; gong2019mixed; Dong_2019_ICCV). The idea is to assign more bits to important layers (or channels) and fewer bits to unimportant layers/channels, so as to better control the overall quantization error and balance accuracy and cost more efficiently. The difficulty, however, is that current mixed precision methods require specialized hardware (e.g., wang2019haq). Most commodity hardware does not support efficient mixed precision computation (e.g., due to chip area constraints (horowitz20141)). This makes mixed precision difficult to implement in practice, despite being highly desirable.

In this paper, we propose multipoint quantization for post-training quantization, which achieves flexibility similar to mixed precision while using only a single precision level. The idea is to approximate a full-precision weight vector by a linear combination of multiple low-bit vectors. This allows us to use a larger number of low-bit vectors to approximate the weights of more important channels, and fewer points for insensitive channels. It enables a flexible trade-off between accuracy and cost on a per-channel basis. Because it does not require a physical mixed precision implementation, our method can be easily deployed on commodity hardware with common operands.

We propose a greedy algorithm that iteratively finds the optimal low-bit vectors to minimize the approximation error. The algorithm sequentially adds the low-bit vector that yields the largest reduction in error, until a stopping criterion is met. We develop a theoretical analysis showing that the error decays exponentially with the number of low-bit vectors used. This fast decay ensures that only a small number of additional points is needed, keeping the overhead small.

Our multipoint quantization is computationally efficient. The key advantage is that it only involves multiply-accumulate (MAC) operations during inference, which are highly optimized on common deep learning hardware. We adaptively decide the number of low-precision points for each channel by measuring its output error. Empirically, only a small number of channels require a large number of points. By applying multipoint quantization to these channels, the performance of the QNN is improved significantly without any training or fine-tuning, at only a negligible increase in memory cost.

We conduct experiments on ImageNet classification with different neural architectures. Our method performs favorably against state-of-the-art methods, and even outperforms the method of banner2019post, which exploits physical mixed precision, in accuracy. We also verify the generality of our approach by applying it to the PASCAL VOC object detection task.

Figure 1: In naive quantization (a), a point is approximated by its nearest grid point. In multipoint quantization with two points (n = 2) (b), a point is approximated by the nearest point on the linear segments. One can see that multipoint quantization has much higher representation power.

2 Method

Section 2.1 introduces background on post-training quantization. We then present the main framework of multipoint quantization in Section 2.2, its application to deep neural networks in Section 2.3, and its implementation overhead in Section 2.4.

2.1 Preliminaries: Post-training Quantization

Given a pretrained full-precision neural network $f$, the goal of post-training quantization is to generate a quantized neural network (QNN) $\hat{f}$ with high performance. We assume the full training dataset of $f$ is unavailable, but that there is a small calibration dataset $\mathcal{D}_c = \{x_i\}_{i=1}^{n_c}$, where $n_c$ is very small (e.g., a few hundred images). The calibration set is used only for choosing a small number of hyperparameters of our algorithm; we cannot directly train $\hat{f}$ on it because it is too small and would cause overfitting.

The $b$-bit linear quantization approximates real numbers using the following quantization set $Q_{\alpha,\beta}$,

$$Q_{\alpha,\beta} = \{\, \alpha u + \beta \;:\; u \in U_b \,\}, \qquad (1)$$

where $U_b$ denotes the uniform grid on $[-1, 1]$ with increment $\Delta = 2/(2^b - 1)$ between elements, $\alpha$ is a scaling factor that controls the length of $Q_{\alpha,\beta}$, and $\beta$ specifies the center of $Q_{\alpha,\beta}$.

Then we map a floating-point number $w$ to $Q_{\alpha,\beta}$ by

$$q(w) = \underset{v \in Q_{\alpha,\beta}}{\arg\min}\; |w - v|, \qquad (2)$$

where $q(\cdot)$ denotes the nearest-rounding operator w.r.t. $Q_{\alpha,\beta}$. For a real vector $\mathbf{w} = (w_1, \dots, w_d)$, we map it to $Q_{\alpha,\beta}^d$ elementwise,

$$q(\mathbf{w}) = \big(q(w_1), \dots, q(w_d)\big). \qquad (3)$$

Further, $q(\cdot)$ can be generalized to higher-dimensional tensors by first flattening them into one-dimensional vectors and then applying Eq. 3.

Since all values larger than $\beta + \alpha$ (or smaller than $\beta - \alpha$) will be clipped, $\alpha$ is also called the clipping factor. Supposing $Q_{\alpha,\beta}$ is used to quantize a vector $\mathbf{w}$, a naive choice of $\alpha$ is the maximal absolute value of the centered elements of $\mathbf{w}$, in which case no element is clipped. However, because the weights in a layer/channel of a neural network empirically follow a bell-shaped distribution, properly shrinking $\alpha$ can boost performance. Different clipping methods have been proposed to optimize $\alpha$ (zhao2019improving).

There are two common configurations for post-training quantization: per-layer quantization and per-channel quantization. Per-layer quantization assigns the same $\alpha$ and $\beta$ to all the weights in a layer. Per-channel quantization is more fine-grained, using a different $\alpha$ and $\beta$ for each channel. The latter can achieve higher precision, but it also requires a more complicated hardware design.
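As a concrete illustration of the scheme in Eqs. 1-3, the sketch below implements a $b$-bit linear quantizer in NumPy and applies it per-layer and per-channel. It is a minimal sketch, assuming a uniform grid of $2^b$ points on $[-1, 1]$ scaled by $\alpha$ and shifted by $\beta$; the function name `linear_quantize` and the default choices are ours, not the paper's reference implementation.

```python
import numpy as np

def linear_quantize(w, bits=4, alpha=None, beta=0.0):
    """b-bit linear quantization onto the set {alpha * u + beta : u in U_b} (cf. Eqs. 1-3).

    U_b is the uniform grid of 2^bits points on [-1, 1]; alpha is the scaling (clipping)
    factor and beta the center. If alpha is None it defaults to the maximal absolute
    deviation from beta, so no element is clipped.
    """
    w = np.asarray(w, dtype=np.float64)
    if alpha is None:
        alpha = np.max(np.abs(w - beta)) + 1e-12
    delta = 2.0 / (2 ** bits - 1)                       # increment between grid elements
    u = np.clip((w - beta) / alpha, -1.0, 1.0)
    k = np.round((u + 1.0) / delta)                     # nearest-rounding to a grid index
    return alpha * (-1.0 + k * delta) + beta

W = np.random.randn(8, 16)                              # 8 output channels, 16 inputs

# Per-layer quantization: one (alpha, beta) shared by the whole layer.
W_per_layer = linear_quantize(W, bits=4)

# Per-channel quantization: a separate clipping factor for each output channel.
W_per_channel = np.stack([linear_quantize(W[c], bits=4) for c in range(W.shape[0])])
print(np.linalg.norm(W - W_per_layer), np.linalg.norm(W - W_per_channel))
```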

2.2 Multipoint Quantization and Optimization

1:  Input: weight $\mathbf{w}$, integer $n$, maximal step size $\epsilon_{\max}$ for grid search, a fixed quantization set $Q$.
2:  Initialize the residual $\mathbf{r}_0 = \mathbf{w}$.
3:  for $k = 1, \dots, n$ do
4:     Compute $\delta(\mathbf{r}_{k-1})$, the minimal gap of $\mathbf{r}_{k-1}$, as in Definition 1.
5:     Set the step size $\epsilon_k$ and the search range as in Eq. 11.
6:     Solve Eq. 9 by grid search over the search range with step size $\epsilon_k$ to find $a_k$.
7:     Set $\mathbf{w}_k$ as in Eq. 10, and $\mathbf{r}_k = \mathbf{r}_{k-1} - a_k \mathbf{w}_k$.
8:  end for
9:  Return $\{(a_k, \mathbf{w}_k)\}_{k=1}^{n}$.
Algorithm 1: Optimization of Problem 5

We propose multipoint quantization, which can be implemented with common operands on commodity hardware.

Consider a linear layer in a neural network, which is either a fully-connected (FC) layer or a convolutional layer. The weight of a channel is a vector for an FC layer, or a convolution kernel for a convolutional layer. For simplicity, we only present the FC case in this section; it easily generalizes to convolutional layers. Supposing the input to this layer is a $d$-dimensional vector $\mathbf{x} \in \mathbb{R}^d$, the real-valued weight of a channel can be denoted as $\mathbf{w} \in \mathbb{R}^d$. Multipoint quantization approximates $\mathbf{w}$ with a weighted sum of a set of low-precision weight vectors,

$$\mathbf{w} \;\approx\; \sum_{i=1}^{n} a_i\, \mathbf{w}_i, \qquad (4)$$

where $a_i \in \mathbb{R}$ and $\mathbf{w}_i \in Q^d$ for $i = 1, \dots, n$. Multipoint quantization allows more freedom in representing the full-precision weight. Fig. 1 illustrates the case $n = 2$, where we add an additional constraint on the coefficients (so that the approximation set forms line segments between grid points) for better visualization. Naive quantization approximates a weight by its nearest grid point, while multipoint quantization approximates it by the nearest point on the linear segments. If we drop this constraint, we can in fact represent every point on the 2-dimensional plane with multipoint quantization.
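As a toy numerical illustration of this extra freedom (our own example, with the 1-bit values $\{\pm 1\}$ standing in for the low-bit grid): with $n = 2$ and unconstrained coefficients, any point in the 2-dimensional plane can be written exactly as a combination of two binary vectors.

```python
import numpy as np

w = np.array([0.3, -0.7])                               # full-precision 2-d weight
w1, w2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])    # two binary (1-bit) vectors
a = np.linalg.solve(np.stack([w1, w2], axis=1), w)      # coefficients a1, a2
print(a, a[0] * w1 + a[1] * w2)                         # reconstructs [0.3, -0.7] exactly
```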

Given a fixed $n$, we want to find the optimal pairs $\{(a_i, \mathbf{w}_i)\}_{i=1}^{n}$ that minimize the $\ell_2$-norm between the real-valued weight and the weighted sum,

$$\min_{\{a_i \in \mathbb{R},\; \mathbf{w}_i \in Q^d\}} \;\Big\| \mathbf{w} - \sum_{i=1}^{n} a_i \mathbf{w}_i \Big\|_2. \qquad (5)$$

Problem 5 yields a difficult combinatorial optimization. We can obtain an exact approximation when $n = d$ by taking $a_i = w_i$ and $\mathbf{w}_i = \mathbf{e}_i$, where $\mathbf{e}_i$ is the one-hot vector with the $i$-th element equal to 1 and all other elements equal to 0. However, $d$ is always large in deep neural networks, and our goal is to approximate $\mathbf{w}$ with a small enough $n$. Hence, we propose an efficient greedy method, which sequentially adds the best pairs one by one. Specifically, we obtain the $k$-th pair by approximating the residual from the previous $k-1$ pairs,

$$(a_k, \mathbf{w}_k) = \underset{a \in \mathbb{R},\; \mathbf{w}' \in Q^d}{\arg\min} \;\big\| \mathbf{r}_{k-1} - a\, \mathbf{w}' \big\|_2, \qquad (6)$$

where $\mathbf{r}_{k-1}$ is the residual from the first $k-1$ pairs,

$$\mathbf{r}_{k-1} = \mathbf{w} - \sum_{i=1}^{k-1} a_i \mathbf{w}_i. \qquad (7)$$

For a fixed $a$, the optimal $\mathbf{w}'$ is obtained by elementwise rounding of the scaled residual,

$$\mathbf{w}'(a) = q\!\left(\mathbf{r}_{k-1} / a\right). \qquad (8)$$

Now we only need to solve for the optimal coefficient $a_k$,

$$a_k = \underset{a}{\arg\min} \;\big\| \mathbf{r}_{k-1} - a\, q(\mathbf{r}_{k-1}/a) \big\|_2. \qquad (9)$$

Because $q(\cdot)$ is not differentiable, Eq. 9 is hard to optimize by gradient descent. Instead, we adopt grid search to find $a_k$ efficiently. Once the optimal $a_k$ is found, the corresponding $\mathbf{w}_k$ is

$$\mathbf{w}_k = q\!\left(\mathbf{r}_{k-1} / a_k\right). \qquad (10)$$

Choice of Parameters for Grid Search:

Grid search enumerates all values from a candidate set and selects the value that achieves the lowest error. The parameters of grid search, the search range and the step size, are defined as the interval to be searched and the increment between candidate values, respectively. The choice of search range and step size is critical. We first define the minimal gap of a vector, and then give our choice of search range and step size.

The minimal gap is the minimal distance between two elements of a vector. It restricts the maximal admissible value of the step size.

Definition 1 (Minimal Gap)

Given any vector $\mathbf{v} = (v_1, \dots, v_d)$, the minimal gap of $\mathbf{v}$ is defined as $\delta(\mathbf{v}) = \min_{i \neq j} |v_i - v_j|$.

Then we propose the following choice of the search range and the step size $\epsilon_k$,

(11)

where $\epsilon_{\max}$ is a predefined maximal step size used to accelerate convergence. In Sec. 3, we show that with this choice of search range and step size, the approximation error of our algorithm is guaranteed to converge to zero. As $n$ increases, the dimension of the approximation set increases; intuitively, the nearest distance from an arbitrary point to the approximation set decreases exponentially with $n$. We rigorously prove in Sec. 3 that the error of the greedy algorithm decays at an exponential rate. Algorithm 1 summarizes the optimization procedure.
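A minimal NumPy sketch of the greedy procedure in Algorithm 1, under our own simplifying assumptions: the quantization set is a fixed $b$-bit grid on $[-1, 1]$, and the adaptive step-size rule of Eq. 11 is replaced by a plain uniform grid over the coefficient $a$ whose range is rescaled by the current residual. Function and variable names are ours, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def quantize_to_grid(v, bits=4):
    """Round each element of v to the nearest point of the fixed uniform b-bit grid on [-1, 1]."""
    delta = 2.0 / (2 ** bits - 1)
    k = np.clip(np.round((np.clip(v, -1.0, 1.0) + 1.0) / delta), 0, 2 ** bits - 1)
    return -1.0 + k * delta

def greedy_multipoint(w, n_points, bits=4, n_grid=2000):
    """Greedily approximate w by sum_i a_i * w_i with each w_i on the low-bit grid.

    Each iteration grid-searches the scalar coefficient a; for a fixed a the optimal
    point is the elementwise rounding of the scaled residual (Eqs. 8-10), and the
    chosen term is subtracted from the residual.
    """
    residual = np.asarray(w, dtype=np.float64).copy()
    coeffs, points = [], []
    for _ in range(n_points):
        grid = np.linspace(1e-3, 2.0, n_grid) * (np.max(np.abs(residual)) + 1e-12)
        best = None
        for a in grid:
            w_q = quantize_to_grid(residual / a, bits)
            err = np.linalg.norm(residual - a * w_q)
            if best is None or err < best[0]:
                best = (err, a, w_q)
        _, a, w_q = best
        coeffs.append(a)
        points.append(w_q)
        residual = residual - a * w_q
    return np.array(coeffs), np.array(points), residual

w = np.random.randn(64)
for n in (1, 2, 3):
    _, _, r = greedy_multipoint(w, n, bits=4)
    print(n, np.linalg.norm(r))    # the residual norm shrinks quickly as points are added
```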

2.3 Multipoint Quantization on Deep Networks

Figure 2: The error plot of the output in a quantized ResNet-18 (y-axis: output error; x-axis: layer index). The red dot is the mean output error over all channels in the corresponding layer. The dark bars show the maximum and minimum. The shaded region indicates the 15th to 85th percentile. Observations: (1) only a small portion of neurons have large error; (2) the early layers are more sensitive to quantization.

1:  Input: a full-precision network $f$, a predefined threshold $\theta$, a calibration set $\mathcal{D}_c$ of data points.
2:  Run a forward pass of $f$ on the calibration set to get the input batch $X_l$ for each layer $l$ in $f$.
3:  for each layer $l$ in $f$ do
4:     for each channel $c$ in layer $l$ do
5:        Quantize the channel weight $\mathbf{w}_c$ and compute its output error $E(\mathbf{w}_c)$ as in Eq. 12.
6:        if $E(\mathbf{w}_c) > \theta$ then
7:           Apply multipoint quantization with Algorithm 1 and keep increasing $n$ until $E(\mathbf{w}_c) \le \theta$. Get $\{(a_i, \mathbf{w}_i)\}_{i=1}^{n}$.
8:           Replace the quantized weight of channel $c$ with $\sum_{i=1}^{n} a_i \mathbf{w}_i$.
9:        end if
10:     end for
11:  end for
12:  Return the QNN $\hat{f}$.
Algorithm 2: Generate QNN with Multipoint Quantization

We now describe how to apply multipoint quantization to deep neural networks. Multipoint quantization can decrease the quantization error of a channel significantly, but every additional quantized filter incurs additional memory and computation cost. Therefore, to apply it to deep networks, we must select the important channels and compensate for their quantization error with multipoint quantization.

For a layer with $d$-dimensional input, we adopt a simple criterion, the output error, to determine the target channels. The output error is the difference between the output of a channel before and after quantization. Suppose the weight of a channel is $\mathbf{w}$; its output error is defined as

$$E(\mathbf{w}) = \big\| X\mathbf{w} - X\,\hat{\mathbf{w}} \big\|_2, \qquad (12)$$

where $X$ is the input batch to the layer, collected by running a forward pass of $f$ with the calibration set $\mathcal{D}_c$, and $\hat{\mathbf{w}}$ is the quantized weight. Our goal is to keep the output of each channel approximately invariant. If $E(\mathbf{w})$ is larger than a predefined threshold $\theta$, we apply multipoint quantization to this channel and increase $n$ until $E(\mathbf{w}) \le \theta$. A similar idea is leveraged to determine the optimal clipping factor $\alpha$,

$$\alpha^* = \underset{\alpha}{\arg\min} \sum_{\mathbf{w} \in W_\alpha} \big\| X\mathbf{w} - X\, q_{\alpha}(\mathbf{w}) \big\|_2, \qquad (13)$$

where $W_\alpha$ is the set of weights sharing the same $\alpha$. For per-layer quantization, $W_\alpha$ contains the weights of all the channels in a layer. For per-channel quantization, $W_\alpha$ contains only one element, the weight of a single channel.
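The channel-selection rule of Algorithm 2 can be sketched as follows for a single FC layer. The function `quantize_layer`, the compact inner greedy routine and the threshold value are our own simplified stand-ins (the real procedure operates on the whole network and uses the step-size rule of Eq. 11), so this illustrates the output-error criterion of Eq. 12 rather than reproducing the authors' implementation.

```python
import numpy as np

def q_grid(v, bits=4):
    """Nearest point of the fixed uniform b-bit grid on [-1, 1] (elements outside are clipped)."""
    delta = 2.0 / (2 ** bits - 1)
    k = np.clip(np.round((np.clip(v, -1.0, 1.0) + 1.0) / delta), 0, 2 ** bits - 1)
    return -1.0 + k * delta

def multipoint(w, n, bits=4, n_grid=500):
    """Greedy n-point approximation of w; returns the reconstructed (de-quantized) weight."""
    residual, approx = w.astype(np.float64).copy(), np.zeros_like(w, dtype=np.float64)
    for _ in range(n):
        grid = np.linspace(1e-3, 2.0, n_grid) * (np.max(np.abs(residual)) + 1e-12)
        errs = [np.linalg.norm(residual - a * q_grid(residual / a, bits)) for a in grid]
        a = grid[int(np.argmin(errs))]
        term = a * q_grid(residual / a, bits)
        approx, residual = approx + term, residual - term
    return approx

def quantize_layer(W, X, threshold, bits=4, max_points=8):
    """Channel selection of Algorithm 2 (simplified): W is (channels, d), X is (batch, d).

    Each channel starts with a single point; channels whose output error (Eq. 12)
    exceeds the threshold receive additional points until the error drops below it.
    """
    W_hat = np.empty_like(W, dtype=np.float64)
    for c in range(W.shape[0]):
        n = 1
        while True:
            W_hat[c] = multipoint(W[c], n, bits)
            output_error = np.linalg.norm(X @ W[c] - X @ W_hat[c])
            if output_error <= threshold or n >= max_points:
                break
            n += 1
    return W_hat

X = np.random.randn(32, 64)          # stand-in for the calibration input batch
W = np.random.randn(16, 64)          # 16 output channels, 64-dimensional input
W_hat = quantize_layer(W, X, threshold=2.0)
print(np.linalg.norm(X @ W.T - X @ W_hat.T))
```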

2.4 Analysis of Overhead

Figure 3: Computation flowchart of the dot product (a) in a typical QNN and (b) with multipoint quantization.

We first describe how the dot product can be implemented with common operands when adopting multipoint quantization, and then analyze the memory and computation overhead.

Fig. 3 (a) shows the computation flowchart of the dot product in a normal QNN (neta_zmora_2018_1297430; jacob2018quantization). For a $d$-dimensional input and weights quantized to $b$ bits, computing the dot product requires $d$ multiplications between two $b$-bit integers. The result of the dot product is stored in a 32-bit accumulator, since the sum of the individual products can exceed $b$ bits. The above operation is called Multiply-Accumulate (MAC), which has been highly optimized in modern deep learning hardware (chen2016eyeriss). The 32-bit integer is then re-quantized according to the quantization scheme of the output.

Now we delve into the computation pipeline when $n > 1$. Because the coefficients $a_i$ are real-valued, we transform them into a hardware-friendly integer representation beforehand,

$$\hat{a}_i = \operatorname{round}\!\big(a_i \cdot 2^{m}\big), \qquad i = 1, \dots, n. \qquad (14)$$

Here, $m$ determines the precision of the quantized coefficients. We use the same $m$ for all the weights with multipoint quantization in the network. The $\hat{a}_i$ are stored as 32-bit integers. The quantization of the $a_i$ can be performed offline before deploying the QNN. We point out that

$$\mathbf{x}^\top \Big( \sum_{i=1}^{n} a_i \mathbf{w}_i \Big) \;\approx\; \frac{1}{2^{m}} \sum_{i=1}^{n} \hat{a}_i \big( \mathbf{x}^\top \mathbf{w}_i \big). \qquad (15)$$

We divide the computation into three steps; readers can follow Fig. 3 (b).

Step 1: Matrix Multiplication.    In the first step, we compute the integer dot products $\mathbf{x}^\top \mathbf{w}_i$ for $i = 1, \dots, n$. The results are stored in 32-bit accumulators.

Step 2: Coefficient Multiplication and Summation.    The second step multiplies each accumulator $\mathbf{x}^\top \mathbf{w}_i$ by its integer coefficient $\hat{a}_i$, which takes $n$ multiplications between two 32-bit integers. Then we sum the products together with $n - 1$ additions.

Step 3: Bit Shift.    Finally, the division by $2^{m}$ can be efficiently implemented by shifting the bits of the accumulated sum to the right by $m$ positions. We ignore the computation overhead of this step.
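A small integer-arithmetic emulation of the three steps (our own NumPy sketch; the fixed-point precision `m`, the helper `fixed_point` and the toy sizes are assumptions, and a real deployment would use the hardware's MAC units rather than NumPy):

```python
import numpy as np

def fixed_point(a, m=16):
    """Integer representation of a real coefficient: round(a * 2^m) (cf. Eq. 14)."""
    return int(np.round(a * (1 << m)))

d, m = 64, 16
x_int  = np.random.randint(-128, 128, size=d).astype(np.int64)    # 8-bit activations
w_ints = [np.random.randint(-8, 8, size=d).astype(np.int64)       # two 4-bit weight vectors
          for _ in range(2)]
a_real = [0.037, -0.0051]                                          # real coefficients a_i
a_int  = [fixed_point(a, m) for a in a_real]

# Step 1: integer matrix multiplication, accumulated in wide registers.
accs = [int(np.dot(x_int, w)) for w in w_ints]

# Step 2: multiply each accumulator by its integer coefficient and sum (n MULs, n-1 ADDs).
total = sum(a * acc for a, acc in zip(a_int, accs))

# Step 3: divide by 2^m with an arithmetic right shift.
y_int = total >> m

# The integer pipeline matches the floating-point result up to fixed-point error.
y_float = sum(a * np.dot(x_int, w) for a, w in zip(a_real, w_ints))
print(y_int, y_float)
```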

 

Method | Memory | MULs | ADDs
Naive | $db$ | $d$ | $d - 1$
Multipoint | $ndb + 32n$ | $nd + n$ | $n(d-1) + (n-1)$
Table 1: Comparison of memory and computation consumption between a naively quantized layer and a layer using multipoint quantization ($d$-dimensional weight vector, $b$-bit weights, $n$ points, 32-bit coefficients).

Overall Storage/Computation Overhead: We count the number of binary operations following the same bit-op counting strategy as li2019additive and zhou2016dorefa: the multiplication between a $p$-bit and a $q$-bit integer costs $pq$ binary operations. Suppose we have a $d$-dimensional weight vector quantized to $b$ bits. We compare the memory cost and the computational cost (the dot product with a low-bit input $\mathbf{x}$) between naive quantization and multipoint quantization with $n$ points. The results are summarized in Table 1. Because $d$ is always large in neural networks, the memory and computation overhead is approximately proportional to $n$.
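To make the accounting concrete, the sketch below estimates the memory and OPs of a single channel's dot product under the bit-op convention used in our experiments (a $p$-bit by $q$-bit multiplication counted as $pq/64$ OPs, cf. Sec. 4.1). The formulas follow the pipeline description above rather than Table 1 verbatim, and the function names are ours.

```python
def ops(p_bits, q_bits):
    """A p-bit by q-bit multiplication, counted in OPs (1 OP = one 8-bit x 8-bit multiplication)."""
    return p_bits * q_bits / 64.0

def channel_cost(d, w_bits=4, a_bits=8, n_points=1, coeff_bits=32):
    """Rough memory (bits) and OPs for one channel's dot product.

    n_points = 1 is naive quantization; each extra point adds one low-bit weight
    vector plus one 32-bit coefficient multiplication (bit shifts are ignored).
    """
    extra = n_points if n_points > 1 else 0
    memory = n_points * d * w_bits + extra * coeff_bits
    compute = n_points * d * ops(w_bits, a_bits) + extra * ops(coeff_bits, coeff_bits)
    return memory, compute

base = channel_cost(d=512, n_points=1)
multi = channel_cost(d=512, n_points=2)
print(multi[0] / base[0], multi[1] / base[1])   # both overheads grow roughly in proportion to n
```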

3 Theoretical Analysis

Figure 4: A toy experiment where the weight $\mathbf{w}$ is randomly generated (y-axis: residual norm; x-axis: iteration). We test three different step sizes for grid search. As the step size gets smaller, the final residual approaches zero. The dashed line indicates the first step at which the grid search is no longer fine enough relative to the residual. Before the dashed line, the residual decays exponentially; after the dashed line, the grid search does not have enough precision and the residual can no longer be reduced.

In this section, we give a convergence analysis of the proposed optimization procedure. We prove that the quantization error of the proposed greedy optimization decays exponentially w.r.t. the number of points.

Suppose that we want to quantize a real-valued $d$-dimensional weight $\mathbf{w}$. For simplicity, we assume binary precision in this section, which leads to $Q = \{-1, +1\}$; our proof can be generalized to larger quantization sets easily. We follow the notation of Section 2.2. At the $k$-th iteration, the residual $\mathbf{r}_k$, the coefficient $a_k$ and the point $\mathbf{w}_k$ are defined by Eq. (7), Eq. (9) and Eq. (10), respectively. The minimal gap of a vector $\mathbf{v}$, $\delta(\mathbf{v})$, is defined in Definition 1.

Let the loss function be $L_k = \|\mathbf{r}_k\|_2^2$, the squared norm of the residual after $k$ points. Now we can prove the following rate under mild assumptions.

Theorem 1 (Exponential Decay)

Suppose that at the $k$-th iteration of the algorithm, $a_k$ is obtained by grid search over the range specified in Eq. 11 with step size $\epsilon_k$, and assume that $\epsilon_k \le \epsilon_{\max}$ for any step before termination, where $\epsilon_{\max}$ is a predefined maximal step size. We have

$$L_k \;\le\; e^{-c\,k}\, L_0 \;+\; O(\epsilon_{\max}),$$

for some constant $c > 0$.

The proof is in Appendix A. Note that the $O(\epsilon_{\max})$ term is usually much smaller than the exponential term and can thus be ignored. Theorem 1 suggests that if we use a sufficiently small step size for the optimization, the loss decreases exponentially. Because of this exponentially fast decay, we find that a small $n$ suffices for most of the channels that use multipoint quantization in practice. Fig. 4 supports the theoretical analysis with a toy experiment.
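The toy experiment of Fig. 4 is easy to reproduce in spirit. The sketch below (our own; binary case $Q = \{-1, +1\}$, plain uniform grid search for the coefficient starting at 0) shows the residual norm decaying roughly geometrically until the grid becomes too coarse, after which it stops improving.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(32)

def greedy_binary_residual_norms(w, n_points=10, step=1e-3):
    """Greedy binary multipoint quantization (Q = {-1, +1}) with uniform grid search over a.

    For a fixed a > 0 the optimal binary point is sign(r); the grid includes a = 0,
    so the residual never grows once the step size becomes too coarse to help.
    """
    residual = w.astype(np.float64).copy()
    norms = []
    for _ in range(n_points):
        grid = np.arange(0.0, 2.0 * np.max(np.abs(residual)) + step, step)
        s = np.sign(residual)
        errs = np.array([np.linalg.norm(residual - a * s) for a in grid])
        a = grid[int(np.argmin(errs))]
        residual = residual - a * s
        norms.append(np.linalg.norm(residual))
    return norms

for step in (1e-1, 1e-2, 1e-3):
    print(step, [f"{n:.4f}" for n in greedy_binary_residual_norms(w, step=step)])
```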

4 Experiments

We evaluate our method on two tasks, ImageNet classification (krizhevsky2012imagenet) and PASCAL VOC object detection (pascal-voc-2007). Our evaluation covers a variety of neural network architectures.

Model Bits (W/A) Method Acc (Top-1/Top-5) (%) Size OPs

 

VGG19-BN 32/32 Full-Precision 74.24/91.85 76.42MB -
4/8 w/o Multipoint 60.81/83.68 9.55MB 9.754G
OCS (zhao2019improving) 62.11/84.59 10.70MB 10.924G
Ours 64.06/86.14 9.59MB 10.923G

 

ResNet-18 32/32 Full-Precision 69.76/89.08 42.56MB -
4/8 w/o Multipoint 54.04/78.10 5.32MB 847.78M
OCS (zhao2019improving) 58.05/81.57 6.20MB 988.51M
Ours 61.68/84.03 5.37MB 983.22M

 

ResNet-101 32/32 Full-Precision 77.37/93.56 161.68MB -
4/8 w/o Multipoint 61.04/83.02 20.21MB 3.841G
OCS (zhao2019improving) 70.27/89.73 23.40MB 4.448G
Ours 73.09/91.34 20.86MB 4.446G

 

WideResNet-50 32/32 Full-Precision 78.51/94.09 262.64MB -
4/8 w/o Multipoint 61.78/83.60 31.83MB 5.639G
OCS (zhao2019improving) 68.54/88.68 35.97MB 6.372G
Ours 70.47/89.43 32.08MB 6.365G

 

Inception-v3 32/32 Full-Precision 77.45/93.56 82.96MB -
4/8 w/o Multipoint 5.17/12.85 10.37MB 2.846G
OCS (zhao2019improving) 8.49/17.75 12.16MB 3.338G
Ours 33.89/56.07 10.42MB 3.337G

 

Mobilenet-v2 32/32 Full-Precision 71.78/90.19 8.36MB -
8/8 w/o Multipoint 0.06/0.15 2.090MB 299.49M
OCS (zhao2019improving) N/A N/A N/A
Ours 70.70/89.70 2.091MB 357.29M

 

Table 2: Per-layer quantization on the ImageNet benchmark (W=Weight, A=Activation, M=10^6, G=10^9, Acc=Accuracy). Bold refers to the method with the highest Top-1 accuracy. Note that OCS cannot be applied to MobileNet-v2 because it cannot deal with group convolution.
Model Bits (W/A) Method Acc (Top-1/Top-5) (%) Size OPs

 

VGG19-BN 32/32 Full-Precision 74.24/91.85 76.42MB -
4/4 w/o Multipoint 52.08/76.19 9.55MB 4.877G
MP (banner2019post) 70.59/90.08 9.55MB 4.877G
Ours 71.96/90.75 9.63MB 5.525G
Ours + Clip 72.78/91.23 9.58MB 5.354G

 

ResNet-18 32/32 Full-Precision 69.76/89.08 42.56MB -
4/4 w/o Multipoint 57.00/80.40 5.32MB 423.89M
MP (banner2019post) 64.78/85.90 5.32MB 423.89M
Ours 64.29/85.59 5.39MB 494.16M
Ours + Clip 65.89/86.68 5.41MB 470.89M

 

ResNet-50 32/32 Full-Precision 76.15/92.87 89.44MB -
4/4 w/o Multipoint 65.88/86.93 11.18MB 992.28M
MP (banner2019post) 72.52/90.80 11.18MB 992.28M
Ours 71.88/90.43 11.33MB 1.148G
Ours + Clip 72.67/91.11 11.32MB 1.128G

 

ResNet-101 32/32 Full-Precision 77.37/93.56 161.68MB -
4/4 w/o Multipoint 69.67/89.21 20.21MB 1.920G
MP (banner2019post) 74.22/91.95 20.21MB 1.920G
Ours 71.56/90.36 20.82MB 2.177G
Ours+Clip 72.85/91.16 21.04MB 2.189G

 

Inception-v3 32/32 Full-Precision 77.45/93.56 82.96MB -
4/4 w/o Multipoint 12.12/25.24 10.37MB 1.423G
MP (banner2019post) 60.64/82.15 10.37MB 1.423G
Ours 61.22/83.27 10.44MB 1.692G
Ours+Clip 65.49/86.72 10.38MB 1.519G

 

Mobilenet-v2 32/32 Full-Precision 71.78/90.19 8.36MB -
4/4 w/o Multipoint 6.86/16.76 1.04MB 74.87M
MP (banner2019post) 42.61/67.78 1.04MB 74.87M
Ours 27.52/50.80 1.05MB 91.16M
Ours+Clip 55.54/79.10 1.045MB 85.88M

 

Table 3: Per-channel quantization on the ImageNet benchmark (W=Weight, A=Activation, M=10^6, G=10^9, MP=Mixed Precision, Acc=Accuracy). Note that MP requires specialized hardware. Bold refers to the method with the highest Top-1 accuracy. 'Clip' means using the optimal clipping factor obtained by solving Eq. 13.

4.1 Experiment Results on ImageNet Benchmark

We evaluate our method on the ImageNet classification benchmark. For fair comparison, we use the pretrained models provided by PyTorch (https://pytorch.org/), as in prior work (zhao2019improving; banner2019post). We take 256 images from the training set as the calibration set, which is used to quantize activations and to choose the channels on which to perform multipoint quantization. To improve the performance of low-bit activation quantization, we pick the optimal clipping factor for activations by minimizing the mean squared error (sung2015resiliency). As in previous works, the weights of the first and the last layer are always quantized to 8 bits (nahshan2019loss; li2019additive; banner2019post). For all experiments, we use the same fixed maximal step size $\epsilon_{\max}$ for the grid search in Eq. 9.
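The activation clipping used here, i.e., choosing the clipping factor that minimizes the mean squared quantization error on the calibration batch, can be sketched as a one-dimensional search; the quantizer (unsigned, assuming ReLU activations) and the candidate grid below are our own simplified stand-ins.

```python
import numpy as np

def quantize_activations(x, alpha, bits=8):
    """Uniform b-bit quantization of non-negative activations clipped to [0, alpha]."""
    delta = alpha / (2 ** bits - 1)
    return np.clip(np.round(np.clip(x, 0.0, alpha) / delta), 0, 2 ** bits - 1) * delta

def mse_clipping_factor(activations, bits=8, n_grid=100):
    """Pick the clipping factor minimizing the MSE between real and quantized activations."""
    a_max = float(activations.max())
    candidates = np.linspace(a_max / n_grid, a_max, n_grid)
    mses = [np.mean((activations - quantize_activations(activations, a, bits)) ** 2)
            for a in candidates]
    return candidates[int(np.argmin(mses))]

acts = np.maximum(np.random.randn(10000), 0.0)    # stand-in for calibration-set activations
print(mse_clipping_factor(acts, bits=4))          # clipping below the max improves low-bit MSE
```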

We report both model size and number of operations under different bit-width settings for all methods; the first and the last layer are not counted. We follow the same bit-op counting strategy as li2019additive and zhou2016dorefa to count the number of binary operations. One OP is defined as one multiplication between an 8-bit weight and an 8-bit activation, which takes 64 binary operations; the multiplication between a $p$-bit and a $q$-bit integer is thus counted as $pq/64$ OPs.

We provide two categories of results: per-layer quantization and per-channel quantization. In per-layer quantization, all the channels in a layer use the same $\alpha$ and $\beta$. In per-channel quantization, each channel has its own $\alpha$ and $\beta$. For both settings, we test six different networks in our experiments, including VGG-19 with BN (simonyan2014very), ResNet-18, ResNet-101, WideResNet-50 (he2016deep), Inception-v3 (szegedy2015going) and MobileNet-v2 (sandler2018mobilenetv2).

Per-layer Quantization

For per-layer quantization, we compare our method with a state-of-the-art (SOTA) baseline, Outlier Channel Splitting (OCS) (zhao2019improving). OCS duplicates the channel with the maximal absolute value and halves it to mitigate the quantization error. For fair comparison, we choose the best clipping method for OCS among the four methods considered in their paper (sung2015resiliency; migacz20178; banner2019post). We select the threshold $\theta$ such that the OPs of the QNN with multipoint quantization are about 1.15 times those of the naive QNN, and we expand the network with OCS until it has similar OPs to the QNN using multipoint quantization. The results without multipoint quantization (denoted 'w/o Multipoint' in Table 2) serve as another baseline. We quantize the activations and the weights to the same precision as the baselines. Experiment results are presented in Table 2. Our method obtains a consistent and significant gain on all models compared with 'w/o Multipoint', with little increase in memory. Our method also consistently outperforms OCS under the same computational budget; in particular, on ResNet-18, ResNet-101 and Inception-v3, it surpasses OCS by more than 2% Top-1 accuracy. OCS cannot quantize MobileNet-v2 due to the group convolution layers, while our method nearly recovers the full-precision accuracy. Our method achieves performance similar to Data Free Quantization (nagel2019data) (71.19% Top-1 accuracy with 8-bit MobileNet-v2), which focuses on 8-bit quantization of MobileNets only. Note that this method is orthogonal to ours, and we expect further improvement by combining the two.

Per-channel Quantization    For per-channel quantization, we compare our method with another SOTA baseline, banner2019post, which requires physical per-channel mixed precision computation since it assigns different bit widths to different channels. We denote it as 'Mixed Precision (MP)'. All networks are quantized with asymmetric per-channel quantization. Since per-channel quantization has higher precision, weight clipping is not performed for naive quantization, i.e., the clipping factor is set to the maximal absolute value so that no weight is clipped. We quantize both weights and activations to 4 bits. Experiment results are presented in Table 3.

Our method outperforms MP on VGG19-BN and Inception-v3 even without weight clipping. After performing weight clipping with Eq. 13, our method beats MP on 5 out of 6 networks, the exception being ResNet-101. On VGG19-BN, Inception-v3 and MobileNet-v2, the Top-1 accuracy of our method after clipping is more than 2% higher than MP. In these experiments, the memory overhead is below 5% and the computation overhead is no more than 17% relative to the naive QNN.

4.2 Experiment Results on PASCAL VOC Object Detection Benchmark

We test the Single Shot MultiBox Detector (SSD), a well-known object detection framework, using an open-source implementation (https://github.com/amdegroot/ssd.pytorch) with a VGG16 backbone. We apply per-layer quantization and per-channel quantization to all layers, excluding the localization and classification layers. Due to the GPU memory constraint, the calibration set contains only 6 images. We measure the mean average precision (mAP), size and OPs of the quantized model. We perform activation clipping and weight clipping for both settings.

In per-layer quantization, our method improves the baseline by over 1% mAP (72.86% → 74.10%). When the weights are quantized to 3 bits, our method boosts the baseline by 4.38% mAP (42.56% → 46.94%) with a memory overhead of only 0.01MB. Our method also performs well in per-channel quantization, improving the baseline by 0.41% mAP for 4-bit quantization and 1.09% mAP for 3-bit quantization. In general, the gain of our method is larger at lower bit widths.

 

W/A Method mAP(%) Size(MB) OPs(G)

 

32/32 FP 77.43 100.24 -
4/8 w/o Multipoint 72.86 12.53 15.69
Ours 74.10 12.63 17.58
3/8 w/o Multipoint 42.56 9.40 11.76
Ours 46.94 9.41 12.18
(a) Per-layer Quantization

 

W/A Method mAP(%) Size(MB) OPs(G)

 

32/32 FP 77.43 100.24 -
4/4 w/o Multipoint 73.17 12.53 7.843
Ours 73.58 12.62 8.636
3/3 w/o Multipoint 59.37 9.40 4.412
Ours 60.46 9.43 4.733
(b) Per-channel Quantization
Table 4: Post-training quantization result on SSD-VGG16 (mAP=mean average precision, FP=Full-Precision). Bold refers to the method with highest mAP.
Figure 5: The trade-off between computational cost (OPs, in G) and Top-1 accuracy (%) of a quantized ResNet-101 (W4A8). 'Baseline' is the naive QNN without multipoint quantization (OPs=3.841G). 'Random' uses multipoint quantization but selects channels randomly, while our method adds channels according to their output error.
Figure 6: Relative increment of size in each layer of a quantized ResNet-101 with multipoint quantization (x-axis: layer index). The layers close to the input have a large relative increment, while the layers after the 30th layer have only a negligible increment.

4.3 Analysis of the Algorithm

We provide a case study of ResNet-101 under per-layer quantization to analyze the algorithm. More results can be found in the appendix.

Computation Overhead and Performance: Fig. 5 shows how the performance of different methods changes with the computational cost. Our method obtains a large gain with only a little overhead. OCS does not perform comparably with our method at low cost, but it catches up when the computational budget is large enough. The performance of 'Random' is consistently the worst among the three methods, implying the importance of choosing appropriate channels for multipoint quantization.

Where Multipoint Quantization is Applied: Fig. 6 shows the relative increment of size in each layer. We observe that the layers close to the input have a larger relative increment of size than later layers. Typically, the starting layers have a small size but a large computational cost. This explains why the computational overhead is larger than the memory overhead when using our method.

5 Related Works

Quantized neural networks have made significant progress when training is allowed (courbariaux2015binaryconnect; han2015deep; zhu2016trained; rastegari2016xnor; mishra2017wrpn; neta_zmora_2018_1297430; cheng2018recent; krishnamoorthi2018quantizing; li2019additive). Post-training quantization has been studied for scenarios where training is not available (krishnamoorthi2018quantizing; meller2019same; banner2019post; zhao2019improving). Hardware-aware Automated Quantization (wang2019haq) is a pioneering work that applies mixed precision to improve the accuracy of QNNs, but it requires fine-tuning the network; it inspired a line of research on training mixed precision QNNs (gong2019mixed; Dong_2019_ICCV). banner2019post first exploited mixed precision to enhance the performance of post-training quantization.

6 Conclusions

We propose multipoint quantization for post-training quantization, which is hardware-friendly and effective. It performs favorably compared with state-of-the-art methods.

References

Appendix A Proof of Theorem 1

Notice that in the main text, we define $(a_k, \mathbf{w}_k)$ as the optimal solution of each iteration (see Eq. (6)). This cannot be solved exactly, and in practice we use the grid-search solution instead. This slightly abuses notation, as the two solutions are actually different; we do this mainly for notational simplicity in the main text. In this proof we distinguish the two notations.

In our proof, we only consider the simplest case when , which means . It can be generalized to easily. Define , and , where denotes the convex hull of set . It is obvious that is an interior point of . Now we define the following intermediate update scheme. Given the current residual vector , without loss of generality, we assume all the elements of are different (if some of them are equal, we can simply treat them as the same elements). We define

Notice that as the objective is linear and we thus have

Without loss of generality, we assume , as in this case, we have and the algorithm should be terminated. Simple algebra shows that . Notice that as we assume , we have . This gives that

Hence the optimal solution under the constraint of is also . Given the current residual vector, we also define

By the definition, we have the following inequalities:

Notice that as we showed that is an interior point of , we have , for some . This gives that

We define

And it is obvious that we have . Next we bound the difference between and Notice that for any , we have

Without loss of generality we assume that , for any . Without loss of generality, we also suppose that . Under the assumption of grid search, there exists in the search space such that . For any , if , then . Now we consider the case of . By the assumption that , for any , we have

This gives that

Here the last inequality is from the assumption that . Thus we have for any , . The case for is similar by choosing . This concludes that we have in the search region such that and Thus we have

for some constant . We have

This gives that

Applying the above inequality iteratively, we have

Appendix B Experiment Details

We provide more details of our algorithm in the experiments. For per-layer and per-channel quantization, the optimal clipping factors are obtained by uniform grid search over a predefined range; for the first and last layer, we search for the optimal clipping factor on the weights over a separate predefined range. The optimal clipping factors for the weights are obtained before performing multipoint quantization and are kept fixed afterwards. For fair comparison, the Batch Normalization (BN) layers are quantized in the same way as the baselines: when comparing with OCS, the BN layers are not quantized; when comparing with banner2019post, the BN layers are absorbed into the weights and quantized together with them. A similar strategy is adopted for SSD quantization, i.e., the BN layers are kept at full precision in the per-layer setting and absorbed in the per-channel setting.
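Absorbing the BN layers into the weights uses the standard batch-norm folding identity; a minimal sketch (our own helper name, standard formula) is:

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding conv/FC layer so that
    BN(conv(x; W, b)) == conv(x; W_folded, b_folded) and only one weight tensor is quantized.

    W: (out_channels, ...) weights; b: (out_channels,) bias (zeros if the layer has none).
    """
    scale = gamma / np.sqrt(var + eps)                          # per-output-channel scale
    W_folded = W * scale.reshape(-1, *([1] * (W.ndim - 1)))     # scale each output channel
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded

# Toy usage with a 3x3 conv of 8 output channels.
W, b = np.random.randn(8, 3, 3, 3), np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)
mean, var = 0.1 * np.random.randn(8), 0.5 + np.abs(np.random.randn(8))
W_f, b_f = fold_bn_into_conv(W, b, gamma, beta, mean, var)
```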

The hyperparameter $\theta$ (the output-error threshold) used for different networks in the different settings is listed in Table 5 and Table 6.

Network VGG19-BN ResNet-18 ResNet-101 WideResNet-50 Inception-v3 Mobilenet-v2
θ 50 15 0.25 1 100 10
Table 5: Threshold θ for per-layer quantization (W/A = 4/8)
Network VGG19-BN ResNet-18 ResNet-50 ResNet-101 Inception-v3 Mobilenet-v2
θ 10 8 0.7 0.2 50 1
Table 6: Threshold θ for per-channel quantization (W/A = 4/4)

Appendix C 3-bit Quantization

We present the results of 3-bit quantization in this section. 3-bit quantization is more aggressive, and the accuracy of the QNN is typically much lower than with 4 bits. As before, we report the results of per-layer quantization and per-channel quantization. All hyper-parameters are the same as for 4-bit quantization except for the threshold θ (reported in the last column of Tables 7 and 8).

Model Bits (W/A) Method Acc (Top-1/Top-5) Size OPs θ

 

VGG19-BN 32/32 Full-Precision 74.24%/91.85% 76.42MB - -
3/8 w/o Multipoint 4.71%/12.33% 7.16MB 7.315G -
Ours 20.58%/40.38% 7.22MB 8.648G 100

 

ResNet-18 32/32 Full-Precision 69.76%/89.08% 42.56MB - -
3/8 w/o Multipoint 9.83%/24.89% 3.99MB 635.83M -
Ours 26.16%/49.29% 4.01MB 714.53M 100

 

WideResNet-50 32/32 Full-Precision 78.51%/94.09% 262.64MB - -
3/8 No Boosting 4.36%/10.64% 23.87MB 4.229G -
Ours 18.43%/35.34% 23.97MB 4.554G 5

 

Table 7: Per-layer quantization on the ImageNet benchmark (W=Weight, A=Activation, M=10^6, G=10^9, Acc=Accuracy); the last column is the threshold θ.
Model Bits (W/A) Method Acc (Top-1/Top-5) Size OPs θ

 

VGG19-BN 32/32 Full-Precision 74.24%/91.85% 76.42MB - -
3/3 w/o Multipoint 0.10%/0.492% 7.16MB 2.743G -
Ours + Clip 65.81%/87.25% 7.19MB 3.099G 50

 

ResNet-18 32/32 Full-Precision 69.76%/89.08% 42.56MB - -
3/3 w/o Multipoint 0.11%/0.55% 3.99MB 238.44M -
Ours + Clip 43.75%/69.16% 4.06MB 265.90M 20

 

MobileNet-v2 32/32 Full-Precision 71.78%/90.19% 8.36MB - -
3/3 w/o Multipoint 0.11%/0.64% 0.78MB 42.12M -
Ours+Clip 5.21%/14.33% 0.79MB 58.65M 50

 

Table 8: Per-channel quantization on the ImageNet benchmark (W=Weight, A=Activation, M=10^6, G=10^9, Acc=Accuracy); the last column is the threshold θ.

Appendix D More Visualization

We provide more experiment results for analyzing our algorithm in 4-bit quantization. Specifically, we provide the results of per-layer quantization of WideResNet-50 (Fig. 7 and Fig. 8) and per-channel quantization of ResNet-18 (Fig. 9 and Fig. 10).

Figure 7: The trade-off between computational cost (OPs, in G) and Top-1 accuracy (%) of a per-layer quantized WideResNet-50 (W4A8).
Figure 8: Relative increment of size in each layer of a per-layer quantized WideResNet-50 with multipoint quantization.
Figure 9: The trade-off between computational cost (OPs, in M) and Top-1 accuracy (%) of a per-channel quantized ResNet-18 (W4A4).
Figure 10: Relative increment of size in each layer of a per-channel quantized ResNet-18 with multipoint quantization.