Near-Lossless Post-Training Quantization of Deep Neural Networks via a Piecewise Linear Approximation

01/31/2020 ∙ by Jun Fang, et al.

Quantization plays an important role in the energy-efficient deployment of deep neural networks (DNNs) on resource-limited devices. Post-training quantization is crucial since it does not require retraining or access to the full training dataset. The conventional post-training uniform quantization scheme achieves satisfactory results when converting DNNs from full precision to 8-bit integers; however, it suffers from significant performance degradation when quantizing to lower precision such as 4 bits. In this paper, we propose a piecewise linear quantization method to enable accurate post-training quantization. Inspired by the fact that weight tensors have bell-shaped distributions with long tails, our approach breaks the entire quantization range into two non-overlapping regions for each tensor, with each region being assigned an equal number of quantization levels. The optimal breakpoint that divides the entire range is found by minimizing the quantization error. Extensive results show that the proposed method achieves state-of-the-art performance on image classification, semantic segmentation and object detection. It is possible to quantize weights to 4 bits without retraining while nearly maintaining the performance of the original full-precision model.


1 Introduction

In recent years, deep neural networks (DNNs) have achieved state-of-the-art results in a variety of learning tasks including image classification [46, 21, 22, 45, 18, 44, 26], segmentation [4, 17, 41] and object detection [31, 39, 40]. Scaling up DNNs along one or all dimensions [46] of network depth [18], width [49] or image resolution [27] attains better accuracy, at the cost of higher computational complexity and increased memory requirements, which makes them difficult to deploy on embedded devices with limited resources.

One feasible way to deploy DNNs on embedded systems is the quantization of full-precision (32-bit floating-point, FP32) weights and activations to lower-precision (such as 8-bit, INT8) integers [23]. By decreasing the bit-width, the number of discrete values is reduced, while the quantization error, which generally correlates with model performance degradation, increases. To minimize the quantization error and maintain the performance of a full-precision model, the majority of recent studies [52, 3, 35, 23, 5, 50, 11, 24] rely on training, either from scratch ("quantization-aware" training) or by fine-tuning a pre-trained floating-point model.

However, post-training quantization (PTQ) is crucial since it does not require retraining or accessibility to the full training dataset. It saves time-consuming fine-tuning effort, protects data privacy, and allows for easy and fast deployment of DNN applications. Among various PTQ schemes proposed in the literature [25, 6, 51], uniform quantization is the most popular approach to quantize weights and activations since it discretizes the domain of values to evenly-spaced low-precision integers which can be implemented on commodity hardware’s integer-arithmetic units.

Recent work [25, 28, 36] shows that PTQ based on a uniform scheme with INT8 is sufficient to preserve near original FP32 pre-trained model performance for a wide variety of DNNs. However, ubiquitous usage of DNNs requires even lower bit-width to achieve higher energy efficiency and smaller models. In lower bit-width scenarios, such as 4-bit, post-training uniform quantization causes a significant accuracy drop [25, 51]. This is mainly due to the fact that the distributions of weights and activations of pre-trained DNNs can be well approximated by Gaussian/Laplacian models [16, 30]. That is, most of the weights are clustered around zero while few of them are spread in a long tail. As a result, when operating in the low bit-width regime, uniform quantization assigns too few quantization levels to small magnitudes and too many to large ones, which leads to a significant accuracy degradation [25, 51].

To take advantage of the fact that the weights and activations of pre-trained DNNs typically have bell-shaped distributions with long tails, we propose a piecewise linear method to break the entire quantization range into two non-overlapping regions where each region is assigned an equal number of quantization levels. We restrict the number of regions to two to limit the complexity of the proposed approach and of the hardware overhead. The optimal breakpoint that divides the entire range can be found by minimizing the quantization error. Compared to uniform quantization, piecewise linear quantization provides a richer representation that reduces the quantization error under the same number of quantization levels. This indicates its potential to reduce the gap between floating-point and low-bit precision models.

The main contributions of our work are as follows:

  • We propose a piecewise linear quantization scheme for fast and efficient deployment of pre-trained DNNs without requiring retraining or access to the full training dataset. In addition, we investigate its impact on the hardware implementation.

  • We present a solution to find the optimal breakpoint and theoretically demonstrate that our method achieves lower quantization error than uniform quantization.

  • We provide a thorough evaluation on image classification, semantic segmentation and object detection benchmarks and demonstrate that our method achieves state-of-the-art results for 4-bit quantized models.

2 Related work

There is a wide variety of approaches in the literature that facilitate the efficient deployment of DNNs. The first group of techniques relies on designing network architectures built from more efficient building blocks. Notable examples include depth/point-wise layers [43, 46] as well as group convolutions [33]. These methods require domain knowledge, training from scratch and full access to the task datasets. The second group of approaches optimizes neural network architectures, typically in a task-agnostic fashion, and may or may not require (re)training. Weight pruning [16, 29, 19, 32], activation compression [9, 8, 13], knowledge distillation [20, 37] and quantization [7, 53, 23] fall under this category.

In particular, quantization of activations and weights [14, 15, 38, 55, 47, 30, 5, 50] leads to model compression and acceleration as well as to overall savings in power consumption. Model parameters can be stored in a fewer number of bits while the computation can be executed on integer-arithmetic units rather than on power-hungry floating-point ones [23]. There has been extensive research on quantization with and without (re)training. In the rest of this section, we focus on post-training quantization that directly converts full-precision pre-trained models to their low-precision counterparts.

Recent works [25, 28, 36] have demonstrated that 8-bit quantized models can achieve negligible accuracy loss for a variety of networks. To improve accuracy, per-channel (or channel-wise) quantization is introduced in [25, 28] to address variations of the weight range across channels. Weight equalization/factorization is applied by [34, 36] to rescale the difference in weight ranges between layers. In addition, bias shifts in the mean and variance of quantized values have been observed, and counteracting methods are suggested by [1, 12, 36]. A comprehensive evaluation of post-training clipping techniques is presented by [51], along with an outlier channel splitting method to improve PTQ performance. Moreover, adaptive processes that assign a different bit-width to each layer are proposed in [30, 54] to optimize the overall bit allocation.

There are also a few attempts to tackle 4-bit PTQ by combining multiple techniques. In [1], a combination of analytical clipping, bit allocation and bias correction is used, while [6] minimizes the mean squared quantization error by representing one tensor with one or multiple 4-bit tensors as well as by optimizing the scaling factors.

Most of the aforementioned works utilize a linear or uniform quantization scheme. However, linear quantization cannot capture the bell-shaped distribution of weights and activations, which results in sub-optimal solutions. To overcome this deficiency, [2] proposes a quantile-based method to improve accuracy, but their method works efficiently only on highly customized hardware. Instead, we propose a piecewise linear approach that improves over state-of-the-art results and can be implemented efficiently with minimal modification to commodity hardware.

3 Quantization scheme

Quantization schemes aim to reduce the memory and energy footprints of a DNN by converting full-precision weights and activations into low-precision representations. In this section, we review a uniform quantization scheme and discuss its limitations. We then present our novel piecewise linear quantization scheme and show that it has a smaller quantization error compared to the uniform scheme.

3.1 Uniform quantization scheme

Uniform quantization linearly maps full-precision real numbers into low-precision integer representations. Following [23, 6], the approximated version $\hat{x}$ of a value $x$ from uniform quantization at bit-width $b$ can be defined as:

    \hat{x} = Q_u(x;\, b, r, z) = s\, x_q + z, \qquad s = \frac{r}{2^b - 1}, \qquad x_q = \mathrm{clamp}\!\left(\left\lfloor \frac{x - z}{s} \right\rceil,\ 0,\ 2^b - 1\right),    (1)

where $r$ is the quantization range, $z$ is the offset, $s$ is the scaling factor, $2^b$ is the number of quantization levels, and $x_q$ is the quantized integer obtained from a rounding function $\lfloor \cdot \rceil$ followed by saturation to the integer domain $\{0, 1, \ldots, 2^b - 1\}$. We set the offset $z = -r/2$ for symmetric signed distributions combined with $r = 2\max(|x|)$, and $z = 0$ for asymmetric unsigned distributions (e.g., ReLU-based activations) with $r = \max(x)$. Since the scheme (1) introduces a quantization error defined as $\varepsilon_u = x - \hat{x}$, the expected squared quantization error is given by:

    \mathbb{E}\!\left[\varepsilon_u^2(b, r)\right] = \frac{\Delta^2}{12},    (2)

with $\Delta = s = \frac{r}{2^b - 1}$, under the assumption that the error is uniformly distributed within each quantization bin [48].

From the above definition, uniform quantization divides the range evenly, regardless of the distribution of $x$. Empirically, the distributions of weights and activations of pre-trained DNNs are similar to bell-shaped Gaussian/Laplacian distributions [30]. Therefore, uniform quantization is not always able to achieve a small enough quantization error to maintain model accuracy, especially in low-bit cases. In the following subsection, we propose a novel piecewise linear quantization scheme and show that it has stronger representational power (smaller quantization error) compared to uniform quantization.
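For illustration, the short sketch below implements the uniform scheme (1) in NumPy on synthetic bell-shaped data and compares its empirical error with the $\Delta^2/12$ model of (2); the function name and the synthetic data are our own choices for this example, not part of the original implementation.

```python
import numpy as np

def uniform_quantize(x, bits, r, z):
    # b-bit uniform quantization of Eq. (1) on the range [z, z + r]:
    # scale, round, saturate to {0, ..., 2^b - 1}, then de-quantize.
    levels = 2 ** bits
    s = r / (levels - 1)
    x_q = np.clip(np.round((x - z) / s), 0, levels - 1)
    return s * x_q + z

# Symmetric signed tensor (e.g., weights): offset z = -r/2 with r = 2 * max|x|
w = np.random.normal(scale=0.05, size=100_000)     # bell-shaped synthetic "weights"
r = 2 * np.abs(w).max()
w_hat = uniform_quantize(w, bits=4, r=r, z=-r / 2)

delta = r / (2 ** 4 - 1)
print(np.mean((w - w_hat) ** 2), delta ** 2 / 12)  # empirical MSE vs. the Delta^2 / 12 model
```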

3.2 Piecewise linear quantization scheme

To take advantage of bell-shaped distributions, piecewise linear quantization breaks the quantization range into two non-overlapping regions: the dense, centered region and the sparse, high-magnitude region. We chose to use two regions to maintain simplicity in the inference algorithm and in the hardware implementation.

Therefore, we only consider one breakpoint $p \in (0, m)$ to divide the quantization range into two symmetric regions: the center region $R_1 = [-p, p]$ and the tail region $R_2 = [-m, -p] \cup [p, m]$, where $m$ is the maximum absolute value of the tensor. Both regions consist of a negative piece and a positive piece. Within each of the four pieces, $b$-bit uniform quantization (1) is applied to the magnitude. We then define the piecewise linear quantization scheme as:

    \hat{x} = Q_{pw}(x;\, b, m, p) = \begin{cases} \mathrm{sign}(x)\, Q_u(|x|;\, b,\, p,\, 0), & |x| \le p, \\ \mathrm{sign}(x)\, Q_u(|x|;\, b,\, m - p,\, p), & |x| > p, \end{cases}    (3)

where $\mathrm{sign}(x)$ denotes the sign of $x$. For unsigned values, such as the outputs of ReLU activations, only the two positive pieces of the above scheme are used. The piecewise quantization error is defined as $\varepsilon_{pw} = x - \hat{x}$.
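The following sketch (ours, in NumPy) expresses scheme (3) directly in terms of the uniform quantizer of (1), applying it to the magnitudes of the center and tail pieces and restoring the sign afterwards; the breakpoint here is an arbitrary value chosen only for illustration.

```python
import numpy as np

def uniform_quantize(x, bits, r, z):
    # b-bit uniform quantization of Eq. (1) on the range [z, z + r]
    levels = 2 ** bits
    s = r / (levels - 1)
    return s * np.clip(np.round((x - z) / s), 0, levels - 1) + z

def piecewise_quantize(x, bits, p):
    # Piecewise linear quantization of Eq. (3): b-bit uniform quantization of |x|
    # on the center piece [0, p] and the tail piece [p, m], with the sign restored.
    m, mag, sgn = np.abs(x).max(), np.abs(x), np.sign(x)
    center = sgn * uniform_quantize(mag, bits, r=p, z=0.0)
    tail = sgn * uniform_quantize(mag, bits, r=m - p, z=p)
    return np.where(mag <= p, center, tail)

w = np.random.laplace(scale=0.02, size=100_000)   # long-tailed synthetic "weights"
p = 0.3 * np.abs(w).max()                         # arbitrary breakpoint for illustration
print(np.mean((w - piecewise_quantize(w, 4, p)) ** 2))
```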

Figure 1 shows the comparison between uniform and piecewise linear quantization schemes on the empirical distribution of the layer weights in a pre-trained Inception-v3 model [45].

Figure 1: Quantization of layer weights in the Inception-v3 pre-trained model. Left: $(b{+}2)$-bit uniform quantization. Right: $b$-bit piecewise linear quantization. Both have the same number of quantization levels. The dotted line indicates the breakpoint.

Suppose $x$ has a symmetric probability density function (PDF) $f(x)$ defined on a bounded domain $[-m, m]$ with the cumulative distribution function (CDF) $F(x)$ satisfying $F(-m) = 0$ and $F(m) = 1$. Then, we calculate the expected squared piecewise quantization error from (2) based on the error of each piece:

    \mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p)\right] = \frac{p^2}{12(2^b - 1)^2}\left[F(p) - F(-p)\right] + \frac{(m - p)^2}{12(2^b - 1)^2}\left[F(-p) + 1 - F(p)\right].    (4)

Since $F(-p) = 1 - F(p)$ for a symmetric probability distribution, Equation (4) can be simplified as:

    \mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p)\right] = \frac{1}{12(2^b - 1)^2}\left[p^2\left(2F(p) - 1\right) + 2(m - p)^2\left(1 - F(p)\right)\right].    (5)

The performance of a piecewise quantized model crucially depends on the value of the breakpoint $p$. If $p = m/2$, then $b$-bit piecewise quantization is essentially equivalent to $(b{+}2)$-bit uniform quantization, because the four pieces have equal quantization ranges and bit-widths. If $p < m/2$, the center region has a smaller range and greater precision than the tail region, as shown in Figure 1. Conversely, if $p > m/2$, the tail region has greater precision than the center region. To reduce the overall quantization error for the bell-shaped distributions found in DNNs, we increase the precision in the center region and decrease it in the tail region. Thus, we limit the breakpoint to the range $0 < p \le m/2$.

Accordingly, the optimal breakpoint can be estimated by minimizing the expected squared quantization error:

    p^* = \operatorname*{arg\,min}_{0 < p \le m/2}\ \mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p)\right].    (6)

Since bell-shaped distributions tend to zero as $|x|$ becomes large, we consider a smooth $f(x)$ that is decreasing for positive $x$, i.e., $f'(x) \le 0$ for $x \in (0, m)$. We then prove that the optimization problem (6) is convex with respect to the breakpoint $p$, so that a unique $p^*$ exists that minimizes the quantization error (5), as demonstrated by the following Lemma 1.

Lemma 1.

If $f'(x) \le 0$ for all $x \in (0, m)$, then $\mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p)\right]$ is a convex function of the breakpoint $p$ on $(0, m/2]$.

Proof.

Taking the first and second derivatives of (5) with respect to $p$ yields

    \frac{d}{dp}\,\mathbb{E}\!\left[\varepsilon_{pw}^2\right] = \frac{1}{12(2^b - 1)^2}\left[2p\left(2F(p) - 1\right) - 4(m - p)\left(1 - F(p)\right) + 2m(2p - m)f(p)\right],    (7)

    \frac{d^2}{dp^2}\,\mathbb{E}\!\left[\varepsilon_{pw}^2\right] = \frac{1}{12(2^b - 1)^2}\left[2 + 8m f(p) + 2m(2p - m)f'(p)\right].    (8)

Since $f(p) \ge 0$, $f'(p) \le 0$ and $2p - m \le 0$ for $p \in (0, m/2]$, then $\frac{d^2}{dp^2}\mathbb{E}\!\left[\varepsilon_{pw}^2\right] > 0$. Hence $\mathbb{E}\!\left[\varepsilon_{pw}^2\right]$ is convex. ∎

In practice, we can find the optimal breakpoint by solving the convex problem (6) under the assumption of normalized Gaussian or Laplacian distributions, by approximating the optimal breakpoint in closed form for each distribution. Alternatively, we also provide a simple coarse-to-fine grid search algorithm that makes no assumption about the specific distribution of $x$ (see Section 5.1.1).
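As a minimal numerical sketch of this step (our own, not the authors' implementation), the expected error (5) can be evaluated from an empirical CDF and minimized over $(0, m/2]$ with any bounded one-dimensional solver; here SciPy's bounded scalar minimizer is used.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expected_pw_error(p, x_sorted, m, bits):
    # Eq. (5) with the CDF F(p) estimated empirically from the tensor values
    F_p = np.searchsorted(x_sorted, p) / x_sorted.size
    return (p**2 * (2 * F_p - 1) + 2 * (m - p)**2 * (1 - F_p)) / (12 * (2**bits - 1)**2)

def optimal_breakpoint(x, bits):
    # Convex one-dimensional minimization of Eq. (6) over the interval (0, m/2]
    x_sorted = np.sort(x.ravel())
    m = np.abs(x).max()
    res = minimize_scalar(expected_pw_error, bounds=(1e-8, m / 2), method="bounded",
                          args=(x_sorted, m, bits))
    return res.x

w = np.random.normal(scale=0.02, size=100_000)
print(optimal_breakpoint(w, bits=4))   # roughly one third of max|w| for Gaussian-like data
```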

3.3 Error analysis

Once the optimal breakpoint is found, both Lemma 2 and the numerical simulation in Figure 2 show that $b$-bit piecewise linear quantization achieves a smaller quantization error than $(b{+}2)$-bit uniform quantization, even though both schemes provide the same number of quantization levels. This indicates that piecewise linear quantization has stronger representational power.

Figure 2: Mean squared quantization error (MSE) of the uniform and piecewise linear schemes on the weights of one Inception-v3 layer for various bit-widths ($b = 2, 4, 6$ for the piecewise scheme). The MSE of piecewise linear quantization is convex in the breakpoint $p$; at the optimal breakpoint, the $b$-bit piecewise linear scheme achieves a smaller quantization error than the $(b{+}2)$-bit uniform scheme.
Lemma 2.

Under the relaxation $2^b - 1 \approx 2^b$, we have $\mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p^*)\right] \le \mathbb{E}\!\left[\varepsilon_u^2(b{+}2, 2m)\right]$.

Proof.

Given the symmetric range $[-m, m]$ of length $2m$, the $(b{+}2)$-bit uniform quantization error under the relaxation is

    \mathbb{E}\!\left[\varepsilon_u^2(b{+}2, 2m)\right] = \frac{(2m)^2}{12\,(2^{b+2} - 1)^2} \approx \frac{m^2}{48 \cdot 4^b}.    (9)

For $b$-bit piecewise linear quantization, we solve the convex problem (6) by setting the first derivative (7) equal to zero, and determine that the optimal breakpoint $p^*$ satisfies

    p^*\left(2F(p^*) - 1\right) - 2(m - p^*)\left(1 - F(p^*)\right) + m(2p^* - m)f(p^*) = 0.    (10)

By substituting (10) in (5) and simplifying, we obtain

    \mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p^*)\right] \approx \frac{m}{12 \cdot 4^b}\left[2(m - p^*)\left(1 - F(p^*)\right) + p^*(m - 2p^*)f(p^*)\right].    (11)

Subtracting the above from (9), we obtain

    \mathbb{E}\!\left[\varepsilon_u^2(b{+}2, 2m)\right] - \mathbb{E}\!\left[\varepsilon_{pw}^2(b, m, p^*)\right] \approx \frac{1}{12 \cdot 4^b}\left[\frac{m^2}{4} - m\left(2(m - p^*)\left(1 - F(p^*)\right) + p^*(m - 2p^*)f(p^*)\right)\right] \ge 0,    (12)

which is non-negative because $p^*$ minimizes the expected error (5) over $(0, m/2]$, and at $p = m/2$ the bracketed quantity of (5) evaluates to $m^2/4$, matching the right-hand side of (9) under the relaxation. ∎

Because, in actuality, $2^b - 1$ is slightly less than $2^b$, the error from $b$-bit piecewise linear quantization can be slightly larger than the error from $(b{+}2)$-bit uniform quantization if the distribution is insufficiently bell-shaped, that is, when the distribution is approximately uniform. However, in practice, simulation results show that the optimal $b$-bit piecewise linear quantization error is less than the $(b{+}2)$-bit uniform quantization error for every layer in Inception-v3. Figure 2 shows representative results for one layer; every layer in Inception-v3 produces similar results for 2-, 4-, and 6-bit piecewise linear quantization.
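The comparison in Figure 2 can be reproduced in spirit with the short simulation below (ours), which measures empirical MSE on synthetic Gaussian weights and pairs $b$-bit piecewise quantization with $(b{+}2)$-bit uniform quantization so that both use the same number of quantization levels.

```python
import numpy as np

def uni_q(x, bits, lo, hi):
    # Uniform quantization of Eq. (1) over [lo, hi]
    s = (hi - lo) / (2**bits - 1)
    return s * np.clip(np.round((x - lo) / s), 0, 2**bits - 1) + lo

def pw_q(x, bits, p):
    # Piecewise linear quantization of Eq. (3) with breakpoint p
    m, mag, sgn = np.abs(x).max(), np.abs(x), np.sign(x)
    return np.where(mag <= p, sgn * uni_q(mag, bits, 0.0, p), sgn * uni_q(mag, bits, p, m))

w = np.random.normal(scale=0.02, size=200_000)    # bell-shaped stand-in for layer weights
m = np.abs(w).max()
for b in (2, 4, 6):
    pw = min(np.mean((w - pw_q(w, b, r * m)) ** 2) for r in np.arange(0.05, 0.51, 0.01))
    uni = np.mean((w - uni_q(w, b + 2, -m, m)) ** 2)
    print(f"{b}-bit piecewise MSE: {pw:.3e}   {b + 2}-bit uniform MSE: {uni:.3e}")
```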

4 Hardware impact

In this section, we discuss the hardware requirements for the efficient deployment of DNNs quantized with the piecewise linear scheme. In neural networks, every output can be computed as a vector inner product between a vector $a$ and a vector $w$, which are part of the input activation tensor and part of the weight tensor, respectively. From scheme (1), the approximated versions from uniform quantization are $\hat{a} = s_a a_q + z_a \mathbf{1}$ and $\hat{w} = s_w w_q + z_w \mathbf{1}$, where $a_q$ and $w_q$ are the quantized integer vectors from $a$ and $w$, respectively, $\mathbf{1}$ is an identity vector, and $s_a, s_w$ and $z_a, z_w$ are the associated constant-valued scaling factors and offsets, respectively. An output of uniform quantization follows

    \langle \hat{a}, \hat{w} \rangle = s_a s_w \langle a_q, w_q \rangle + s_a z_w \langle a_q, \mathbf{1} \rangle + c,    (13)

where $\langle \cdot, \cdot \rangle$ is the vector inner product and $c$ denotes the constant terms that can be pre-computed offline. Therefore, a uniformly quantized DNN requires two accumulators: one for the sum of activation and weight products $\langle a_q, w_q \rangle$ and one for the sum of activations $\langle a_q, \mathbf{1} \rangle$.

As mentioned in Section 3.2, piecewise linear quantization breaks the weight range into two regions, which requires separate computation paths for $R_1$ and $R_2$, as they have different scaling factors. For these regions, we set the offsets to $z_1 = 0$ and $z_2 = p$, and denote the scaling factors by $s_1$ and $s_2$ in $R_1$ and $R_2$, respectively. We also define $\langle \hat{a}, \hat{w} \rangle_{R_t}$ as the associated partial vector inner product, and $w_q^{(t)}$ as the associated quantized integer vector obtained from the element-wise absolute value of $w$ in region $R_t$, for $t = 1, 2$. Then $\langle \hat{a}, \hat{w} \rangle_{R_1}$ is computed using the following equation,

    \langle \hat{a}, \hat{w} \rangle_{R_1} = s_a s_1 \left\langle a_q,\ k \odot w_q^{(1)} \right\rangle + c_1,    (14)

and $\langle \hat{a}, \hat{w} \rangle_{R_2}$ has some extra terms as it has the non-zero offset $z_2 = p$,

    \langle \hat{a}, \hat{w} \rangle_{R_2} = s_a s_2 \left\langle a_q,\ k \odot w_q^{(2)} \right\rangle + s_a p \left\langle a_q,\ k \right\rangle + c_2,    (15)

where $\odot$ is the element-wise product of vectors, $k$ is the sign vector of $w$, and $c_1$ and $c_2$ are constant terms, which can be pre-computed similarly to $c$ in (13); the inner products in (14) and (15) run over the indices of the weights belonging to $R_1$ and $R_2$, respectively.

As suggested by equations (14) and (15), one extra accumulation register is needed to sum up the activations that are multiplied by the non-zero offset introduced for the weights in region $R_2$. In short, an efficient hardware implementation of piecewise quantization requires:

  • adders and multipliers similar to uniform quantization,

  • three accumulators: one each for the sums of products in $R_1$ and $R_2$, and another one for the activations paired with weights in $R_2$,

  • and two extra bits of storage per weight: one for the sign and one for indicating the region. Note that the latter bit does not increase the multiply-accumulate (MAC) computation; it is only used to select the appropriate accumulator.

Based on the above, it is clear that more breakpoints would require more accumulation registers and more bits per weight. In addition, applying piecewise linear quantization to both weights and activations requires an accumulator for each combination of activation region and weight region, which translates to more hardware overhead. As a result, we recommend using a single breakpoint and applying piecewise linear quantization only to the weight tensors.
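To make the accumulator structure concrete, the sketch below (our own, purely illustrative, not a hardware design) evaluates equations (14) and (15) with three integer accumulators and folds all scaling factors, offsets and constants in at the end; it is checked against the de-quantized floating-point dot product.

```python
import numpy as np

def pw_inner_product(a_q, w_q, sign, region, s_a, z_a, s1, s2, p):
    # Eqs. (14)-(15): three running accumulators over integer operands; all scaling factors,
    # offsets and pre-computable constants (c1, c2) are applied once at the end.
    acc_r1 = 0   # sum of products for weights in region R1
    acc_r2 = 0   # sum of products for weights in region R2
    acc_a2 = 0   # sum of signed activations paired with R2 weights (non-zero offset p)
    for aq, wq, k, reg in zip(a_q, w_q, sign, region):
        if reg == 0:
            acc_r1 += aq * k * wq
        else:
            acc_r2 += aq * k * wq
            acc_a2 += aq * k
    # Constant terms c1 + c2 (depend only on the weights and offsets, pre-computed offline)
    const = z_a * (s1 * np.sum(sign * w_q * (region == 0)) +
                   s2 * np.sum(sign * w_q * (region == 1)) +
                   p * np.sum(sign * (region == 1)))
    return s_a * s1 * acc_r1 + s_a * s2 * acc_r2 + s_a * p * acc_a2 + const

# Reference check against the de-quantized floating-point dot product
rng = np.random.default_rng(0)
n, b = 64, 4
a_q = rng.integers(0, 2**b, n)
w_q = rng.integers(0, 2**b, n)
sign = rng.choice([-1, 1], n)
region = rng.integers(0, 2, n)
s_a, z_a = 0.02, 0.1
p, m = 0.1, 0.5
s1, s2 = p / (2**b - 1), (m - p) / (2**b - 1)
a_hat = s_a * a_q + z_a
w_hat = sign * np.where(region == 0, s1 * w_q, s2 * w_q + p)
print(np.allclose(pw_inner_product(a_q, w_q, sign, region, s_a, z_a, s1, s2, p), a_hat @ w_hat))
```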

5 Experiments

We evaluate the robustness of the piecewise linear method for post-training quantization on popular networks across several computer vision benchmarks: ImageNet classification [42], and semantic segmentation and object detection on the Pascal VOC challenge [10]. In all experiments, we apply batch normalization folding [23] before quantization. For activations, we follow the profiling strategy in [51] to sample from 512 training images and collect the top-10 median (which we believe is more stable for low-bit quantization) of the min-max ranges at each layer. During inference, we apply quantization after clipping with these ranges. Unless stated otherwise, we quantize all network weights per-channel, and activations as well as pooling layers per-layer. We perform all experiments in the PyTorch library with its pre-trained models.
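The profiling step can be read as follows; the sketch below is our interpretation of the "top-10 median" statistic (the median of the ten most extreme per-image values), not the authors' exact code.

```python
import numpy as np

def clipping_range(per_image_min, per_image_max, k=10):
    # One reading of the "top-10 median" statistic: for each layer, record the per-image
    # min/max activation over the 512 sampled images, then clip to the median of the
    # k smallest minima and the median of the k largest maxima.
    lo = np.median(np.sort(np.asarray(per_image_min))[:k])
    hi = np.median(np.sort(np.asarray(per_image_max))[-k:])
    return lo, hi

# e.g., simulated per-image statistics of a ReLU-based layer over 512 images
maxima = np.random.gamma(4.0, 1.0, size=512)
minima = np.zeros(512)
print(clipping_range(minima, maxima))
```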

5.1 Ablation study on ImageNet

In this section, we conduct experiments on the ImageNet classification challenge [42] and investigate the effect of our proposed piecewise linear quantization method. We evaluate the performance on the ImageNet validation dataset for three popular network architectures: ResNet-50 [18], Inception-v3 [45] and MobileNet-v2 [43]. We use torchvision (https://pytorch.org/docs/stable/torchvision) and its pre-trained models for our experiments.

5.1.1 Optimal breakpoint inference

In order to apply piecewise linear quantization, we first need to find the optimal breakpoint. As stated in Section 3.2, one way is to assume that weights and activations follow Gaussian or Laplacian distributions; we can then use the corresponding PDF/CDF to solve equation (10) numerically by

  • binary search, since (7) is monotonically increasing;

  • or closed-form approximations of the optimal breakpoint for the normalized Gaussian or Laplacian distribution, respectively.

Under the assumption of Gaussian or Laplacian distributions, experimental results show that the approximation of the breakpoint obtains almost the same top-1 accuracy as binary search on the piecewise quantized model. More importantly, the single-shot approximation of the optimal breakpoint is much faster than the binary search. Thus, when tensor values are assumed to be Gaussian or Laplacian, unless stated otherwise we use the piecewise linear method with the approximated optimal breakpoint for the rest of this paper. Piecewise linear quantization under the assumption of a Gaussian or a Laplacian distribution is denoted by PWG or PWL, respectively.

Another alternative is a simple three-stage coarse-to-fine grid search that minimizes the quantization error without any distributional assumption (a sketch is given after this list):

  • search for the best ratio (BR) of the breakpoint to the maximum absolute value in [0.1 : 0.1 : 1.0],

  • refine the search around the best ratio of the previous stage with a step of 0.01,

  • refine again around the best ratio of the second stage with a step of 0.001.
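A compact version of this three-stage search is sketched below (ours); the paper specifies only the step sizes of each stage, so the ±5-step refinement windows around the previous best ratio are our assumption.

```python
import numpy as np

def pw_quant_error(w, p, bits=4):
    # Mean squared error of b-bit piecewise linear quantization (Eq. (3)) at breakpoint p
    m, mag, sgn = np.abs(w).max(), np.abs(w), np.sign(w)
    s1 = p / (2**bits - 1)
    s2 = max(m - p, 1e-12) / (2**bits - 1)        # guard against p = m in the coarse stage
    center = sgn * np.clip(np.round(mag / s1), 0, 2**bits - 1) * s1
    tail = sgn * (np.clip(np.round((mag - p) / s2), 0, 2**bits - 1) * s2 + p)
    return np.mean((w - np.where(mag <= p, center, tail)) ** 2)

def search_breakpoint(w, bits=4):
    # Three-stage coarse-to-fine search over the breakpoint ratio BR = p / max|w|
    m = np.abs(w).max()
    br = min(np.arange(0.1, 1.01, 0.1), key=lambda r: pw_quant_error(w, r * m, bits))
    for step in (0.01, 0.001):
        lo, hi = max(step, br - 5 * step), min(1.0, br + 5 * step)
        br = min(np.arange(lo, hi + step / 2, step), key=lambda r: pw_quant_error(w, r * m, bits))
    return br * m

w = np.random.laplace(scale=0.02, size=100_000)
print(search_breakpoint(w, bits=4))
```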

Compared to the previous two methods PWG and PWL, the coarse-to-fine search (PWS) generally achieves a smaller quantization error at the cost of a longer offline search time. Overall, a smaller quantization error correlates with a lower loss of quantized model performance; however, we point out that a smaller quantization error on weights does not always indicate higher accuracy of the quantized model. In general, finding the optimal breakpoint is a convex problem, and we believe there are other potential solutions that could further improve the results; we leave this for future work.

5.1.2 Comparison to uniform quantization

In Section 3.3, we demonstrated analytically and numerically that the $b$-bit piecewise linear scheme obtains a smaller quantization error than the $(b{+}2)$-bit uniform scheme. We therefore expect $b$-bit piecewise quantization to perform as accurately as, or even outperform, $(b{+}2)$-bit uniform quantization.

We compare these two schemes in Table 1. In this table, weights are quantized per-channel into 2 to 8 bits with the uniform or piecewise linear scheme, and activations are uniformly quantized into 8 bits per-layer. Generally, $b$-bit ($b = 2, \ldots, 6$) piecewise linear quantization achieves higher accuracy than $(b{+}2)$-bit uniform quantization, except for two minor cases of MobileNet-v2 at 2 and 6 bits. When the bit-width goes down to 4 bits, piecewise linear quantization for ResNet-50 and Inception-v3 maintains near full-precision accuracy, within 0.42% of degradation. Note that PWL rarely achieves the highest accuracy, thus we only report PWG and PWS in the rest of the paper.

Overall, the results in Table 1 show that piecewise linear quantization is a more powerful representation scheme in terms of both quantization error and quantized model accuracy, making it a promising alternative to uniform quantization in low bit-width cases. In addition, piecewise linear quantization applies uniform quantization within each piece, hence it can benefit from any tricks that improve uniform quantization performance. Here we discuss two such tricks: bias correction and per-group quantization.

Network Weight 8-bit 7-bit 6-bit 5-bit 4-bit 3-bit 2-bit
Res-50
(76.13)
Uni 76.10 75.93 75.61 71.90 65.48 8.52 0.1
PWG 76.08 76.08 76.12 76.15 75.51 74.13 67.05
PWL 76.07 76.06 76.08 76.16 75.62 73.49 64.21
PWS 76.13 76.02 76.10 75.96 75.71 73.46 63.02
Incep-v3
(77.49)
Uni 77.53 77.34 76.87 74.01 44.28 0.19 0.1
PWG 77.53 77.58 77.51 77.32 77.15 75.33 53.99
PWL 77.55 77.49 77.50 77.39 77.14 75.74 47.53
PWS 77.57 77.52 77.58 77.31 76.83 75.65 49.13
MB-v2
(71.88)
Uni 71.35 70.63 67.76 30.68 11.37 0.13 0.1
PWG 71.68 71.67 71.24 70.98 67.93 49.52 3.46
PWL 71.74 71.59 71.28 70.67 67.95 44.43 4.24
PWS 71.71 71.71 71.30 71.44 68.02 58.92 8.91
Table 1: Comparison results of top-1 accuracy (%) for uniform and piecewise linear quantization on weights. Uni: uniform quantization; PWG/PWL/PWS: piecewise linear quantization with breakpoint found by Gaussian/Laplacian/Search method. Activations are uniformly quantized into 8-bit per-layer. Each bold value indicates the best result obtained from different quantization methods for specified bit-width and network.

Bias correction. An inherent bias in the mean and the variance of the tensor values is observed after the quantization process, and the benefits of correcting this bias term have been demonstrated in [1, 12, 36]. This bias can be compensated by folding correction terms into the scale and the offset [1]. We adopt this idea in our piecewise linear method and show the results in Table 2. Applying bias correction further improves the performance of low-bit quantized models by a large margin, especially for MobileNet-v2 (MB-v2). It allows 4-bit piecewise post-training quantization of MB-v2 to achieve near full-precision accuracy with a drop of 0.63%, which is, to the best of our knowledge, a state-of-the-art result for low-bit quantization of MB-v2.
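One simple rendering of this idea is sketched below (ours): per output channel, the quantized weights are rescaled and shifted so that their mean and standard deviation match those of the original weights, which is in the spirit of folding correction terms into the scale and offset [1], though not necessarily the exact variant used here.

```python
import numpy as np

def quantize_per_channel(w, bits=4):
    # Per-channel uniform quantization (Eq. (1)) with a symmetric range per output channel
    m = np.abs(w).max(axis=(1, 2, 3), keepdims=True)
    r, z = 2 * m, -m
    s = r / (2**bits - 1)
    return s * np.clip(np.round((w - z) / s), 0, 2**bits - 1) + z

def bias_corrected(w, bits=4):
    # Match the per-channel mean and standard deviation of the quantized weights to the
    # original ones; the multiplicative factor folds into the channel scale, the additive
    # part into the offset (a simple form of the correction discussed in [1, 12, 36]).
    w_hat = quantize_per_channel(w, bits)
    mu, mu_hat = w.mean(axis=(1, 2, 3), keepdims=True), w_hat.mean(axis=(1, 2, 3), keepdims=True)
    sd, sd_hat = w.std(axis=(1, 2, 3), keepdims=True), w_hat.std(axis=(1, 2, 3), keepdims=True)
    return (w_hat - mu_hat) * (sd / sd_hat) + mu

w = np.random.normal(scale=0.03, size=(64, 32, 3, 3))
for name, q in (("plain", quantize_per_channel(w)), ("bias-corrected", bias_corrected(w))):
    print(name, "max per-channel mean shift:", np.abs((w - q).mean(axis=(1, 2, 3))).max())
```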

Per-group quantization. Per-channel quantization applies quantization to each output-channel filter individually. This allows more accurate approximations of the original weight tensors and significantly improves network accuracy [25, 28]. We extend this technique and propose per-group quantization, which decomposes each output-channel filter into multiple groups and quantizes them separately to further reduce the quantization error.

To be more specific, we consider a weight tensor $W \in \mathbb{R}^{c_{out} \times c_{in} \times k_w \times k_h}$, where $c_{out}$ and $c_{in}$ are the output and input channel sizes, and $k_w$ and $k_h$ are the kernel width and height, respectively. Different from per-channel quantization applied to each filter tensor $W_i$, per-group quantization is applied to each sub-tensor along the input-channel dimension, which we call a group, where each group consists of $g$ consecutive input channels of one filter and $g$ is the group size. Note that changing from one group to another requires an extra step of changing the scaling factor; to keep this overhead small, we limit the group size.
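A minimal sketch of per-group quantization follows (ours); the group size used here is an arbitrary illustrative value, since the exact group-size rule is not reproduced above.

```python
import numpy as np

def quantize_per_group(w, bits=4, group_size=8):
    # Per-group uniform quantization: each output-channel filter w[i] is split along the
    # input-channel axis into groups of `group_size` channels, each quantized with its own
    # symmetric range as in Eq. (1). The group size is an illustrative choice.
    c_out, c_in = w.shape[:2]
    g = min(group_size, c_in)
    w_hat = np.empty_like(w)
    for i in range(c_out):
        for j in range(0, c_in, g):
            blk = w[i, j:j + g]
            m = np.abs(blk).max() + 1e-12
            r, z = 2 * m, -m
            s = r / (2**bits - 1)
            w_hat[i, j:j + g] = s * np.clip(np.round((blk - z) / s), 0, 2**bits - 1) + z
    return w_hat

w = np.random.normal(scale=0.03, size=(64, 32, 3, 3))
print(np.mean((w - quantize_per_group(w)) ** 2))
```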

A comparison of the $b$-bit (per-channel) and $b$+Gr (per-group) columns in Table 2 shows that per-group quantization indeed generally improves the quantized model, which indicates that input-channel range variations also exist in pre-trained DNNs. In general, a combination of low-bit piecewise linear quantization, bias correction, and per-group quantization on weights achieves minimal loss of full-precision model performance.

Network Weight 4-bit 4+B 4+Gr 4+BGr 2-bit 2+B 2+Gr 2+BGr
Res-50
(76.13)
Uni 65.48 72.75 66.13 73.51 0.10 0.15 0.10 0.20
PWG 75.51 75.94 75.46 76.00 67.05 73.30 67.47 73.87
PWS 75.71 76.04 75.80 76.06 63.02 73.94 63.55 74.38
Incep-v3
(77.49)
Uni 44.28 62.45 49.38 64.91 0.10 0.09 0.08 0.11
PWG 77.15 77.25 77.24 77.37 53.99 66.80 59.44 67.84
PWS 76.83 77.06 76.94 77.09 49.13 62.33 54.64 64.34
MB-v2
(71.88)
Uni 11.37 41.80 13.74 39.90 0.09 0.12 0.10 0.10
PWG 67.93 71.25 67.85 71.18 3.46 48.42 3.91 49.27
PWS 68.02 71.12 68.21 71.21 8.91 57.54 9.82 58.53
Table 2: Effect on top-1 accuracy (%) of bias correction and per-group quantization on weights. For an $n$-bit quantized model, $n$+B: $n$-bit with bias correction; $n$+Gr: $n$-bit with per-group quantization; $n$+BGr: $n$-bit with bias correction and per-group quantization. We refer to Table 1 for the notation of Uni, PWG and PWS, and we use the same notation for the rest of the paper.

5.2 Comparison to existing approaches

In this section, we compare our piecewise linear quantization method with other existing approaches, quoting the reported performance scores from the original papers.

An inclusive evaluation of clipping techniques along with outlier channel splitting (OCS) was presented in [51]. For a fair comparison with these methods, we adopt the same setup of applying per-layer quantization to weights without quantizing the first layer. In Table 3, we show that piecewise linear quantization with a Gaussian breakpoint (no bias correction or per-group tricks) outperforms the best results of the clipping methods combined with OCS at every weight bit-width. Furthermore, OCS needs to change the network architecture and introduces extra overhead, in contrast to the piecewise linear approach.

Network  Bit (W/A)  OCS (0.05)  Clip (Best)  OCS (0.05) + Best Clip  Piecewise (Gaussian)
Res-50 32/32 76.1 76.1 76.1 76.1
8/8 -0.4 (75.7) -0.6 (75.5) -0.7 (75.4) -0.1 (76.0)
7/8 -0.5 (75.6) -0.9 (75.2) -0.6 (75.5) -0.0 (76.1)
6/8 -1.1 (75.0) -1.8 (74.3) -0.9 (75.2) -0.1 (76.0)
5/8 -3.5 (72.6) -6.2 (69.9) -2.7 (73.4) -0.2 (75.9)
4/8 -20.9 (55.2) -13.2 (62.9) -6.8 (69.3) -0.7 (75.5)
Incep-v3 32/32 75.9 75.9 75.9 77.5
8/8 -0.6 (75.3) -1.1 (74.8) -1.0 (74.9) -0.0 (77.5)
7/8 -1.2 (74.7) -2.7 (73.2) -1.7 (74.2) +0.1 (77.6)
6/8 -3.8 (72.1) -9.7 (66.2) -3.4 (72.5) -0.1 (77.4)
5/8 -15.7 (60.2) -35.4 (40.5) -13.0 (62.9) -0.3 (77.2)
4/8 -75.3 (0.6) -74.3 (1.6) -71.1 (4.8) -2.0 (75.5)
Table 3: Comparison results of per-layer piecewise linear quantization method and best clipping with OCS [51] on top-1 accuracy (%) loss. W/A indicate the bit-width of weight quantization and activation quantization, respectively. The accuracy difference values are measured from the full-precision (32/32) result.

In general, the MAC unit for $b$-bit piecewise linear quantization requires a $(b{+}1)$-bit weight operand and $(b{+}2)$ bits to store each weight. The extra operand bit is due to the sign. If piecewise linear quantization is applied to unsigned tensors, such as ReLU-based activations, this extra bit is not needed. In order to fairly compare piecewise linear quantization with other $b$-bit quantization methods in prior work, we consider two cases: PW-Comp and PW-Stor. PW-Comp refers to a piecewise linear approach under the same bit-width of MAC operands, which uses $(b{-}1)$ bits per piece for signed tensors and $b$ bits per piece for unsigned ones. PW-Stor refers to the case where each weight occupies only $b$ bits of storage, which uses $(b{-}2)$ and $(b{-}1)$ bits per piece for signed and unsigned tensors, respectively.

In Table 4, we provide a comprehensive comparison of the piecewise linear method with other existing quantization methods on ResNet-50 and Inception-v3. Here we apply per-layer quantization on activations and combine bias correction with per-group quantization on weights. In terms of the same bit-width of computation, Table 4 demonstrates that PW-Comp always achieves the best top-1 accuracy. On the contrary, if we restrict the quantized model to the same storage, PW-Stor is generally not optimal compared with analytical clipping for integer quantization (ACIQ) [1] and low-bit quantization (LBQ) [6]. However, ACIQ applies per-channel bit allocation for both weights and activations, which mainly reduces the memory footprint but does not necessarily make the hardware implementation more efficient. LBQ maps one original tensor to multiple quantized tensors, which introduces extra computation and storage simultaneously. We emphasize that our piecewise linear method is simple and efficient: it achieves the desired accuracy at the small cost of a few more accumulators per MAC unit. More importantly, it is applicable on top of other methods.

Network W/A PW-Comp PW-Stor QWP [25] ACIQ [1] LBQ [6] SSBD [34] QRD [28] UNIQ [2] DFQ [36] FBias [12]
Res-50 32/32 76.13 76.13 75.20 76.10 76.01 75.20 - 76.02 - -
8/8 -0.02(76.11) -0.01(76.12) -0.10(75.10) - - -0.25(74.95) - - - -
4/8 -0.40(75.73) -1.75(74.38) -21.2(54.00) -0.80(75.30) -1.03(74.98) - - -2.56(73.37) - -
4/4 -1.25(74.88) -6.76(69.37) - -2.30(73.80) -3.41(72.60) - - - - -
Incep-v3 32/32 77.49 77.49 78.00 77.20 76.23 77.90 77.97 - - -
8/8 +0.08(77.57) +0.09(77.58) 0.0(78.00) - - -0.03(77.87) -0.09(77.88) - - -
4/8 -0.98(76.51) -9.65(67.84) -7.00(71.00) -9.00(68.20) -1.44(74.79) - - - - -
4/4 -2.11(75.38) -21.57(55.92) - -10.8(66.40) -4.62(71.61) - - - - -
MB-v2 32/32 71.88 71.88 71.90 - - 71.80 71.23 - 71.72 71.80
8/8 -0.17(71.71) -0.22(71.66) -2.10(69.80) - - -0.61(71.19) -1.68(69.55) - -0.53(71.19) -1.20(70.60)
4/8 -2.64(69.24) -13.35(58.53) -71.80(0.10) - - - - - - -
Table 4: Comparison of piecewise linear quantization and other methods on top-1 accuracy (%) loss. PW-Comp: piecewise linear quantization with the same computation cost; 8/8: 7-bit piecewise (7-pw) linear scheme on weights and 8-bit uniform (8-uni) scheme on activations, 4/8: 3-pw/8-uni, 4/4: 3-pw/4-pw. PW-Stor: piecewise linear quantization with the same storage requirement; 8/8: 6-pw/8-uni, 4/8: 2-pw/8-uni, 4/4: 2-pw/3-pw. Weights are quantized per-group with bias correction; activations are quantized per-layer.

We also show the capability of the piecewise linear scheme for low-bit quantization of MobileNet-v2 in Table 4. Under the same bit-width of computation or the same storage requirement among all the methods, our piecewise linear approach (PW-Comp or PW-Stor) combined with bias correction and per-group quantization achieves the best accuracy in both the 8/8 and 4/8 cases.

5.3 Other applications

To show the robustness and applicability of our proposed approach, we extend the piecewise idea to other computer vision tasks including semantic segmentation on DeepLab-v3+ [4] and object detection on SSD [31].

5.3.1 Semantic segmentation

In this section, we apply piecewise linear quantization to DeepLab-v3+ with a MobileNet-v2 backbone. The performance is evaluated with mean intersection over union (mIoU) on the Pascal VOC segmentation challenge [10].

In our experiments, we utilize the implementation of a public PyTorch repository (https://github.com/jfzhang95/pytorch-deeplab-xception) and evaluate the performance with its pre-trained model. After folding batch normalization into the pre-trained model, we found that the weight ranges of several layers become very large (e.g., [-54.4, 64.4]). Considering that the quantization range [24], especially in the early layers [6], has a profound impact on the performance of quantized models, we fix the configuration of some early layers in the backbone. More precisely, we apply 8-bit piecewise quantization on three depth-wise convolution layers with large ranges in all configurations shown in Table 5. Note that the MAC operations of these three layers are negligible in practice, since they contribute only 0.2% of the entire network computation, but this is remarkably beneficial to the performance of low-bit quantized models.

As noticed in the classification models, 4-bit uniform quantization causes a significant performance drop from the full-precision version. In Table 5, applying the piecewise linear method combined with bias correction and per-group quantization, the model with 4-bit weights attains only 0.58% degradation from the pre-trained model, even outperforming 8-bit DFQ [36] in terms of accuracy loss, indicating the potential of 4-bit post-training quantization for the semantic segmentation task.

Network W/A 32/32 8/8 6/8 4/8
DeepLab-v3+
(mIoU%)
Uni 70.81 -0.66(70.15) -1.18(69.63) -19.52(51.29)
PWG 70.81 -0.21(70.60) -0.30(70.51) -1.01(69.80)
PWS 70.81 -0.12(70.69) -0.28(70.53) -0.58(70.23)
DFQ [36] 72.94 -0.61(72.33) - -
Table 5: Uniform and piecewise linear quantization of DeepLab-v3+ (backbone MobileNet-v2). The performance is evaluated with mean intersection over union (mIoU) on the Pascal VOC segmentation challenge. Weights are quantized per-group with bias correction, activations are quantized per-layer into 8-bit.

5.3.2 Object detection

We also test the piecewise linear scheme on the object detection task. The experiments are performed on a public PyTorch implementation (https://github.com/qfgaohao/pytorch-ssd) of the SSD-Lite version of SSD [31] with a MobileNet-v2 base network. The performance is evaluated with mean average precision (mAP) on the Pascal VOC object detection challenge [10].

Table 6 compares the mAP scores of quantized models using the uniform scheme and the piecewise linear scheme. Similar to the classification and segmentation tasks, even with bias correction and per-group quantization, 4-bit uniform quantization causes a 3.90% performance drop from the full-precision model, while 4-bit piecewise linear quantization reduces this notable gap to 0.23%.

Network W/A 32/32 8/8 6/8 4/8
SSD-Lite
(mAP%)
Uni 68.70 -0.24(68.46) -0.47(68.23) -3.90(64.80)
PWG 68.70 -0.19(68.51) -0.21(68.49) -0.23(68.47)
PWS 68.70 -0.25(68.45) -0.27(68.43) -0.35(68.35)
DFQ [36] 68.47 -0.56(67.91) - -
Table 6: Uniform and piecewise linear quantization of SSD-Lite (backbone MobileNet-v2). The performance is evaluated with mean average precision (mAP) on the Pascal VOC object detection challenge. Weights are quantized per-group with bias correction, activations are quantized per-layer into 8-bit.

6 Conclusion

In this work, we present a piecewise linear approximation for post-training quantization of deep neural networks. It breaks the bell-shaped distribution of values into two regions per tensor, where each region is assigned an equal number of quantization levels. We further analyze the resulting quantization error as well as the hardware requirements. We show that our approach achieves near-lossless 4-bit post-training quantization performance on image classification, semantic segmentation and object detection tasks. Therefore, it enables efficient and rapid deployment of computer vision applications on resource-limited devices.

Acknowledgments. We would like to thank Hui Chen and Jong Hoon Shin for valuable discussions.

References

  • [1] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry (2018) Post training 4-bit quantization of convolution networks for rapid-deployment. arXiv preprint arXiv:1810.05723. Cited by: §2, §5.1.2, §5.2, Table 4.
  • [2] C. Baskin, E. Schwartz, E. Zheltonozhskii, N. Liss, R. Giryes, A. M. Bronstein, and A. Mendelson (2018) UNIQ: uniform noise injection for non-uniform quantization of neural networks. arXiv preprint arXiv:1804.10969. Cited by: §2, Table 4.
  • [3] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926. Cited by: §1.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1, §5.3.
  • [5] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §2.
  • [6] Y. Choukroun, E. Kravchik, and P. Kisilev (2019) Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822. Cited by: §1, §2, §3.1, §5.2, §5.3.1, Table 4.
  • [7] M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §2.
  • [8] G. S. Dhillon, K. Azizzadenesheli, Z. C. Lipton, J. Bernstein, J. Kossaifi, A. Khanna, and A. Anandkumar (2018) Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442. Cited by: §2.
  • [9] X. Dong, J. Huang, Y. Yang, and S. Yan (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848. Cited by: §2.
  • [10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §5.3.1, §5.3.2, §5.
  • [11] J. Faraone, N. Fraser, M. Blott, and P. H. Leong (2018) Syq: learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4300–4309. Cited by: §1.
  • [12] A. Finkelstein, U. Almog, and M. Grobman (2019) Fighting quantization bias with bias. arXiv preprint arXiv:1906.03193. Cited by: §2, §5.1.2, Table 4.
  • [13] G. Georgiadis (2019) Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7085–7095. Cited by: §2.
  • [14] Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2.
  • [15] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §2.
  • [16] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §2.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §5.1.
  • [19] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §2.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
  • [21] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1.
  • [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
  • [23] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §2, §2, §3.1, §5.
  • [24] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359. Cited by: §1, §5.3.1.
  • [25] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1, §1, §2, §5.1.2, Table 4.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [27] W. Lai, J. Huang, N. Ahuja, and M. Yang (2018) Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [28] J. H. Lee, S. Ha, S. Choi, W. Lee, and S. Lee (2018) Quantization for rapid deployment of deep neural networks. arXiv preprint arXiv:1810.05488. Cited by: §1, §2, §5.1.2, Table 4.
  • [29] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §2.
  • [30] D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §1, §2, §2, §3.1.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §5.3.2, §5.3.
  • [32] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §2.
  • [33] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §2.
  • [34] E. Meller, A. Finkelstein, U. Almog, and M. Grobman (2019) Same, same but different-recovering neural network quantization error through weight factorization. arXiv preprint arXiv:1902.01917. Cited by: §2, Table 4.
  • [35] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740. Cited by: §1.
  • [36] M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. arXiv preprint arXiv:1906.04721. Cited by: §1, §2, §5.1.2, §5.3.1, Table 4, Table 5, Table 6.
  • [37] A. Polino, R. Pascanu, and D. Alistarh (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.
  • [38] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §2.
  • [39] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [41] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §5.1, §5.
  • [43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2, §5.1.
  • [44] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
  • [45] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1, §3.2, §5.1.
  • [46] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §1, §2.
  • [47] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng (2016) Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4820–4828. Cited by: §2.
  • [48] Y. You (2010) Audio coding: theory and applications. Springer Science & Business Media. Cited by: §3.1.
  • [49] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §1.
  • [50] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §1, §2.
  • [51] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang (2019) Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552. Cited by: §1, §1, §2, §5.2, Table 3, §5.
  • [52] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044. Cited by: §1.
  • [53] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §2.
  • [54] Y. Zhou, S. Moosavi-Dezfooli, N. Cheung, and P. Frossard (2018) Adaptive quantization for deep neural network. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [55] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.