FracBits: Mixed Precision Quantization via Fractional Bit-Widths

07/04/2020 ∙ by Linjie Yang, et al. ∙ ByteDance Inc.

Model quantization helps to reduce the model size and latency of deep neural networks. Mixed precision quantization is favorable with customized hardware that supports arithmetic operations at multiple bit-widths to achieve maximum efficiency. We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target computation constraints and model sizes. During the optimization, the bit-width of each layer/kernel in the model is fractional, lying between two consecutive bit-widths, and can be adjusted gradually. With a differentiable regularization term, the resource constraints can be met during quantization-aware training, which results in an optimized mixed precision model. Further, our method can be naturally combined with channel pruning for better computation cost allocation. Our final models achieve comparable or better performance than previous quantization methods with mixed precision on MobileNet V1/V2 and ResNet18 under different resource constraints on the ImageNet dataset.


1 Introduction

Neural network quantization [3, 4, 12, 15, 21, 22, 23, 25, 31, 32] has attracted a large amount of attention due to the resource and latency constraints in real applications. Recent progress on neural network quantization has shown that the performance of quantized models can be as good as full precision models under moderate target bit-widths such as 4 bits [12]. Customized hardware can be configured to support multiple bit-widths for neural networks [11]. In order to fully exploit the power of model quantization, mixed precision quantization strategies have been proposed to strike a better balance between computation cost and model accuracy. With more flexibility to distribute the computation budget across layers [4, 12, 25], or even weight kernels [15], quantized models with mixed precision usually achieve more favorable performance than those with uniform precision.

Current approaches for mixed precision quantization usually borrow ideas from the neural architecture search (NAS) literature. Suppose we have a neural network in which each convolution layer consists of several branches, where each branch is the quantized convolution at a different bit-width. Finding the best configuration for a mixed precision model can then be achieved by preserving a single branch for each convolution layer and pruning all other branches, which is conceptually equivalent to recent NAS algorithms that search for sub-networks within a supergraph [2, 20, 24, 26]. ENAS [20] and SNAS [26] employ reinforcement learning (RL) to learn a policy that samples network blocks from a supergraph. ReLeQ [4] and HAQ [23] follow this footprint and employ reinforcement learning to choose layer-wise bit-width configurations for a neural network. AutoQ [15] further optimizes the bit-width of each convolution kernel using a hierarchical RL strategy. ProxylessNAS [2] and FBNet [24] adopt a path sampling method to jointly learn model weights and importance scores of each operation in the supergraph. DNAS [25] directly reuses this path sampling method and adds a regularization term proportional to the computation cost or model size, in order to discover mixed precision models with a good trade-off between computational resources and accuracy. Uniform Sampling (US) [7] uses uniform sampling to draw subnetworks from the supergraph during training and then searches for pruned or quantized models using an evolutionary algorithm.

Methods compared: HAQ [23], ReLeQ [4], AutoQ [15], DNAS [25], US [7], DQ [22], and FracBits (ours).
Properties compared: differentiable search; one-shot training; support for kernel-wise quantization; support for channel pruning.
Table 1: A comparison of our approach and previous mixed precision quantization algorithms. Our method FracBits achieves one-shot differentiable search and supports kernel-wise quantization and pruning.

However, previous approaches to mixed precision quantization mostly adopt NAS algorithms directly and do not leverage the specific properties of quantized models. Different from NAS and model pruning, the quantitative difference between weights and activations quantized with similar bit-widths is small. For example, choosing 4 or 5 bits for one weight matrix only introduces a small difference in the quantized values (on the order of one quantization step), assuming weights are uniformly distributed and a linear quantization scheme is used. Thus the transition from one bit-width to its neighboring bit-widths can be considered a differentiable operation with appropriate parameterization. Recently, DQ [22] utilized Straight-Through Estimation [1] to facilitate differentiable bit-switching by treating the bit-width of each layer as a continuous parameter. Here, we propose a new approach that treats bit-widths as continuous values by interpolating the quantized weights or activation values of the two neighboring integer bit-widths. Such an approach facilitates an efficient one-shot differentiable optimization procedure for mixed precision quantization. By allocating differentiable bit-widths to layers or kernels, it enables both layer-wise and kernel-wise quantization. A high-level comparison of our method and previous mixed precision methods is shown in Table 1.

In summary, the contribution of this work is threefold.

  • We propose a fractional bit-width formulation that creates a smooth transition between neighboring quantized bit-widths of network weights and activations, facilitating differentiable search along the layer-wise or kernel-wise precision dimension.

  • Our mixed precision quantization algorithm only needs one-shot training of the network, greatly reducing the exploration cost for resource-constrained tasks.

  • Our simple and straightforward formulation is readily applicable to different quantization schemes. We show superior performance to uniform precision approaches and previous mixed precision approaches on a wide range of model variants and with different quantization schemes.

2 Related Work

Quantized Neural Networks Previous quantization techniques can be categorized into two types. The first type, post-training quantization, directly quantizes the weights and activations of a pretrained full-precision model into lower bit-widths [13, 18]. These methods typically suffer from significant performance degradation, as the training process is unaware of the quantization procedure. The second type, quantization-aware training, incorporates quantization into the training stage. Early studies in this direction employ a single precision for the whole neural network. For example, DoReFa [32] proposes to transform the unbounded weights into a finite interval to reduce the undesired quantization error introduced by infrequent large outliers. PACT [3] investigates the effect of clipping activations in different layers, finding that the optimal clipping levels are layer-dependent. SAT [12] investigates the gradient scales in training with quantized weights, and further improves model performance by adjusting weight scales. In another direction, some works assign different bit-widths to different layers or kernels, enabling more flexible computation budget allocation. The first attempts employ reinforcement learning with rewards from memory and computational cost estimated by formulas [4] or simulators [23]. AutoQ [15] modifies the training procedure into a hierarchical strategy, resulting in fine-grained kernel-wise quantization. However, these RL strategies need to sample and train a large number of model variants, which is very resource-demanding. DNAS [25] resorts to a differentiable strategy by constructing a supernet in which each layer is a linear combination of outputs from different bit-widths. However, due to the discrepancy between the search process and the final configuration, it still needs to retrain the discovered model candidates. To further improve search efficiency, we propose a one-shot differentiable search method with fractional bit-widths. Owing to the smooth transition between the fractional bit-widths and the final integer bit-widths, our method embeds the bit-width search and model finetuning stages in a single pass of model training. Meanwhile, our technique supports kernel-wise quantization with channel pruning in the same framework by assigning 0 bits to pruned channels, similar to [15] but through a differentiable approach with a much lower search cost. It is also orthogonal to Uniform Sampling (US) [7] for joint quantization and pruning, which trains a supernet by uniform sampling and searches for good sub-architectures with an evolutionary algorithm.

Network Pruning Network pruning is an approach orthogonal to quantization for speeding up inference of neural networks. Early work [8] compresses bulky models by learning connections together with weights, which produces unstructured connectivity in the final network. Later, structured compression by kernel-wise [16] or channel-wise [5, 9, 14, 27] pruning was proposed, where the learned architecture is more amenable to acceleration on modern hardware. As an example, [14] identifies and prunes insignificant channels in each layer by penalizing the scaling factor of the batch normalization layer. More recently, NAS algorithms have been leveraged to guide network pruning. [28] presents a one-shot search algorithm that greedily slims a pretrained slimmable neural network [29]. [17] proposes a one-shot resource-aware search algorithm using FLOPs as an L1 regularization term on the scaling factor of the batch normalization layer. We adopt a similar strategy and use BitOPs or model size as an L1 regularization term, computed from the trainable fractional bit-widths in our framework.

3 Mixed Precision Quantization

In this section, we introduce our proposed method for mixed precision quantization. Our one-shot training pipeline involves two steps: bit-width searching and finetuning. We first introduce the implementation of fractional bit-widths and the integration of resource constraints into the searching process. After that, we describe the implementation of kernel-wise mixed precision jointly with channel pruning.

3.1 Searching with fractional bit-widths

In order to learn bit-widths dynamically in one-shot training, it is necessary to make them differentiable and to define their derivatives accordingly. To this end, we first examine a generic operation $q_b(x)$ that quantizes a value $x$ to $b$ bits. Typically, $q_b$ is well-defined only for positive integer values of $b$. To generalize the bit-width to an arbitrary positive real number $b$, we apply a first-order expansion around one of its nearby integers, and approximate the derivative at this integer by the slope of the segment joining the two adjacent grid points neighboring $b$. Such a linear interpolation reads

$$q_b(x) \approx (\lceil b \rceil - b)\, q_{\lfloor b \rfloor}(x) + (b - \lfloor b \rfloor)\, q_{\lceil b \rceil}(x), \tag{1}$$

where $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ denote the floor and ceiling functions, respectively. In other words, we can approximate an operation with a fractional bit-width by a linear combination of two operations with integer bit-widths, thus naturally achieving differentiability and making the bit-width learnable through typical gradient-based optimization such as SGD. Note that the approximation in Eq. (1) turns into a strict equality if the original operation is linear in $b$ or if $b$ takes an integer value. The basic idea is illustrated in Fig. 1. In Eq. (1), the two rounding functions, floor and ceiling, have vanishing gradients with respect to their argument, and thus the derivative of Eq. (1) with respect to the bit-width is given by

$$\frac{\partial q_b(x)}{\partial b} = q_{\lceil b \rceil}(x) - q_{\lfloor b \rfloor}(x). \tag{2}$$

The difference between such a linear interpolation scheme and the widely adopted straight-through estimation (STE) [1] is that it uses soft bit-widths in both forward and backward propagation, rather than hard bit-widths in the forward pass and soft bit-widths in back-propagation, as adopted by [22]. In this way, the computed gradient reflects the true direction along which the network parameters need to evolve, which results in better convergence.
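To make the interpolation concrete, the following is a minimal scalar sketch (our illustration, not the authors' released code) of Eqs. (1) and (2), assuming a simple linear quantizer on [0, 1]:

```python
import math

def quantize_int(x, b):
    """Linearly quantize x in [0, 1] onto the 2**b levels of a b-bit code (b a positive integer)."""
    s = 2 ** b - 1
    return round(x * s) / s

def quantize_frac(x, b):
    """Eq. (1): quantization with a fractional bit-width b, as a linear
    interpolation between the two neighboring integer bit-widths."""
    lo, hi = math.floor(b), math.floor(b) + 1
    w_hi = b - lo                       # interpolation weight of the higher bit-width
    return (1 - w_hi) * quantize_int(x, lo) + w_hi * quantize_int(x, hi)

def dquantize_frac_db(x, b):
    """Eq. (2): the derivative w.r.t. b is the slope of the interpolating segment,
    i.e. the difference of the two neighboring integer-bit quantizations."""
    lo = math.floor(b)
    return quantize_int(x, lo + 1) - quantize_int(x, lo)

# Example: the gap between 4- and 5-bit quantization of a single value is small,
# which is what makes the transition between neighboring bit-widths smooth.
print(quantize_frac(0.37, 4.3), dquantize_frac_db(0.37, 4.3))
```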

Figure 1: Our proposed differentiable bit-width search: searching with fractional bit-widths and finetuning with mixed-precision quantization.

Throughout, we adopt the DoReFa scheme for weight quantization and the PACT scheme for activation quantization. The quantization function is the same for both, defined as

$$q_b(x) = \frac{\lfloor x \cdot s \rceil}{s}, \tag{3}$$

where $x \in [0, 1]$, $\lfloor \cdot \rceil$ indicates rounding to the nearest integer, and $s$ equals $2^b - 1$, where $b$ is the quantization bit-width. Thus, for both weight and activation quantization, $q_b$ is defined for integer bit-widths, and quantization with fractional bit-widths is implemented with Eq. (1). Weight quantization applies $q_{b_w}$ to the transformed weight $\tilde{w}$, which is clamped to the interval $[0, 1]$; activation quantization is given by $\alpha\, q_{b_a}(\tilde{a}/\alpha)$, where $\alpha$ is a learnable parameter and $\tilde{a}$ is the original activation clipped at $\alpha$. Here $b_w$ and $b_a$ are the learnable fractional bit-widths for weights and activations, respectively. It is also possible to privatize a bit-width for each kernel, enabling kernel-wise mixed precision quantization, as discussed later in Section 3.4.
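A tensor-level sketch of how these quantizers might be combined with learnable fractional bit-widths is given below. It assumes PyTorch; the exact weight transform and clipping follow the DoReFa/PACT style described above, and the straight-through trick for the rounding is our assumption of a standard implementation choice, not a detail stated in the text:

```python
import torch

def q(x, s):
    """Eq. (3) with s = 2**b - 1: round x in [0, 1] onto a uniform grid; the
    straight-through estimator lets gradients flow to x (and to alpha below)."""
    xq = torch.round(x * s) / s
    return x + (xq - x).detach()

def q_frac(x, b):
    """Eq. (1) applied to Eq. (3): interpolate the quantizers at floor(b) and floor(b)+1.
    The gradient w.r.t. b is q_hi - q_lo, i.e. Eq. (2), since floor has zero gradient."""
    lo = torch.floor(b)
    w_hi = b - lo
    return (1 - w_hi) * q(x, 2 ** lo - 1) + w_hi * q(x, 2 ** (lo + 1) - 1)

def quantize_weight(w, b_w):
    """DoReFa-style weight path (sketch): map weights to [0, 1], quantize, map back to [-1, 1]."""
    wt = torch.tanh(w) / (2 * torch.tanh(w).abs().max()) + 0.5
    return 2 * q_frac(wt, b_w) - 1

def quantize_act(a, b_a, alpha):
    """PACT-style activation path (sketch): clip at the learnable bound alpha, quantize, rescale."""
    ac = torch.minimum(torch.relu(a), alpha)
    return alpha * q_frac(ac / alpha, b_a)
```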

During the searching stage, the precision assigned to each layer or each kernel is still undetermined, and we want to find the optimal bit-width structure through training. By initializing each bit-width with some arbitrary value, we can use Eq. (1) to quantize the weights and activations in the model with fractional bit-widths. Meanwhile, this allows us to assign different bit-widths to different layers or even kernels, as well as to furnish separate precisions for weight and activation quantization. During training, the model gradually converges to optimal bit-widths for the weights and activations of each unit, enabling quantization with mixed precision.

3.2 Resource constraint as penalty loss

Restricting storage or computation cost is essential for model quantization, as the original purpose of quantization is to save resource consumption when deploying bulky models on portable devices or embedded systems. To this end, previous works constrain different metrics during the optimization procedure, including memory footprint [22], model size [22, 23], BitOPs [7, 25], and even estimated latency or energy [15, 23]. Here, we focus on the model size in bits (or bytes) for weight-only quantization, and on the number of BitOPs for quantization of both weights and activations, as they can be directly calculated from the assigned bit-widths. Latency and energy consumption [15, 23] may seem to be more practical measures for real applications. However, we argue that BitOPs is also a good metric since it is determined solely by the model itself rather than by particular configurations of hardware, simulators, and compilers, which guarantees a fair comparison between approaches and promotes reproducible research.

Weight-only quantization targets shrinking the model size, while floating-point operations are still needed during inference. Model size is usually expressed as the number of bits required to store the weights (and biases) of the model. For a weight quantized to $b$ bits, the size is simply $b$. Since the size is linear in the bit-width, the generalized model size for a fractional bit-width is also given by $b$ itself. The size of the whole model is obtained by summing over all weights in the model. Note that the bit-width can be shared among all weights in a layer or along each kernel (as discussed later in Section 3.4), corresponding to layer-wise or kernel-wise quantization, respectively. For example, for a typical 2D convolution layer (without grouping) sharing the same fractional bit-width $b_w$ among all weights, the size is given by $b_w\, c_{in}\, c_{out}\, k_w\, k_h$, where $c_{in}$ is the number of input channels, $c_{out}$ is the number of output channels, and $k_w$ and $k_h$ represent the horizontal and vertical kernel sizes, respectively.
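Under these definitions, the size accounting for a single convolution layer is just a product of its dimensions and its (possibly fractional) bit-width; a small sketch with hypothetical layer dimensions:

```python
def conv_weight_size_bits(b_w, c_in, c_out, k_w, k_h):
    """Model size in bits of a 2D conv layer whose weights share one fractional
    bit-width b_w; since size is linear in the bit-width, Eq. (1) reduces to
    using b_w directly."""
    return b_w * c_in * c_out * k_w * k_h

# Example (hypothetical layer): 3x3 conv, 128 -> 256 channels, 4.3 fractional bits.
size_mb = conv_weight_size_bits(4.3, 128, 256, 3, 3) / 8 / 1e6   # bits -> MB
print(round(size_mb, 3))
```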

Quantization of both weights and activations effectively decreases the computation cost in real applications, which can be measured by the number of BitOPs involved in multiplications. Suppose a weight value and an activation value involved in a multiplication are quantized to $b_w$ bits and $b_a$ bits, respectively. The number of BitOPs for such a multiplication is

$$\mathrm{BitOPs}(b_w, b_a) = b_w\, b_a. \tag{4}$$

This expression is bilinear in $b_w$ and $b_a$, which means that for fractional bit-widths $b_w$ and $b_a$, Eq. (1) leads to the same expression,

$$\mathrm{BitOPs}(b_w, b_a) = b_w\, b_a. \tag{5}$$

The total computation cost of the model is the sum over all weights and activations. For the example of a 2D convolution layer, if all weights share the same fractional bit-width $b_w$ and all input activations share the same fractional bit-width $b_a$, the number of BitOPs is given by $b_w\, b_a\, c_{in}\, c_{out}\, k_w\, k_h\, w_{out}\, h_{out}$, where $w_{out}$ and $h_{out}$ represent the horizontal and vertical sizes of the output features, respectively.
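Analogously, the BitOPs accounting for one convolution layer under shared fractional bit-widths can be sketched as follows (dimensions are hypothetical):

```python
def conv_bitops(b_w, b_a, c_in, c_out, k_w, k_h, w_out, h_out):
    """BitOPs of a 2D conv layer (no grouping): each multiplication of a b_w-bit
    weight with a b_a-bit activation costs b_w * b_a bit operations (Eqs. (4)-(5));
    bilinearity keeps the expression exact for fractional bit-widths."""
    num_mults = c_in * c_out * k_w * k_h * w_out * h_out
    return b_w * b_a * num_mults

# Example (hypothetical layer): 3x3 conv, 128 -> 256 channels, 14x14 output.
print(conv_bitops(4.3, 5.1, 128, 256, 3, 3, 14, 14) / 1e9, "GBitOPs")
```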

Targeting a prescribed objective With the constraints defined properly, we can penalize them to enable constraint-aware optimization. Here, we directly define the penalty term as the L1 difference from a target constraint value,

$$\mathcal{L}_{size} = \Big| \sum \mathrm{size}(b_w) - S_0 \Big|, \tag{6a}$$
$$\mathcal{L}_{comp} = \Big| \sum \mathrm{BitOPs}(b_w, b_a) - C_0 \Big|, \tag{6b}$$

where $S_0$ and $C_0$ denote the target constraints for model size and computation cost, respectively. The sum is taken over all weights in the model for the model-size-constrained optimization, and over all weights and activations for the computation-cost-constrained case.

Adding the penalty term to the original loss $\mathcal{L}_{task}$ (such as the cross entropy for a classification task) with a coefficient $\lambda$, we arrive at the total loss for optimization,

$$\mathcal{L} = \mathcal{L}_{task} + \lambda\, \mathcal{L}_{size}, \tag{7a}$$
$$\mathcal{L} = \mathcal{L}_{task} + \lambda\, \mathcal{L}_{comp}. \tag{7b}$$

It should be noted that the value of $\lambda$ depends on the unit of the constraints. Throughout the paper, we measure model size in MB (megabytes) and computation cost in GBitOPs (billions of BitOPs). In this way, the desired resource constraint can be reached in the joint optimization of model parameters and bit-widths. Note that recent concurrent work [19] adopts a similar approach to mixed precision quantization with L1 regularization on the bit-widths of weights and activations, while here we explicitly define the loss as a function of the computational cost in BitOPs or the model size in bytes and incorporate the target constraint directly into the loss.
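A minimal sketch of this constraint-aware loss, assuming PyTorch and assuming the differentiable resource estimate has already been summed over the network from the per-layer expressions above:

```python
import torch

def resource_penalty(resource, target):
    """Eq. (6): L1 distance between the differentiable resource estimate
    (a tensor built from the fractional bit-widths) and its target value."""
    return torch.abs(resource - target)

def total_loss(task_loss, resource, target, lam):
    """Eq. (7): task loss (e.g. cross entropy) plus the weighted resource
    penalty; lam plays the role of the coefficient lambda in the text."""
    return task_loss + lam * resource_penalty(resource, target)
```

Because the resource estimate is differentiable in the bit-widths, minimizing this total loss jointly drives accuracy and adherence to the constraint during the searching stage.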

3.3 Finetuning with mixed precision

After searching, we freeze the bit-widths by rounding them to the nearest integer values and disabling their gradients. This way, each layer or each kernel has its individual bit-widths for weights and activations learned in the previous stage, and training enters the finetuning stage, which only updates the model weights. The ratio of training epochs allocated to searching and finetuning is a hyper-parameter that can be freely specified. In practice, we assign 80% of the training epochs to searching and 20% to finetuning. We want to emphasize that the combination of searching and finetuning constitutes the whole training procedure, and the total number of epochs of the two stages is the same as in a traditional quantization-aware training procedure. Thus, our training method is one-shot, without extra retraining steps.
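The switch from searching to finetuning amounts to rounding and freezing the bit-width parameters; a sketch of that step, assuming the fractional bit-widths are kept as PyTorch parameters (the container and names are hypothetical):

```python
import torch

def freeze_bit_widths(bit_params):
    """Round every learnable fractional bit-width to its nearest integer and
    stop updating it, so the remaining epochs only finetune the model weights."""
    for b in bit_params:                  # e.g. a list of scalar nn.Parameter objects
        with torch.no_grad():
            b.copy_(torch.round(b))
        b.requires_grad_(False)
```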

3.4 Kernel-wise mixed precision quantization

As mentioned above, our algorithm is not restricted to layer-wise quantization, but also supports kernel-wise quantization. Here, one kernel means the weight parameters associated with a convolution filter that produces a single-channel feature map. The weight kernels in a convolution layer are assigned different bit-width parameters $b_w^i$, where $i$ is the index of the weight kernel. For each convolution of one weight kernel with the input tensor, the input tensor could in principle also be assigned a different bit-width. However, quantizing the input tensor with different bit-widths for different weight kernels incurs a large computation overhead. We therefore use the same bit-width $b_a$ for the input tensor in the computation with all weight kernels. Note that [15] adopted the same strategy for kernel-wise quantization. For a 2D convolution layer, the number of BitOPs associated with the fractional bit-widths is then given by $\sum_i b_w^i\, b_a\, c_{in}\, k_w\, k_h\, w_{out}\, h_{out}$, and the model size can be represented as $\sum_i b_w^i\, c_{in}\, k_w\, k_h$.
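The kernel-wise accounting simply replaces the single weight bit-width with a sum over per-kernel bit-widths; a sketch with hypothetical dimensions:

```python
def kernelwise_conv_bitops(b_w_kernels, b_a, c_in, k_w, k_h, w_out, h_out):
    """BitOPs of a 2D conv layer with a private weight bit-width per output-channel
    kernel and one shared activation bit-width b_a (the Section 3.4 accounting)."""
    per_kernel_mults = c_in * k_w * k_h * w_out * h_out
    return sum(b_i * b_a * per_kernel_mults for b_i in b_w_kernels)

def kernelwise_conv_size_bits(b_w_kernels, c_in, k_w, k_h):
    """Model size in bits: each kernel's parameter count times its own bit-width."""
    return sum(b_i * c_in * k_w * k_h for b_i in b_w_kernels)

# Example: four output kernels with fractional bit-widths, shared 4-bit activations.
print(kernelwise_conv_bitops([3.2, 4.7, 0.4, 5.0], 4.0, 128, 3, 3, 14, 14))
```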

3.5 Network pruning through quantization with 0-bit

The flexibility and differentiability of the bit-width enables not only kernel-wise quantization, but also channel pruning jointly with quantization. To this end, in addition to generalizing the bit-width to fractional values, we add a definition of 0-bit weight quantization, for which we modify the definition in Eq. (3) to

$$q_0(x) = 0. \tag{8}$$

In this case, weights with 0 bits are quantized to 0, and the corresponding output channel can be removed without affecting the network, which is essentially network pruning. Thus, by allowing 0 bits for weights together with kernel-wise quantization, channel pruning can be performed jointly with quantization. In practice, 0-bit is added as a candidate bit-width for the weight matrices. Compared with [15], which adopts a similar strategy with 0 bits for pruning, our method is differentiable with respect to the bit-width, including 0-bit, and achieves one-shot mixed precision quantization and pruning. We conduct experiments on this joint optimization of pruning and quantization in Section 4.3.
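Extending the quantizer with the 0-bit case is a small change; the sketch below (straight-through details omitted for brevity, per-kernel bit-width assumed to be a scalar tensor) shows how a kernel whose bit-width falls to 0 has all of its weights mapped to zero and can therefore be pruned:

```python
import torch

def q_int_with_zero(x, bits):
    """Eq. (3) extended by Eq. (8): 0-bit quantization maps everything to zero."""
    if int(bits) == 0:
        return torch.zeros_like(x)
    s = 2 ** int(bits) - 1
    return torch.round(x * s) / s

def q_frac_with_zero(x, b):
    """Eq. (1) with the candidate list extended down to 0 bits, so the
    interpolation can smoothly drive a kernel toward being pruned."""
    b = torch.clamp(b, min=0.0)           # b: scalar tensor, one bit-width per kernel
    lo = torch.floor(b)
    w_hi = b - lo
    return (1 - w_hi) * q_int_with_zero(x, lo) + w_hi * q_int_with_zero(x, lo + 1)
```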

4 Experiments

In this section, we conduct quantitative experiments using FracBits and compare it with previous quantization approaches including uniform quantization algorithms PACT [3], LQNet [30], SAT [12] and mixed precision quantization algorithms HAQ [23], AutoQ [15], DNAS [25], US [7], DQ [22]. We first compare our method with previous approaches on layer-wise mixed precision quantization in Section 4.2. Then we compare our method with a previous kernel-wise mixed precision method AutoQ on kernel-wise precision search. Finally, we conduct an ablation study on the hyper-parameters and configurations.

4.1 Implementation details

We build our algorithm on the recent quantization algorithms PACT [3] and SAT [12]. PACT jointly learns quantized weights and activations, where weights are quantized using the DoReFa scheme [32]. SAT is an improved version of the PACT algorithm with gradient calibration and scale adjusting. The coefficient $\lambda$ is a critical parameter for the proper convergence of the network towards the required resource constraint: models under mild or aggressive constraints may call for different values of $\lambda$, and different types of resource constraints (computational cost and model size) have different scales and thus require different scales of the regularization term. However, in our experiments we find that our algorithm is not very sensitive to the value of $\lambda$; we use one fixed value for all computation-cost-constrained experiments and another fixed value for all model-size-constrained experiments. We also find it beneficial to initialize the model at a point close to the target resource constraint, which facilitates more exploration close to the target model space. We control the initial state through the fractional bits of each layer, setting them according to $b_0$, where $b_0$ is the bit-width achieving a similar resource constraint in the corresponding uniformly quantized model. For all experiments with both weights and activations quantized, we set the candidate bit-widths to 2-8. For all experiments with only weights quantized, we set the candidate bit-widths to 1-8. For kernel-wise quantization experiments, we also add 0 and 1 bits to the candidate bit-widths for weights to allow channel pruning as described in Section 3.5. Since the first and the last layers of a neural network have a crucial impact on the performance of the model, we fix the bit-widths of the first and last layers to 8 bits following [12].

For all experiments, we use a cosine learning rate scheduler without restarts. The learning rate is initially set to 0.05 and updated every iteration for a total of 150 epochs. We use the SGD optimizer with a momentum of 0.9 without dampening, together with weight decay. The batch size is set to 2048 for all models. The warmup strategy suggested in [6] is also adopted, linearly increasing the learning rate every iteration up to the initial value for the first five epochs before the cosine annealing scheduler takes over. Bit-width search is conducted in the first 120 epochs after the warmup stage. At the 121st epoch, all fractional bit-widths are rounded to integer bits, and the network is further finetuned for the remaining 30 epochs. This rounding introduces a sudden change to the network, but we do not observe any glitch in the training loss, presumably due to the insignificant difference between the quantized values of two neighboring bit-widths. For kernel-wise precision quantization, we initialize the models from their layer-wise precision counterparts, which stabilizes the kernel-wise bit-width search. We adopt this strategy for all of our kernel-wise precision models.
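A sketch of the learning-rate schedule described above (warmup followed by cosine annealing, updated every iteration); the constants mirror the text, and the function itself is our illustration rather than the authors' code:

```python
import math

TOTAL_EPOCHS, WARMUP_EPOCHS, BASE_LR = 150, 5, 0.05

def lr_at(global_step, steps_per_epoch):
    """Per-iteration learning rate: linear warmup to the base LR over the first
    five epochs, then cosine annealing without restart over the remaining epochs."""
    t = global_step / steps_per_epoch            # fractional epoch index
    if t < WARMUP_EPOCHS:
        return BASE_LR * t / WARMUP_EPOCHS
    progress = (t - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```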

bit-width 3 4 5
method top-1 top-5 bitops top-1 top-5 bitops top-1 top-5 bitops
MobileNet V1 PACT [3] 62.6 84.1 5.73 70.3 89.2 9.64 71.1 89.6 14.66
HAQ [23] - - - 67.4 87.9 - 70.6 89.8 -
FracBits-PACT 68.0 87.8 5.80 70.5 89.1 11.12 71.0 89.4 16.19
SAT [12] 67.1 87.1 5.73 71.3 89.9 9.64 71.9 90.3 14.66
FracBits-SAT 69.2 88.3 5.80 71.5 89.8 10.38 72.1 90.4 16.21
MobileNet V2 PACT [3] 67.0 87.0 3.32 70.6 89.2 5.35 71.2 89.8 7.96
HAQ [23] - - - 67 87.3 - 70.9 89.9 -
AutoQ [15] - - - 69 89.4 - - - -
DQ [22] - - - 69.7 - - - - -
FracBits-PACT 67.4 87.5 3.55 70.7 89.3 5.78 71.3 89.6 8.78
SAT [12] 67.2 87.3 3.32 70.8 89.7 5.35 72.0 90.1 7.96
FracBits-SAT 67.8 87.8 3.61 71.3 89.7 5.88 72.3 90.3 8.70
Table 2: Comparison of computation cost constrained layer-wise quantization of our method and previous approaches on ImageNet with MobileNet V1/V2. Note that accuracies are in % and bitops are in B (billion).
bit-width 3 4 FP
method top-1 (drop) bitops top-1 (drop) bitops top-1
PACT [3] 68.3 -1.9 22.83 69.2 -1.0 34.70 70.2
LQNet [30] 68.2 -2.1 22.83 69.3 -1.0 34.70 70.3
DNAS [25] 68.7 -2.3 24.34* 70.6 -0.4 35.17* 71.0
DQ [22] - - - 70.1 -0.2 - 70.3
AutoQ [15] - - - 68.2 -1.7 - 69.9
US [7] 69.4 -1.5 22.11* 70.5 -0.4 33.74* 70.9
FracBits-PACT 69.0 -1.2 23.15 69.9 -0.3 38.16 70.2
SAT [12] 69.3 -0.9 22.83 70.3 0.1 34.70 70.2
FracBits-SAT 69.7 -0.5 23.30 70.4 0.2 37.55 70.2
Table 3: Comparison of computation cost constrained layer-wise quantization of our method and previous approaches on ImageNet with ResNet18. Note that the bitops of US [7] and DNAS [25] do not include the first and last layers in their papers, and US reports different bitops numbers from ours; we estimate their bitops based on the difference from uniformly quantized models (marked with *). The drop column is relative to each method's own full precision model. Accuracies are in % and bitops are in B (billion).
Figure 2: Layer-wise mixed precision quantization for 3-bit MobileNet V2 (a) and ResNet18 (b).

4.2 Quantization with layer-wise precision

We compare FracBits with previous quantization algorithms on layer-wise precision search. We conducted experiments on MobileNet V1/V2 and ResNet18. Since FracBits can be used for both computation cost constrained and model size constrained bit-width search, we conduct experiments on both settings to validate the effectiveness of our approach.

Table 2 shows the results of layer-wise computation-cost-constrained quantization on MobileNet V1/V2. We report results of our method with two quantization schemes, PACT and SAT, and denote the two variants as FracBits-PACT and FracBits-SAT. The previous methods HAQ [23] and AutoQ [15] use PACT as the quantization scheme, while DQ uses a scheme similar to PACT with learnable clipping bounds. FracBits-PACT outperforms HAQ on both MobileNet V1 and V2, and outperforms AutoQ and DQ on MobileNet V2. SAT is a strong uniform quantization baseline which already outperforms all previous mixed precision methods; for example, it achieves 71.9% on 5-bit MobileNet V1 and 72.1% on 5-bit MobileNet V2, almost closing the gap between full precision models and quantized ones. We believe that validating the effectiveness of our FracBits algorithm on top of SAT is helpful for probing the limits of mixed precision quantization algorithms. FracBits-SAT achieves slightly better performance than SAT on 4- and 5-bit MobileNet V1/V2, and significantly better results on 3-bit models, which proves its effectiveness on strong uniform quantization baselines: it has a 2.1% absolute gain on 3-bit MobileNet V1 and a 0.6% gain on 3-bit MobileNet V2. Note that the BitOPs of models using FracBits are slightly higher than the resource target, mostly within the upper range of the BitOPs constraint. This is due to the straightforward rounding operation used to discretize the bit-widths, which does not re-optimize the bit-width allocation according to the resource constraint. A more sophisticated method such as integer programming could be used in the bit-width discretization step to enforce a tight resource constraint, which we leave as future work.

We show comparisons with more algorithms on ResNet18 in Table 3. Here we compare with the uniform precision approaches PACT and LQNet and the mixed precision approaches DNAS, DQ, AutoQ, and US. Except for DQ, all mixed precision approaches use PACT as the quantization scheme. Since the methods report different accuracies for their full precision (FP) models, we also list the top-1 accuracy of the FP model reported in each paper and report the relative accuracy drop for each method. Comparing absolute accuracy, FracBits-PACT achieves performance comparable to state-of-the-art mixed precision methods. Note that DNAS uses several training tricks to boost performance, so its results are not directly comparable to the others. Comparing relative accuracy drop, our method achieves the least performance drop on 3-bit ResNet18, and is among the top two with the least drop on 4-bit ResNet18. Enhanced by the SAT quantization method, FracBits-SAT further improves over the SAT baseline, with only a 0.5% accuracy drop on 3-bit ResNet18 and a 0.2% performance gain on 4-bit ResNet18.

To gain a more intuitive understanding of the learned bit-width structure from our algorithm, we plot the bit-widths of different layers for 3-bit MobileNet V2 and ResNet18 in Fig. 2. We find that models searched under a computational cost constraint generally use higher bit-widths in the later stages of the network. Also, in MobileNet V2, the depth-wise convolutions end up with higher bit-widths than the point-wise convolutions due to their low computation cost.

bit-width 2 3 4
method top-1 top-5 size top-1 top-5 size top-1 top-5 size
MobileNet V1 DeepComp [8] 37.6 64.3 - 65.9 86.9 - 71.1 89.8 -
HAQ [23] 57.1 81.9 - 67.7 88.2 - 71.7 90.4 -
SAT [12] 66.3 86.8 1.83 70.7 89.5 2.22 72.1 90.2 2.62
FracBits-SAT 69.3 88.8 1.87 71.1 89.8 2.26 72.3 90.3 2.66
MobileNet V2 DeepComp [8] 58.1 82.2 - 68.0 88.0 - 71.2 89.9 -
HAQ [23] 66.8 87.3 - 70.9 89.8 - 71.5 90.2 -
SAT [12] 66.8 87.2 1.83 71.1 89.9 2.11 72.1 90.6 2.38
FracBits-SAT 69.9 89.3 1.84 72.2 90.4 2.18 72.5 90.5 2.46
Table 4: Comparison of model size constrained layer-wise quantization of our method and previous approaches on ImageNet with MobileNet V1/V2. Note that accuracies are in % and sizes are in MB.

For model-size-constrained quantization, we compare with the previous methods Deep Compression [8] and HAQ and the uniform quantization approach SAT in Table 4. Our FracBits-SAT consistently outperforms the mixed precision method HAQ and the strong uniform quantization baseline SAT at all experimented bit-widths. Notably, FracBits has a 3.0% and 3.1% absolute gain in top-1 accuracy over SAT on 2-bit MobileNet V1 and V2, respectively. Even in the 4-bit setting, where quantized models already achieve performance similar to full precision ones, FracBits outperforms SAT by a 0.2% margin on MobileNet V1 and a 0.4% gain on MobileNet V2 in top-1 accuracy.

Figure 3: Kernel-wise mixed precision quantization for 3-bit MobileNet V2 (a) and ResNet18 (b).
bit-width 3 4
method top-1 top-5 bitops top-1 top-5 bitops
MobileNet V2 AutoQ [15] - - - 70.8 90.3 -
FracBits-PACT-K 68.3 87.8 3.64 70.9 89.5 5.70
SAT [12] 67.2 87.3 3.32 70.8 89.7 5.35
FracBits-SAT-K 68.4 88.2 3.64 71.4 89.9 5.62
ResNet18 AutoQ [15] - - - 69.8 88.4 -
FracBits-PACT-K 69.2 88.4 25.19 70.1 89.1 37.95
SAT [12] 69.3 88.9 22.83 70.3 89.5 34.70
FracBits-SAT-K 69.7 88.8 25.06 71.0 89.6 37.99
Table 5: Comparison of computation cost constrained kernel-wise quantization of our method and previous approaches on MobileNet V2 and ResNet18. Note that accuracies are in % and bitops are in B (billion).

4.3 Quantization with kernel-wise precision

In this section, we experiment with quantization at kernel-wise precision. Among previous approaches, only AutoQ [15] reports experiments with kernel-wise precision, so we compare with it. In Table 5, we denote kernel-wise FracBits based on PACT and SAT as FracBits-PACT-K and FracBits-SAT-K, and compare them with AutoQ and the uniform precision method SAT. FracBits-PACT-K achieves slightly better results than AutoQ on MobileNet V2 and ResNet18, validating the effectiveness of our one-shot differentiable approach compared to a complex RL-based method. FracBits-SAT-K outperforms SAT significantly, with 1.2% and 0.6% increases in top-1 accuracy on 3- and 4-bit MobileNet V2, respectively, and with 0.4% and 0.7% increases on 3- and 4-bit ResNet18, respectively. Compared to its layer-wise precision counterpart, FracBits-SAT-K outperforms FracBits-SAT by 0.6% and 0.1% on 3- and 4-bit MobileNet V2, respectively. It also outperforms layer-wise FracBits-SAT by 0.6% on 4-bit ResNet18, showing that kernel-wise quantization can further improve over strong layer-wise mixed-precision models. Fig. 3 illustrates the bit-width distribution against layer indices for 3-bit MobileNet V2 and ResNet18. We can see that 3-bit MobileNet V2 has a number of pruned weight kernels in the early layers and the intermediate bottleneck layers, while 3-bit ResNet18 has almost no pruned kernels. We believe that the point-wise convolutions in MobileNet V2 have a much larger computation cost than the depth-wise convolutions and thus receive a larger resource penalty during optimization, which leads to more pruned kernels.

bit-width 3 4 5
top-1 top-5 bitops top-1 top-5 bitops top-1 top-5 bitops
FracBits-SAT 69.2 88.3 5.80 71.5 89.8 10.38 72.1 90.4 16.21
w/ gumbel 69.0 88.3 6.76 70.4 89.3 11.19 72.0 90.0 16.18
λ=0.2 68.9 88.1 5.78 71.6 90.0 10.94 72.0 90.3 16.39
λ=0.05 69.4 88.3 7.34 71.1 89.7 9.83 72.0 90.2 16.31
Table 6: A comparative study of our method with different configurations and hyper-parameters on MobileNet V1 for computation cost constrained quantization. "w/ gumbel" denotes models using gumbel softmax to sample stochastic bit-widths during searching. Note that accuracies are in % and bitops are in B (billion).

4.4 Ablation Study

We present ablation studies of our method in this section. Our framework is clean and involves only one hyper-parameter, λ, so we show a comparative study using different values of λ. Another variant we compare with uses stochastic bit-widths following [25] instead of deterministic fractional bits in the searching stage. To this end, we utilize gumbel softmax [10] to generate stochastic bit-widths based on the original fractional bit-widths; λ is set to the same value as in the deterministic approach, and the temperature of the gumbel softmax is set to 1. The results are shown in Table 6. With a smaller value of λ = 0.05, FracBits-SAT yields a large discrepancy from the desired BitOPs on 3-bit MobileNet V1, meaning that small values of λ may fail to reach the desired resource constraint due to a weak penalty. With a larger value of λ = 0.2, the models still perform similarly to those with the default λ, demonstrating the robustness of our method within a proper range of λ. We have also experimented with much larger values of λ, which cause a rapid descent of the bit-width values at the beginning of training and produce poor results. With gumbel softmax, the results are slightly worse than the original FracBits-SAT, demonstrating the advantage of our deterministic approach. We also notice that the model with gumbel softmax does not meet the desired computation budget for the 3- and 4-bit models.
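For reference, one way the gumbel-softmax variant could be parameterized is sketched below; this exact formulation is our assumption rather than a detail given in the paper. It treats the two interpolation weights of the neighboring integer bit-widths as a two-way categorical distribution and samples soft mixing weights from it.

```python
import torch
import torch.nn.functional as F

def sample_stochastic_bit_width(b, tau=1.0):
    """Sample a stochastic (still soft) bit-width around the fractional value b
    using gumbel softmax over the two neighboring integer bit-widths."""
    lo = torch.floor(b)
    probs = torch.stack([1 - (b - lo), b - lo])        # weights of floor(b) and floor(b)+1
    logits = torch.log(probs.clamp(min=1e-8))
    w = F.gumbel_softmax(logits, tau=tau, hard=False)  # stochastic mixing weights
    return w[0] * lo + w[1] * (lo + 1)
```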

5 Conclusion

We propose a new formulation named FracBits for mixed precision quantization. We formulate the bit-width of each layer or kernel as a continuous learnable parameter, instantiated by interpolating the quantized parameters of the two neighboring integer bit-widths. Our method enables differentiable optimization of layer-wise or kernel-wise bit-widths in a single pass of training, and can further be combined with channel pruning by formulating a pruned channel as 0-bit quantization. With only a regularization term to penalize extra computational resources during training, our method is able to discover proper bit-width configurations for different models, outperforming previous mixed precision and uniform precision approaches. We believe our method will motivate further research on low-precision neural networks and low-cost computational models.

References

  • [1] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §1, §3.1.
  • [2] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §1.
  • [3] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §1, §2, §4.1, Table 2, Table 3, §4.
  • [4] A. T. Elthakeb, P. Pilligundla, A. Yazdanbakhsh, S. Kinzer, and H. Esmaeilzadeh (2018) ReLeQ: A reinforcement learning approach for deep quantization of neural networks. In NeurIPS, Cited by: Table 1, §1, §1, §2.
  • [5] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018) Morphnet: fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595. Cited by: §2.
  • [6] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. External Links: Link, 1706.02677 Cited by: §4.1.
  • [7] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. CoRR abs/1904.00420. External Links: Link, 1904.00420 Cited by: Table 1, §1, §2, §3.2, Table 3, §4.
  • [8] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2, §4.2, Table 4.
  • [9] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §2.
  • [10] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §4.4.
  • [11] Q. Jin, L. Yang, and Z. Liao (2019) AdaBits: neural network quantization with adaptive bit-widths. arXiv preprint arXiv:1912.09666. Cited by: §1.
  • [12] Q. Jin, L. Yang, and Z. Liao (2019) Towards efficient training for neural network quantization. arXiv preprint arXiv:1912.10207. Cited by: §1, §2, §4.1, Table 2, Table 3, Table 4, Table 5, §4.
  • [13] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §2.
  • [14] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §2.
  • [15] Q. Lou, L. Liu, M. Kim, and L. Jiang (2019) AutoQB: automl for network quantization and binarization on mobile devices. CoRR abs/1902.05690. External Links: Link, 1902.05690 Cited by: Table 1, §1, §1, §2, §3.2, §3.4, §3.5, §4.2, §4.3, Table 2, Table 3, Table 5, §4.
  • [16] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §2.
  • [17] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang (2019) AtomNAS: fine-grained end-to-end neural architecture search. arXiv preprint arXiv:1912.09640. Cited by: §2.
  • [18] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: §2.
  • [19] M. Nikolić, G. B. Hacene, C. Bannon, A. D. Lascorz, M. Courbariaux, Y. Bengio, V. Gripon, and A. Moshovos (2020) BitPruning: learning bitlengths for aggressive and accurate quantization. arXiv preprint arXiv:2002.03090. Cited by: §3.2.
  • [20] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1.
  • [21] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §1.
  • [22] S. Uhlich, L. Mauch, K. Yoshiyama, F. Cardinaux, J. A. Garcia, S. Tiedemann, T. Kemp, and A. Nakamura (2020) MIXED precision dnns: all you need is a good parametrization. ICLR. Cited by: Table 1, §1, §1, §3.1, §3.2, Table 2, Table 3, §4.
  • [23] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) Haq: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8612–8620. Cited by: Table 1, §1, §1, §2, §3.2, §4.2, Table 2, Table 4, §4.
  • [24] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §1.
  • [25] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090. Cited by: Table 1, §1, §1, §2, §3.2, §4.4, Table 3, §4.
  • [26] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §1.
  • [27] J. Ye, X. Lu, Z. Lin, and J. Z. Wang (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124. Cited by: §2.
  • [28] J. Yu and T. S. Huang (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. CoRR abs/1903.11728. External Links: Link, 1903.11728 Cited by: §2.
  • [29] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: §2.
  • [30] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 365–382. Cited by: Table 3, §4.
  • [31] S. Zhou, Y. Wang, H. Wen, Q. He, and Y. Zou (2017) Balanced quantization: an effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology 32 (4), pp. 667–682. Cited by: §1.
  • [32] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2, §4.1.