Generalizable Mixed-Precision Quantization via Attribution Rank Preservation

08/05/2021 · Ziwei Wang et al. · Tsinghua University

In this paper, we propose a generalizable mixed-precision quantization (GMPQ) method for efficient inference. Conventional methods require the datasets for bitwidth search and model deployment to be consistent in order to guarantee policy optimality, leading to a heavy search cost on challenging large-scale datasets in realistic applications. On the contrary, our GMPQ searches for a mixed-precision quantization policy that generalizes to large-scale datasets using only a small amount of data, so that the search cost is significantly reduced without performance degradation. Specifically, we observe that correctly locating network attribution is a general ability for accurate visual analysis across different data distributions. Therefore, besides pursuing higher model accuracy and lower complexity, we preserve the attribution rank consistency between the quantized models and their full-precision counterparts via efficient capacity-aware attribution imitation for generalizable mixed-precision quantization strategy search. Extensive experiments show that our method obtains a competitive accuracy-complexity trade-off compared with state-of-the-art mixed-precision networks at a significantly reduced search cost. The code is available at https://github.com/ZiweiWangTHU/GMPQ.git.


1 Introduction

Deep neural networks have achieved state-of-the-art performance across a large number of vision tasks such as image classification [15, 44, 18], object detection [40, 28, 14], face recognition [6, 49, 29] and many others. However, mobile devices with limited storage and computational resources are not capable of processing deep models due to their extremely high complexity. Therefore, it is desirable to design network compression strategies according to the hardware configuration.

Recently, several network compression techniques have been proposed, including pruning [27, 16, 33], quantization [59, 30, 51], efficient architecture design [20, 17, 37] and low-rank decomposition [7, 57, 25]. Among these approaches, quantization constrains the network weights and activations to limited bitwidths for memory saving and fast processing. In order to fully utilize the hardware resources, mixed-precision quantization [50, 9, 3] is presented to search the bitwidth of each layer so that the optimal accuracy-complexity trade-off is obtained. However, conventional mixed-precision quantization requires the consistency of datasets between bitwidth search and network deployment to guarantee policy optimality, which causes a significant search burden for automated model compression on large-scale datasets such as ImageNet [5]. For example, it usually takes several GPU days to acquire the expected quantization strategy for ResNet18 on ImageNet [50, 3].

In this paper, we present a GMPQ method to learn a generalizable mixed-precision quantization strategy via attribution rank preservation for efficient inference. Unlike existing methods, which require dataset consistency between quantization policy search and model deployment, our method enables the acquired quantization strategy to generalize across various datasets. The quantization policy searched on small datasets achieves promising performance on challenging large-scale datasets, so that the policy search cost is significantly reduced. Figure 1(a) shows the difference between our GMPQ and conventional mixed-precision networks. More specifically, we observe that correctly locating the network attribution benefits visual analysis across various input data distributions. Therefore, besides considering model accuracy and complexity, we enforce the quantized networks to imitate the attribution of their full-precision counterparts. Instead of directly minimizing the Euclidean distance between the attribution of quantized and full-precision models, we preserve their attribution rank consistency so that the attribution of the quantized networks can adaptively adjust its distribution without capacity insufficiency. Figure 1(b) demonstrates the attribution computed by Grad-CAM [42] for mixed-precision networks with optimal and random quantization policies and their full-precision counterparts, where the mixed-precision networks with the optimal bitwidth assignment acquire an attribution rank more consistent with the full-precision model. Experimental results show that our GMPQ obtains a competitive accuracy-complexity trade-off on ImageNet and PASCAL VOC compared with state-of-the-art mixed-precision quantization methods in only several GPU hours.

2 Related Work

Fixed-precision quantization:

Network quantization has aroused extensive interest in computer vision and machine learning due to the significant reduction in computation and storage complexity, and existing methods are divided into one-bit and multi-bit quantization. Binary networks constrain the network weights and activations to one bit at an extremely high compression ratio. For the former, Hubara et al. [19] and Courbariaux et al. [4] replaced multiply-add operations with xnor-bitcount via weight and activation binarization, and applied the straight-through estimator (STE) to optimize the network parameters. Rastegari et al. [39] leveraged scaling factors for weight and activation hashing to minimize the quantization error. Liu et al. [30] added extra shortcuts between consecutive convolutional layers to enhance the network capacity. Wang et al. [54] mined channel-wise interactions to eliminate inconsistent signs in feature maps. Qin et al. [36] minimized the parameter entropy in inference and utilized soft quantization in backward propagation to enhance information retention. Since the performance gap between full-precision and binary networks is large, multi-bit networks have been presented for a better accuracy-efficiency trade-off. Zhu et al. [61] trained an adaptive quantizer for network ternarization according to the weight distribution. Gong et al. [12] applied differentiable approximations for quantized networks to ensure the consistency between the optimization and the objective. Li et al. [24] proposed four-bit networks for object detection with hardware-friendly implementations, and overcame training instabilities via custom batch normalization and outlier removal. However, fixed-precision quantization ignores the redundancy variance across different layers and leads to a suboptimal accuracy-complexity trade-off in quantized networks.

Mixed-precision quantization: Mixed-precision networks assign different bitwidths to the weights and activations of various layers, which accounts for the redundancy variance across components to obtain the optimal accuracy-efficiency trade-off given hardware configurations. Existing mixed-precision quantization methods are mainly based on either non-differentiable or differentiable search. For the former, Wang et al. [50] presented a reinforcement learning model to learn the optimal bitwidth for the weights and activations of each layer, where model accuracy and complexity were considered in the reward function. Wang et al. [52] jointly searched the pruning ratio, the bitwidth and the architecture of a lightweight model from a hypernet via evolutionary algorithms. Since non-differentiable methods require a huge search cost to obtain the optimal bitwidth, differentiable search approaches have also been introduced for mixed-precision quantization. Cai et al. [3] designed a hypernet where each convolutional layer consists of parallel blocks in different bitwidths, which yields its output by summing all blocks with various weights; optimizing the block weights by back-propagation and selecting the bitwidth with the largest value during inference achieves the optimal accuracy-complexity trade-off. Moreover, Yu et al. [56] further presented a barrier penalty to ensure that the searched models satisfy the complexity constraint. Yang et al. [55] decoupled the constrained optimization via the Alternating Direction Method of Multipliers (ADMM), and Wang et al. [53] utilized the variational information bottleneck to search for the proper bitwidth and pruning ratio. Habi et al. [13] and van Baalen et al. [48] directly optimized the quantization intervals for bitwidth selection of mixed-precision networks. However, differentiable search for mixed-precision quantization still needs a large amount of time due to the optimization of the large hypernet. To address this, Dong et al. [9, 8] designed bitwidth assignment rules according to Hessian information. Nevertheless, such hand-crafted rules require expert knowledge and cannot adapt to the input data.

Attribution methods: Attribution aims to produce human-understandable explanations for the predictions of neural networks. The contribution of each input component is calculated by examining its influence on the network output, and is displayed as an attribution over 2D feature maps. Early works [10, 43, 60] analyzed the sensitivity and significance of each pixel by leveraging its gradients with respect to the optimization objective. Recent studies on attribution extraction can be categorized into two types: gradient-based and relevance-based methods. For the former, Guided Backprop [45], Grad-CAM [42] and integrated gradients [46] combine pixel gradients across different locations and channels for information fusion, so that more accurate attribution is obtained. For the latter, Zhang et al. [58] constructed a hierarchical probabilistic model to mine the correlation between input components and the prediction. In this paper, we observe that the attribution rank consistency of feature maps between vanilla and compressed networks benefits visual analysis across various data distributions, which we exploit for generalizable mixed-precision quantization with significant search cost reduction.

3 Approach

In this section, we first introduce the mixed-precision quantization framework, which suffers from a significant search burden. Then we demonstrate the observation that the attribution rank consistency between full-precision and quantized models benefits visual analysis across various data distributions. Finally, we present the generalizable mixed-precision quantization via attribution rank preservation.

Figure 2: The attribution of mixed-precision networks of different capacities with the optimal quantization policy. For networks in low bitwidth, the attribution is more concentrated although the rank remains similar. The concentrated attribution keeps the model capacity sufficient by removing redundant attention, so that promising performance is achieved.
Figure 3: The pipeline of our GMPQ. The hypernet consists of multiple parallel branches including convolutional filters and activations in different bitwidths. The outputs of the branches are added with learnable importance weights to construct the output feature maps. In addition to the cross-entropy and complexity losses, we present an additional generalization loss to optimize the network weights and branch importance weights, which enables the quantization policy searched on small datasets to generalize to large-scale datasets.

3.1 Mixed-Precision Quantization

The goal of mixed-precision quantization is to search the proper bitwidth of each layer in order to achieve the optimal accuracy-complexity trade-off given hardware configurations. Let $w$ be the quantized network weight and $q$ be the quantization policy that assigns different bitwidths to the weights and activations of each layer. $C(q)$ denotes the computational complexity of the compressed network under the quantization policy $q$. The search objective is written in the following bi-level optimization form:

$\min_{q}\ \mathcal{L}_{val}\big(w^{*}(q), q\big) \quad \mathrm{s.t.}\quad w^{*}(q)=\arg\min_{w}\ \mathcal{L}_{train}(w, q),\quad C(q)\le C_{0} \qquad (1)$

where $\mathcal{L}_{val}$ and $\mathcal{L}_{train}$ denote the task loss on the validation data and the training data, and $C_{0}$ stands for the resource constraint of the deployment platform. In order to obtain the optimal mixed-precision networks, the quantization policy and the network weights are alternately optimized until convergence or a maximal iteration number. Since the distribution of the training and validation data for policy search significantly affects the acquired quantization strategy, existing methods require the training and validation data for quantization policy search and those for model deployment to come from the same dataset. However, the compressed models are usually deployed on large-scale datasets such as ImageNet, which causes a heavy computational burden during quantization policy search. To address this, an ideal solution is to search for a quantization policy whose optimality is independent of the data distribution. The search objective is then modified as follows:

$\min_{q}\ \mathbb{E}_{x\in D_{d}}\big[\mathcal{L}\big(w^{*}(q), q, x\big)\big] \quad \mathrm{s.t.}\quad w^{*}(q)=\arg\min_{w}\ \mathbb{E}_{x\in D_{s}}\big[\mathcal{L}(w, q, x)\big],\quad C(q)\le C_{0} \qquad (2)$

where $\mathcal{L}(w, q, x)$ represents the task loss for network weight $w$, quantization policy $q$ and input $x$. $D_{d}$ denotes the dataset containing all validation images encountered in deployment, and $D_{s}$ the dataset of given training images used for bitwidth search, where the distribution gap between $D_{d}$ and $D_{s}$ may be sizable. Because $D_{d}$ is not accessible in realistic applications, it is desirable to find an alternative way to solve for a generalizable mixed-precision quantization policy.
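For concreteness, the following is a minimal sketch of the alternating bi-level search loop implied by (1) and (2); the hypernet interface (`weight_parameters`, `arch_parameters`, `task_loss`, `expected_bops`, `discretize_policy`) is a hypothetical placeholder rather than the authors' implementation.

```python
import torch

def search_quantization_policy(hypernet, train_loader, val_loader,
                               bops_budget, epochs=50, lam=0.1):
    """Sketch of the alternating bi-level search: network weights are fitted on the
    training split (inner problem), while the soft bitwidth assignment is updated on
    the validation split (outer problem) under a complexity penalty."""
    weight_opt = torch.optim.SGD(hypernet.weight_parameters(), lr=0.01, momentum=0.9)
    arch_opt = torch.optim.Adam(hypernet.arch_parameters(), lr=1e-3)

    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # Inner problem: update network weights under the current soft policy.
            weight_opt.zero_grad()
            hypernet.task_loss(x_tr, y_tr).backward()
            weight_opt.step()

            # Outer problem: update branch-importance (architecture) weights,
            # penalizing expected complexity above the budget (a tensor is assumed).
            arch_opt.zero_grad()
            loss = hypernet.task_loss(x_val, y_val) \
                + lam * torch.relu(hypernet.expected_bops() - bops_budget)
            loss.backward()
            arch_opt.step()

    # Discretize: keep the bitwidth branch with the largest importance per layer.
    return hypernet.discretize_policy()
```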

3.2 Attribution Rank Consistency

Since acquiring all validation images encountered in deployment is impossible, we solve for the generalizable mixed-precision quantization policy in an alternative way. We observe that correctly locating the network attribution benefits visual analysis across various input data distributions. The feature attribution is formulated according to the loss gradient with respect to each feature map, where the importance $\alpha_{k}^{c}$ of the $k$-th feature map in the last convolutional layer for recognizing objects of the $c$-th class is written as follows:

$\alpha_{k}^{c}=\dfrac{1}{Z}\sum_{i}\sum_{j}\dfrac{\partial y^{c}}{\partial A_{ij}^{k}} \qquad (3)$

where $y^{c}$ denotes the output score of input $x$ for the $c$-th class, and $A_{ij}^{k}$ represents the activation element in the $i$-th row and $j$-th column of the $k$-th feature map of the last convolutional layer. $\frac{1}{Z}$ is a scaling factor that normalizes the importance, where $Z$ is the number of spatial locations in the feature map. With the feature map visualization technique presented in Grad-CAM [42], we obtain the feature attribution of the network: we sum the feature maps from different channels with the attention weights calculated in (3), and remove the influence of opposing pixels via the ReLU operation. The feature attribution in the last convolutional layer with respect to the $c$-th class is formulated as follows:

$M^{c}=\mathrm{ReLU}\Big(\sum_{k}\alpha_{k}^{c}A^{k}\Big) \qquad (4)$

The feature attribution only preserves the supportive features for the given class, and the negative features related to other classes are removed.
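The Grad-CAM computation in (3) and (4) can be sketched in a few lines of PyTorch; `gradcam_attribution` is an illustrative helper under the assumption that the last-layer feature maps are retained with gradients enabled, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def gradcam_attribution(features, logits, target_class):
    """Grad-CAM-style attribution following Eqs. (3)-(4): channel weights are the
    spatially averaged gradients of the class score w.r.t. the last convolutional
    feature maps, and the attribution is the ReLU of the weighted channel sum.

    features: (N, C, H, W) last-layer conv activations (must require grad).
    logits:   (N, num_classes) outputs computed from `features`.
    target_class: int or (N,) tensor of class indices."""
    scores = logits[torch.arange(logits.size(0)), target_class].sum()
    grads = torch.autograd.grad(scores, features, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)          # alpha_k^c of Eq. (3)
    attribution = F.relu((weights * features).sum(dim=1))   # M^c of Eq. (4), (N, H, W)
    return attribution
```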

The full-precision networks achieve high performance by paying more attention to important parts of the image, while the attribution of quantized models deviates from that of the full-precision networks due to their limited capacity. Figure 2 demonstrates the attribution of networks with the optimal quantization policy at different complexities, where the attribution of networks with lower capacity is more concentrated due to the limited carried information. As the capacity gap between the quantized networks and their full-precision counterparts is large, directly enforcing attribution consistency fails to remove the redundant attention in the compressed model, which causes capacity insufficiency and performance degradation. Therefore, we preserve the attribution rank consistency between the quantized networks and their full-precision counterparts for generalizable mixed-precision quantization policy search. The attribution rank illustrates the importance order of different pixels for model predictions. Constraining attribution rank consistency enables the quantized networks to focus on important regions while adaptively adjusting the attribution distribution without capacity insufficiency.
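As a concrete illustration of what rank consistency measures (not the training loss defined in Section 3.3), the agreement between two attribution maps on the top-k full-precision pixels can be computed as follows; the function name and the choice of mean absolute rank displacement are ours.

```python
import torch

def topk_rank_displacement(attr_q, attr_fp, k=100):
    """Mean absolute rank displacement between two attribution maps, measured on the
    k pixels the full-precision model deems most important."""
    q, fp = attr_q.flatten(), attr_fp.flatten()
    # Double argsort turns values into ranks (0 = most important pixel).
    rank_q = torch.argsort(torch.argsort(q, descending=True))
    rank_fp = torch.argsort(torch.argsort(fp, descending=True))
    topk = torch.topk(fp, k).indices
    return (rank_q[topk] - rank_fp[topk]).abs().float().mean()
```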

3.3 Generalizable Mixed-Precision Quantization via Attribution Rank Preservation

Our GMPQ can be leveraged as a plug-and-play module for both non-differentiable and differentiable search methods. Since differentiable methods achieve a competitive accuracy-complexity trade-off compared with non-differentiable approaches, we employ the differentiable search framework [3, 56, 55] to select the optimal mixed-precision quantization policy. We design a hypernet with $M_{l}$ and $N_{l}$ parallel branches for the convolution filters and feature maps of the $l$-th layer, where $M_{l}$ and $N_{l}$ represent the sizes of the search spaces for weight and activation bitwidths. The parallel branches are assigned various bitwidths, and their outputs are summed with importance weights $\gamma_{m}^{l}$ and $\beta_{n}^{l}$ for the weight and activation branches respectively to form the intermediate feature maps. Figure 3 depicts the pipeline of our GMPQ. The feed-forward propagation of the $l$-th layer in the $L$-layer hypernet is written as follows:

$F^{l}=\sum_{m=1}^{M_{l}}\sum_{n=1}^{N_{l}}\gamma_{m}^{l}\,\beta_{n}^{l}\,\big(W_{m}^{l}\ast A_{n}^{l}\big) \qquad (5)$

where $F^{l}$ denotes the output intermediate feature maps of the $l$-th layer, $A_{n}^{l}$ represents the output of the $n$-th activation quantization branch in the $l$-th layer, and $W_{m}^{l}\ast(\cdot)$ is the convolution operation of the $m$-th filter branch in the $l$-th layer. $\beta_{n}^{l}$ and $\gamma_{m}^{l}$ stand for the importance weights of the quantized activation and filter branches in the $l$-th layer.
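To make (5) concrete, below is a minimal sketch of one hypernet layer with parallel weight/activation bitwidth branches mixed by softmax importance weights; the uniform straight-through quantizer and the candidate bitwidths are illustrative simplifications, not the paper's exact quantizer or search space. Because convolution is bilinear, mixing the quantized filters and activations first and convolving once is equivalent to summing all parallel branches, which corresponds to the compositional convolution mentioned in Section 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def uniform_quantize(x, bits, signed=False):
    """Uniform quantizer with a straight-through estimator (an illustrative choice)."""
    levels = 2 ** bits - 1
    if signed:
        xq = torch.round(x.clamp(-1, 1) * (levels // 2)) / (levels // 2)
    else:
        xq = torch.round(x.clamp(0, 1) * levels) / levels
    return x + (xq - x).detach()  # identity gradient through the rounding

class MixedPrecisionConv(nn.Module):
    """One hypernet layer as in Eq. (5): parallel weight/activation bitwidth branches
    are mixed by learnable importance weights (softmax over gamma and beta)."""
    def __init__(self, in_ch, out_ch, w_bits=(2, 4, 8), a_bits=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.05)
        self.w_bits, self.a_bits = w_bits, a_bits
        self.gamma = nn.Parameter(torch.zeros(len(w_bits)))  # filter-branch importance
        self.beta = nn.Parameter(torch.zeros(len(a_bits)))   # activation-branch importance

    def forward(self, x):
        gamma = torch.softmax(self.gamma, dim=0)
        beta = torch.softmax(self.beta, dim=0)
        # Compositional convolution: mix quantized filters and activations first and
        # convolve once; by bilinearity this equals summing all parallel branches.
        w = sum(g * uniform_quantize(self.weight, b, signed=True)
                for g, b in zip(gamma, self.w_bits))
        a = sum(s * uniform_quantize(x, b) for s, b in zip(beta, self.a_bits))
        return F.conv2d(a, w, padding=1)
```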

Figure 4: The attribution of the full-precision networks under different $\alpha$. The attribution is more concentrated for larger $\alpha$, while the rank remains the same.

As we observe that the attribution rank consistency between quantized networks and their full-precision counterparts enables the compressed models to possess the discriminative power of the vanilla model regardless of the data distribution, we impose the attribution rank consistency constraint on the optimal quantization policy search in addition to the accuracy and efficiency objectives. In order to obtain the optimal accuracy-complexity trade-off for generalizable mixed-precision quantization, the learning objective is formulated in the Lagrangian form:

$\min_{w,\,q}\ \mathcal{L}_{cls}(w,q,x)+\lambda_{1}\,\mathcal{L}_{comp}(q)+\lambda_{2}\,\mathcal{L}_{gen}(w,q,x) \qquad (6)$

where $\mathcal{L}_{cls}$, $\mathcal{L}_{comp}$ and $\mathcal{L}_{gen}$ respectively denote the classification, complexity and generalization risks of the network with weight $w$ and quantization policy $q$ for input $x$. $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters that balance the importance of the complexity and generalization risks in the overall learning objective. In differentiable policy search, $\mathcal{L}_{cls}$ is represented by the objective of the vision task, and $\mathcal{L}_{comp}$ is defined as the expected bit-operations (BOPs) [53, 1, 3]:

$\mathcal{L}_{comp}=\sum_{l=1}^{L}\sum_{m=1}^{M_{l}}\sum_{n=1}^{N_{l}}\gamma_{m}^{l}\,\beta_{n}^{l}\,\dfrac{b_{m}^{l}\,b_{n}^{l}}{32\times 32}\,\mathrm{BOPs}_{l} \qquad (7)$

where $b_{m}^{l}$ and $b_{n}^{l}$ stand for the bitwidths of the $m$-th weight branch and the $n$-th activation branch in the $l$-th layer, and $\mathrm{BOPs}_{l}$ denotes the BOPs of the $l$-th layer of the full-precision network. $L$ represents the number of layers of the quantized model.
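The expected-BOPs risk in (7) is straightforward to compute from the soft branch assignments; the following sketch assumes per-layer lists of importance logits, candidate bitwidths and full-precision BOPs, with names chosen for illustration.

```python
import torch

def expected_bops(gammas, betas, w_bits, a_bits, bops_fp, fp_bits=32):
    """Differentiable complexity risk as in Eq. (7): expected bit-operations of the
    hypernet under the current soft bitwidth assignment.

    gammas, betas: per-layer importance logits for weight / activation branches.
    w_bits, a_bits: per-layer lists of candidate bitwidths.
    bops_fp: per-layer full-precision BOPs (e.g. MACs * 32 * 32)."""
    total = 0.0
    for g_l, b_l, wb_l, ab_l, bops_l in zip(gammas, betas, w_bits, a_bits, bops_fp):
        p_w = torch.softmax(g_l, dim=0)
        p_a = torch.softmax(b_l, dim=0)
        wb = torch.tensor(wb_l, dtype=torch.float32)
        ab = torch.tensor(ab_l, dtype=torch.float32)
        # Expected weight-bit times activation-bit product for this layer.
        exp_bits = (p_w * wb).sum() * (p_a * ab).sum()
        total = total + bops_l * exp_bits / (fp_bits * fp_bits)
    return total
```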

As the attribution rank consistency between the full-precision networks and their quantized counterparts enhances the generalizability of the mixed-precision quantization policy, we define the generalization risk in the following form:

$\mathcal{L}_{gen}=\sum_{(i,j)\in\Omega_{k}}\big\|R(a_{ij}^{c})-R(\hat{a}_{ij}^{c})\big\|_{1}$

where $a_{ij}^{c}$ represents the pixel attribution in the $i$-th row and $j$-th column of the feature maps with respect to the $c$-th class in the quantized network, and $\hat{a}_{ij}^{c}$ denotes the corresponding variable in the full-precision model. $c$ is the label of the input $x$, and $\|\cdot\|_{1}$ is the element-wise $\ell_{1}$ norm. $R(\cdot)$ stands for the attribution rank, which equals $r$ if the element is the $r$-th largest in the attribution map. $\Omega_{k}$ contains only the top-$k$ pixels with the highest attribution in the full-precision network, since low attribution is usually caused by noise without clear information. Since minimizing the generalization risk over discrete ranks is NP-hard, we present the capacity-aware attribution imitation to optimize the objective differentiably.

We enforce the attribution of the mixed-precision networks to approach the $\alpha$-power normalized attribution of the full-precision models, because this transformation preserves the rank consistency while adaptively adjusting the attribution distribution according to the network capacity. The generalization risk is rewritten as follows for efficient optimization:

$\mathcal{L}_{gen}=\sum_{(i,j)\in\Omega_{k}}\Big\|a_{ij}^{c}-\dfrac{(\hat{a}_{ij}^{c})^{\alpha}}{\sum_{(u,v)\in\Omega_{k}}(\hat{a}_{uv}^{c})^{\alpha}}\Big\|_{1}$

A large $\alpha$ leads to concentrated attribution and vice versa, so we assign a larger $\alpha$ to hypernets with lower capacity, with hyperparameters $\eta$ and $\tau$ for an $L$-layer network:

$\alpha=\dfrac{\eta\,\tau\,L}{\sum_{l=1}^{L}b_{w}^{l}\,b_{a}^{l}} \qquad (8)$

where $b_{w}^{l}$ and $b_{a}^{l}$ denote the expected weight and activation bitwidths of the $l$-th layer under the current policy.
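Under the reconstruction above, a hedged sketch of the capacity-aware attribution imitation could look as follows; the $\alpha$-power renormalized target, the top-k mask and the hyperparameter names `eta` and `tau` follow our reading of the text and (8) rather than the authors' released code.

```python
import torch

def attribution_imitation_loss(attr_q, attr_fp, alpha, k=100, eps=1e-8):
    """Capacity-aware attribution imitation (sketch): pull the quantized attribution
    towards an alpha-power, renormalized version of the full-precision attribution,
    restricted to the top-k full-precision pixels. A power > 1 concentrates the
    target while leaving its rank unchanged."""
    q = attr_q.flatten()
    fp = attr_fp.flatten().clamp_min(0)
    target = fp.pow(alpha)
    target = target / (target.sum() + eps)   # rank-preserving renormalization
    q = q / (q.sum() + eps)
    topk = torch.topk(fp, k).indices
    return (q[topk] - target[topk]).abs().sum()  # L1 distance on top-k pixels

def capacity_aware_alpha(w_bits, a_bits, eta=1.0, tau=16.0):
    """Assign a larger alpha to lower-capacity policies in the spirit of Eq. (8);
    eta and tau are hypothetical stand-ins for the paper's hyperparameters."""
    mean_bit_product = sum(wb * ab for wb, ab in zip(w_bits, a_bits)) / len(w_bits)
    return eta * tau / mean_bit_product
```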

Since the classification, complexity and generalization risks are all differentiable, we optimize the hypernet weights and the branch importance weights iteratively in an end-to-end manner. When the hypernet converges or reaches the maximum number of training epochs, the bitwidth represented by the branch with the largest importance weight in each layer is selected to form the final quantization policy. We then finetune the quantized network on the deployment data to acquire the final model used in realistic applications. GMPQ searches quantization policies on small datasets with the generalization constraint, which leads to high performance on large-scale datasets in deployment with significantly reduced search cost.

4 Experiments

In this section, we conduct extensive experiments on image classification and object detection. We first introduce the implementation details of our GMPQ. In the ablation study, we then evaluate the influence of the value assignment strategy for $\alpha$ in the capacity-aware attribution imitation, investigate the effects of different terms in the risk function, and study the impact of the dataset used for quantization policy search. Finally, we compare our method with state-of-the-art mixed-precision networks on image classification and object detection with respect to accuracy, model complexity and search cost.

4.1 Datasets and Implementation Details

We first introduce the datasets on which we carried out experiments. For quantization policy search, we employed small datasets including CIFAR-10 [23], Cars [22], Flowers [34], Aircraft [32], Pets [35] and Food [2]. CIFAR-10 contains 60,000 images divided into 10 categories with an equal number of samples per class, and Flowers has 8,189 images spread over 102 flower categories. Cars includes 16,185 images of 196 car types at the level of make, model and year, and Aircraft contains 10,000 images covering 100 aircraft model variants. Pets was created with 37 dog and cat categories with roughly 200 images per class, and Food contains 101,000 high-resolution photos of menu items from restaurants across 101 food categories.

For mixed-precision network deployment, we evaluated the quantized networks on ImageNet for image classification and on PASCAL VOC for object detection. ImageNet [5] contains approximately 1.2 million training images and 50k validation images from 1,000 categories. For training, random regions were cropped from the resized images whose shorter side was 256, and center crops were used during inference. The PASCAL VOC dataset [11] collects images from 20 categories; we finetuned our mixed-precision networks on the VOC 2007 and VOC 2012 trainval sets containing about 16.5k images, and tested our GMPQ on the VOC 2007 test set consisting of about 5k samples. Following [11], we used the mean average precision (mAP) as the evaluation metric.

We trained our GMPQ with the MobileNet-V2 [41], ResNet18 and ResNet50 [15] architectures for image classification, and applied VGG16 [44] with the SSD framework [28] and ResNet18 with Faster R-CNN [40] for object detection. The candidate bitwidths for network weights and activations differ between MobileNet-V2 and the other architectures. Inspired by [3], we utilized compositional convolution, whose filters are the weighted sum of the quantized filters in different bitwidths, so that expensive parallel convolutions are avoided. We updated the importance weights of the different branches and the network parameters simultaneously. The hyperparameters $\eta$ and $\tau$ in the capacity-aware attribution imitation were fixed across experiments. Meanwhile, we only minimized the distance between the attribution of the quantized networks and the $\alpha$-power normalized attribution of the full-precision model for the top-$k$ pixels with the highest attribution in the real-valued model. For evaluation on ImageNet, we finetuned the mixed-precision networks with the Adam [21] optimizer, decaying the learning rate twice during training. For object detection, the backbone was pretrained on ImageNet and then finetuned on PASCAL VOC with the same hyperparameter settings as in image classification. The same batch size was used in all experiments. By adjusting the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ in (6), we obtained mixed-precision networks at different accuracy-complexity trade-offs.

(a) Fixed strategy
(b) Capacity-aware strategy
Figure 5: The accuracy-complexity trade-offs of (a) the fixed and (b) the capacity-aware value assignment strategies for $\alpha$ in (8), where the hyperparameters are also varied.

4.2 Ablation Study

In order to investigate the effectiveness of attribution rank preservation, we assign the value of $\alpha$ in the capacity-aware attribution imitation with different strategies. By varying the hyperparameters $\lambda_{1}$ and $\lambda_{2}$ in the overall risk (6), we evaluated the influence of the classification, complexity and generalization risks on model accuracy and efficiency. We conducted the ablation study on ImageNet with the ResNet18 architecture, and searched the mixed-precision quantization policy on CIFAR-10 for the above investigations. Moreover, we searched the generalizable mixed-precision quantization policy on different small datasets to study their effect on the accuracy-complexity trade-off and the search cost.

(a) Varying $\lambda_{1}$ and $\lambda_{2}$
(b) Varying datasets
Figure 6: (a) The accuracy-complexity trade-off for different $\lambda_{2}$, where $\lambda_{1}$ is varied to select various network capacities. (b) The top-1 accuracy on ImageNet, the BOPs and the average search cost of the mixed-precision quantization policies searched on different small datasets, where GH means GPU hours for the search cost.

Effectiveness of different value assignment strategies for $\alpha$: To investigate the influence of the value assignment strategy for $\alpha$ on the accuracy-complexity trade-off, we searched the mixed-precision quantization policy with fixed and capacity-aware values of $\alpha$. For fixed $\alpha$, we set several values that constrain the attribution of the quantized networks with various degrees of concentration. The capacity-aware strategy assigns $\alpha$ according to (8), where the product of $\eta$ and $\tau$ was varied in the ablation study. Figures 5(a) and 5(b) demonstrate the accuracy-complexity trade-offs of the fixed and capacity-aware value assignment strategies for $\alpha$ respectively with different hyperparameters. The optimal accuracy-complexity curve of the capacity-aware strategy outperforms that of the fixed strategy, which indicates the importance of varying the attribution concentration with the network capacity. For the fixed strategy, a medium $\alpha$ outperforms the other values: a small $\alpha$ causes attention redundancy for quantized networks with limited capacity, and a large $\alpha$ leads to information loss that fails to utilize the network capacity. For the capacity-aware strategy, an appropriate product of $\eta$ and $\tau$ results in the optimal accuracy-complexity trade-off. For hypernets whose product of weight and activation bitwidths is sufficiently large, the network capacity is comparable with their full-precision counterparts, since they mimic the attribution of the real-valued models without extra concentration.

Methods Param. BOPs Comp. Top-1 Top-5 Cost.
ResNet18
Full-precision
ALQ
HAWQ
GMPQ
APoT
GMPQ
ALQ
EdMIPS
EdMIPS-C
GMPQ
ResNet50
Full-precision
HAWQ
HAQ
BP-NAS
GMPQ
HMQ
BP-NAS
GMPQ
EdMIPS
EdMIPS-C
GMPQ
MobileNet-V2
Full-precision
RQ
GMPQ
HAQ
HAQ-C
DJPQ
GMPQ
HMQ
DQ
GMPQ
Table 1: The top-1/top-5 accuracy (%) on ImageNet, model storage cost (M), model computational cost (G BOPs) and search cost (GPU hours) for networks with different capacities and mixed-precision quantization policies. Param. denotes the model storage cost, and Comp. denotes the compression ratio of BOPs.

Influence of hyperparameters in the overall risk (6): In order to verify the effectiveness of the generalization risk, we report the performance with different $\lambda_{2}$. Meanwhile, we also varied the hyperparameter $\lambda_{1}$ to obtain different accuracy-complexity trade-offs. Figure 6(a) illustrates the results, where a medium $\lambda_{2}$ achieves the best trade-off curve. A large $\lambda_{2}$ fails to leverage the supervision from annotated labels, and a small $\lambda_{2}$ ignores the attribution rank consistency that enhances the generalization ability of the mixed-precision quantization policy. With the increase of $\lambda_{1}$, the resulting policy prefers lightweight architectures and vice versa. For different $\lambda_{2}$, the same assignment of $\lambda_{1}$ selects similar BOPs in the accuracy-complexity trade-off.

Effects of datasets for quantization policy search: We searched the mixed-precision quantization policy on different small datasets including CIFAR-10, Cars, Flowers, Aircraft, Pets and Food to study the effects on model accuracy and efficiency. Figure 6(b) demonstrates the top-1 accuracy and the BOPs of the optimal mixed-precision networks obtained on the different small datasets. We also show the average search cost across all computational cost constraints in the legend, where GH means GPU hours. The mixed-precision networks searched on CIFAR-10 achieve the best accuracy-efficiency trade-off, because CIFAR-10 is the largest of these datasets and provides the most sufficient visual information; moreover, the gap in object categories between CIFAR-10 and ImageNet is the smallest compared with the other datasets. Searching the quantization policy on Aircraft requires the highest search cost due to its large image size.

Methods Param. BOPs Comp. mAP Cost
SSD & VGG16
Full-precision
HAQ
HAQ-C
EdMIPS
EdMIPS-C
GMPQ
HAQ
HAQ-C
EdMIPS
EdMIPS-C
GMPQ
Faster R-CNN & ResNet18
Full-precision
HAQ
HAQ-C
EdMIPS
EdMIPS-C
GMPQ
HAQ
HAQ-C
EdMIPS
EdMIPS-C
GMPQ
Table 2: The mAP (%) on PASCAL VOC, model storage cost (M), model computational cost (G BOPs) and search cost (GPU hours) for backbone networks with different capacities and mixed-precision quantization policies. Param. denotes the model storage cost, and Comp. denotes the compression ratio of BOPs.

4.3 Comparison with State-of-the-art Methods

In this section, we compare our GMPQ with state-of-the-art fixed-precision models including APoT [26] and RQ [31], and mixed-precision networks including ALQ [38], HAWQ [9], EdMIPS [3], HAQ [50], BP-NAS [56], HMQ [13] and DQ [47], on ImageNet for image classification and on PASCAL VOC for object detection. We also provide the performance of full-precision models for reference. The accuracy-complexity trade-offs of the baselines are copied from their original papers or obtained by our implementation with the officially released code, and the search cost was evaluated by re-running the compared methods. We searched the optimal quantization policy on CIFAR-10 for deployment on ImageNet and PASCAL VOC.

Results on ImageNet: Table 1 illustrates the comparison of storage and computational cost, compression ratio of BOPs, top-1 and top-5 accuracy and search cost across different architectures and mixed-precision quantization methods. HAQ-C and EdMIPS-C denote HAQ and EdMIPS variants that searched the quantization policy on CIFAR-10 and evaluated the obtained policy on ImageNet. Comparing the accuracy-complexity trade-off with the baseline methods across architectures, we conclude that our GMPQ achieves a competitive accuracy-complexity trade-off under various resource constraints with significantly reduced search cost. Meanwhile, we also searched the quantization policy on CIFAR-10 directly using HAQ and EdMIPS. Although their search cost is reduced sizably, the resulting accuracy-complexity trade-off is far from optimal across various resource constraints, which indicates the lack of generalization ability of the quantization policy obtained by conventional methods. Our GMPQ preserves the attribution rank consistency during quantization policy search with acceptable computational overhead, and enables the mixed-precision quantization policy searched on small datasets to generalize to large-scale datasets. For EdMIPS, the search cost reduction is more obvious for ResNet50, since the heavy architecture requires more training epochs to converge when trained on large-scale datasets.

Results on PASCAL VOC: We employed the SSD detection framework with the VGG16 architecture and the Faster R-CNN detector with a ResNet18 backbone to evaluate our GMPQ on object detection. Table 2 shows the results of various mixed-precision networks. Compared with the accuracy-complexity trade-offs searched on PASCAL VOC by state-of-the-art methods, our GMPQ acquires competitive results with significantly reduced search cost on both detection frameworks and backbones. Moreover, directly compressing the networks with the quantization policy searched by HAQ and EdMIPS on CIFAR-10 degrades the performance significantly. Since the mixed-precision networks are required to be pretrained on ImageNet, the search cost decrease on PASCAL VOC is more sizable than that on ImageNet. Because the two-stage detector Faster R-CNN has stronger discriminative power for accurate attribution generation, its accuracy-complexity trade-off is better than that of the one-stage detector.

5 Conclusion

In this paper, we have proposed a generalizable mixed-precision quantization method called GMPQ for efficient inference. The presented GMPQ searches the quantization policy on small datasets with attribution rank preservation, so that the acquired quantization strategy generalizes to large-scale datasets and achieves the optimal accuracy-complexity trade-off with significantly reduced search cost. Extensive experiments demonstrate the superiority of GMPQ compared with state-of-the-art methods.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, and Grant U1713214, in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI), and in part by a grant from the Institute for Guo Qiang, Tsinghua University.

References

  • [1] J. Bethge, C. Bartz, H. Yang, Y. Chen, and C. Meinel (2020) MeliusNet: can binary neural networks achieve mobilenet-level accuracy?. arXiv preprint arXiv:2001.05936. Cited by: §3.3.
  • [2] L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101 – mining discriminative components with random forests. In ECCV, pp. 446–461. Cited by: §4.1.
  • [3] Z. Cai and N. Vasconcelos (2020) Rethinking differentiable search for mixed-precision neural networks. In CVPR, pp. 2349–2358. Cited by: §1, §2, §3.3, §3.3, §4.1, §4.3.
  • [4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §2.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §4.1.
  • [6] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699. Cited by: §1.
  • [7] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736. Cited by: §1.
  • [8] Z. Dong, Z. Yao, Y. Cai, D. Arfeen, A. Gholami, M. W. Mahoney, and K. Keutzer (2019) Hawq-v2: hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852. Cited by: §2.
  • [9] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2019) Hawq: hessian aware quantization of neural networks with mixed-precision. In ICCV, pp. 293–302. Cited by: §1, §2, §4.3.
  • [10] D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §2.
  • [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §4.1.
  • [12] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In ICCV, pp. 4852–4861. Cited by: §2.
  • [13] H. V. Habi, R. H. Jennings, and A. Netzer (2020) HMQ: hardware friendly mixed precision quantization block for cnns. arXiv preprint arXiv:2007.09952. Cited by: §2, §4.3.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §4.1.
  • [16] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397. Cited by: §1.
  • [17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708. Cited by: §1.
  • [19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In NIPS, pp. 4114–4122. Cited by: §2.
  • [20] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360. Cited by: §1.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [22] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In ICCVW, pp. 554–561. Cited by: §4.1.
  • [23] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
  • [24] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, and R. Fan (2019) Fully quantized network for object detection. In CVPR, pp. 2810–2819. Cited by: §2.
  • [25] Y. Li, S. Gu, C. Mayer, L. V. Gool, and R. Timofte (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In CVPR, pp. 8018–8027. Cited by: §1.
  • [26] Y. Li, X. Dong, and W. Wang (2020) Additive powers-of-two quantization: a non-uniform discretization for neural networks. ICLR. Cited by: §4.3.
  • [27] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In NIPS, pp. 2178–2188. Cited by: §1.
  • [28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §1, §4.1.
  • [29] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §1.
  • [30] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, pp. 722–737. Cited by: §1, §2.
  • [31] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling (2018) Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875. Cited by: §4.3.
  • [32] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §4.1.
  • [33] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz (2019) Importance estimation for neural network pruning. In CVPR, pp. 11264–11272. Cited by: §1.
  • [34] M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §4.1.
  • [35] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar (2012) Cats and dogs. In CVPR, pp. 3498–3505. Cited by: §4.1.
  • [36] H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song (2020) Forward and backward information retention for accurate binary neural networks. In CVPR, pp. 2250–2259. Cited by: §2.
  • [37] Z. Qin, Z. Li, Z. Zhang, Y. Bao, G. Yu, Y. Peng, and J. Sun (2019) Thundernet: towards real-time generic object detection on mobile devices. In ICCV, pp. 6718–6727. Cited by: §1.
  • [38] Z. Qu, Z. Zhou, Y. Cheng, and L. Thiele (2020) Adaptive loss-aware quantization for multi-bit networks. In CVPR, pp. 7988–7997. Cited by: §4.3.
  • [39] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §2.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §1, §4.1.
  • [41] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §4.1.
  • [42] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626. Cited by: §1, §2, §3.2.
  • [43] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §2.
  • [44] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.1.
  • [45] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2.
  • [46] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In ICML, pp. 3319–3328. Cited by: §2.
  • [47] S. Uhlich, L. Mauch, K. Yoshiyama, F. Cardinaux, J. A. Garcia, S. Tiedemann, T. Kemp, and A. Nakamura (2019) Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452. Cited by: §4.3.
  • [48] M. van Baalen, C. Louizos, M. Nagel, R. A. Amjad, Y. Wang, T. Blankevoort, and M. Welling (2020) Bayesian bits: unifying quantization and pruning. arXiv preprint arXiv:2005.07093. Cited by: §2.
  • [49] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274. Cited by: §1.
  • [50] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) Haq: hardware-aware automated quantization with mixed precision. In CVPR, pp. 8612–8620. Cited by: §1, §2, §4.3.
  • [51] P. Wang, Q. Chen, X. He, and J. Cheng (2020) Towards accurate post-training network quantization via bit-split and stitching. In ICML, pp. 9847–9856. Cited by: §1.
  • [52] T. Wang, K. Wang, H. Cai, J. Lin, Z. Liu, H. Wang, Y. Lin, and S. Han (2020) Apq: joint search for network architecture, pruning and quantization policy. In CVPR, pp. 2078–2087. Cited by: §2.
  • [53] Y. Wang, Y. Lu, and T. Blankevoort (2020) Differentiable joint pruning and quantization for hardware efficiency. In ECCV, pp. 259–277. Cited by: §2, §3.3.
  • [54] Z. Wang, J. Lu, C. Tao, J. Zhou, and Q. Tian (2019) Learning channel-wise interactions for binary convolutional neural networks. In CVPR, pp. 568–577. Cited by: §2.
  • [55] H. Yang, S. Gui, Y. Zhu, and J. Liu (2020) Automatic neural network compression by sparsity-quantization joint learning: a constrained optimization-based approach. In CVPR, pp. 2178–2188. Cited by: §2, §3.3.
  • [56] H. Yu, Q. Han, J. Li, J. Shi, G. Cheng, and B. Fan (2020) Search what you want: barrier penalty NAS for mixed precision quantization. In ECCV, pp. 1–16. Cited by: §2, §3.3, §4.3.
  • [57] X. Yu, T. Liu, X. Wang, and D. Tao (2017) On compressing deep models by low rank and sparse decomposition. In CVPR, pp. 7370–7379. Cited by: §1.
  • [58] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. IJCV 126 (10), pp. 1084–1102. Cited by: §2.
  • [59] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang (2019) Improving neural network quantization without retraining using outlier channel splitting. In ICML, pp. 7543–7552. Cited by: §1.
  • [60] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929. Cited by: §2.
  • [61] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.