1 Introduction
Deep neural networks have achieved state-of-the-art performance across a large number of vision tasks such as image classification [15, 44, 18], object detection [40, 28, 14], face recognition [6, 49, 29] and many others. However, mobile devices with limited storage and computational resources cannot run deep models because of their extremely high complexity. Therefore, it is desirable to design the network compression strategy according to the hardware configuration.
Recently, several network compression techniques have been proposed, including pruning [27, 16, 33], quantization [59, 30, 51], efficient architecture design [20, 17, 37] and low-rank decomposition [7, 57, 25]. Among these approaches, quantization constrains the network weights and activations to a limited bitwidth for memory saving and fast processing. In order to fully utilize the hardware resources, mixed-precision quantization [50, 9, 3] searches the bitwidth of each layer so that the optimal accuracy-complexity trade-off is obtained. However, conventional mixed-precision quantization requires the same dataset for bitwidth search and network deployment to guarantee policy optimality, which causes a significant search burden for automated model compression on large-scale datasets such as ImageNet [5]. For example, it usually takes several GPU days to acquire the expected quantization strategy for ResNet18 on ImageNet [50, 3].
In this paper, we present GMPQ, a method that learns a generalizable mixed-precision quantization strategy via attribution rank preservation for efficient inference. Unlike existing methods, which require dataset consistency between quantization policy search and model deployment, our method enables the acquired quantization strategy to generalize across various datasets. The quantization policy searched on small datasets achieves promising performance on challenging large-scale datasets, so that the policy search cost is significantly reduced. Figure 1(a) shows the difference between our GMPQ and conventional mixed-precision networks. More specifically, we observe that correctly locating the network attribution benefits visual analysis for various input data distributions. Therefore, in addition to considering model accuracy and complexity, we enforce the quantized networks to imitate the attribution of their full-precision counterparts. Instead of directly minimizing the Euclidean distance between the attribution of quantized and full-precision models, we preserve their attribution rank consistency so that the attribution of quantized networks can adaptively adjust its distribution without capacity insufficiency. Figure 1(b) shows the attribution computed by Grad-CAM [42] for mixed-precision networks with optimal and random quantization policies and for their full-precision counterparts, where the mixed-precision networks with the optimal bitwidth assignment acquire an attribution rank more consistent with that of the full-precision model. Experimental results show that our GMPQ obtains a competitive accuracy-complexity trade-off on ImageNet and PASCAL VOC compared with the state-of-the-art mixed-precision quantization methods in only several GPU hours.
2 Related Work
Fixed-precision quantization: Network quantization has attracted extensive interest in computer vision and machine learning due to the significant reduction in computation and storage complexity, and existing methods are divided into one-bit and multi-bit quantization. Binary networks constrain the network weights and activations to one bit at an extremely high compression ratio. For the former, Hubara et al. [19] and Courbariaux et al. [4] replaced the multiply-add operations with xnor-bitcount via weight and activation binarization, and applied the straight-through estimator (STE) to optimize network parameters. Rastegari et al. [39] leveraged scaling factors for weight and activation hashing to minimize the quantization errors. Liu et al. [30] added extra shortcuts between consecutive convolutional layers to enhance the network capacity. Wang et al. [54] mined the channel-wise interactions to eliminate inconsistent signs in feature maps. Qin et al. [36] minimized the parameter entropy in inference and utilized soft quantization in backward propagation to enhance information retention. Since the performance gap between full-precision and binary networks is huge, multi-bit networks are presented for a better accuracy-efficiency trade-off. Zhu et al. [61] trained an adaptive quantizer for network ternarization according to the weight distribution. Gong et al. [12] applied differentiable approximations for quantized networks to ensure the consistency between the optimization and the objective. Li et al. [24] proposed four-bit networks for object detection with hardware-friendly implementations, and overcame the training instabilities by custom batch normalization and outlier removal. However, fixed-precision quantization ignores the redundancy variance across different layers and leads to a suboptimal accuracy-complexity trade-off in quantized networks.
Mixed-precision quantization: Mixed-precision networks assign different bitwidths to weights and activations in various layers, which accounts for the redundancy variance in different components to obtain the optimal accuracy-efficiency trade-off given hardware configurations. Existing mixed-precision quantization methods are mainly based on either non-differentiable or differentiable search. For the former, Wang et al. [50] presented a reinforcement learning model to learn the optimal bitwidth for weights and activations of each layer, where the model accuracy and complexity were considered in the reward function. Wang et al. [52] jointly searched the pruning ratio, the bitwidth and the architecture of the lightweight model from a hypernet via evolutionary algorithms. Since the non-differentiable methods require a huge search cost to obtain the optimal bitwidth, differentiable search approaches have also been introduced in mixed-precision quantization. Cai et al. [3] designed a hypernet where each convolutional layer consisted of parallel blocks in different bitwidths, which yielded the output by summing all blocks with various weights. Optimizing the block weights by back-propagation and selecting the bitwidth with the largest value during inference achieved the optimal accuracy-complexity trade-off. Moreover, Yu et al. [56] presented a barrier penalty to ensure that the searched models were within the complexity constraint. Yang et al. [55] decoupled the constrained optimization via the Alternating Direction Method of Multipliers (ADMM), and Wang et al. [53] utilized the variational information bottleneck to search for the proper bitwidth and pruning ratio. Habi et al. [13] and Van et al. [48] directly optimized the quantization intervals for bitwidth selection of mixed-precision networks. However, differentiable search for mixed-precision quantization still needs a large amount of time due to the optimization of the large hypernet. To address this, Dong et al. [9, 8] designed bitwidth assignment rules according to Hessian information. Nevertheless, the handcrafted rules require expert knowledge and cannot adapt to the input data.
Attribution methods: Attribution aims to produce human-understandable explanations for the predictions of neural networks. The contribution of each input component is calculated by examining its influence on the network output, which is displayed as the attribution in 2D feature maps. Early works [10, 43, 60] analyzed the sensitivity and the significance of each pixel by leveraging its gradients with respect to the optimization objective. Recent studies on attribution extraction can be categorized into two types: gradient-based and relevance-based methods. For the former, Guided Backprop [45], Grad-CAM [42] and Integrated Gradients [46] combined the pixel gradients across different locations and channels for information fusion, so that more accurate attribution was obtained. For the latter, Zhang et al. [58] constructed a hierarchical probabilistic model to mine the correlation between the input components and the prediction. In this paper, we observe that the attribution rank consistency of feature maps between vanilla and compressed networks benefits visual analysis for various data distributions, which is extended to generalizable mixed-precision quantization for significant search cost reduction.
3 Approach
In this section, we first introduce the mixed-precision quantization framework, which suffers from a significant search burden. Then we present the observation that the attribution rank consistency between full-precision and quantized models benefits visual analysis for various data distributions. Finally, we present generalizable mixed-precision quantization via attribution rank preservation.
3.1 Mixed-Precision Quantization
The goal of mixed-precision quantization is to search for the proper bitwidth of each layer in order to achieve the optimal accuracy-complexity trade-off given hardware configurations. Let be the quantized network weight and be the quantization policy that assigns different bitwidths to weights and activations in various layers. means the computational complexity of the compressed networks with the quantization policy . The search objective function is written in the following bilevel optimization form:
(1) 
where and depict the task loss on the validation data and the training data. stands for the resource constraint of the deployment platform. In order to obtain the optimal mixed-precision networks, the quantization policy and the network weights are alternately optimized until convergence or until the maximal iteration number is reached. Since the distribution of the training and validation data for policy search significantly affects the acquired quantization strategy, existing methods require the training and validation data for quantization policy search and those for model deployment to come from the same dataset. However, the compressed models are usually deployed on large-scale datasets such as ImageNet, which causes a heavy computational burden during quantization policy search. To address this, an ideal solution is to search for a quantization policy whose optimality is independent of the data distribution. The search objective is then modified as follows:
(2) 
where represents the task loss for network weight , quantization policy and input . depicts the dataset containing all validation images in deployment and denotes the dataset including the given training images in bitwidth search, where the distribution gap between and may be sizable. Because is intractable in realistic applications, it is desirable to find an alternative way to solve for the generalizable mixed-precision quantization policy.
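For readers who want the two search objectives in symbols, one plausible way to write them is sketched below. The notation is an assumption introduced here for illustration and is not taken from the text: $W$ denotes the quantized weights, $Q$ the quantization policy, $\Omega(Q)$ the complexity under policy $Q$, $\Omega_0$ the resource budget, $\mathcal{L}_{val}$ and $\mathcal{L}_{train}$ the validation and training losses, $\ell$ the per-sample task loss, and $\mathcal{D}_{val}$, $\mathcal{D}_{train}$ the deployment-time validation set and the search-time training set.

```latex
% Conventional bilevel search objective (assumed notation)
\min_{Q} \; \mathcal{L}_{val}\big(W^{*}(Q),\, Q\big)
\quad \text{s.t.} \quad
W^{*}(Q) = \arg\min_{W} \mathcal{L}_{train}(W, Q),
\qquad \Omega(Q) \le \Omega_{0}

% Variant whose optimality should not depend on the search data (assumed notation)
\min_{Q} \; \mathbb{E}_{x \sim \mathcal{D}_{val}} \big[\, \ell\big(W^{*}(Q),\, Q,\, x\big) \big]
\quad \text{s.t.} \quad
W^{*}(Q) = \arg\min_{W} \mathbb{E}_{x \sim \mathcal{D}_{train}} \big[\, \ell(W, Q, x) \big],
\qquad \Omega(Q) \le \Omega_{0}
```

In this reading, the difficulty is that $\mathcal{D}_{val}$ is unavailable at search time, which is why a surrogate for generalization is needed.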
3.2 Attribution Rank Consistency
Since acquiring all validation images in deployment is impossible, we solve for the generalizable mixed-precision quantization policy in an alternative way. We observe that correctly locating the network attribution benefits visual analysis for various input data distributions. The feature attribution is formulated according to the loss gradient with respect to each feature map, where the importance of the feature map in the last convolutional layer for recognizing the objects from the class is written as follows:
(3) 
where means the output score for input of the class, and represents the activation element in the row and column of the feature map in the last convolutional layer. is a scaling factor that normalizes the importance into the range . With the feature map visualization techniques presented in Grad-CAM [42], we obtain the feature attribution in the networks. We sum the feature maps from different channels with the attention weight calculated in (3), and remove the influence from opposite pixels via the ReLU operation. The feature attribution in the last convolutional layer with respect to the class is formulated in the following:
(4)
The feature attribution only preserves the supportive features for the given class, and the negative features related to other classes are removed.
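The two steps described above (channel importance from gradients, then a ReLU over the weighted sum of feature maps) can be sketched in a few lines of NumPy. This is a minimal illustration in the spirit of Grad-CAM, not the authors' implementation: the function name `gradcam_attribution` is hypothetical, and the scaling factor is assumed to be a plain spatial average of the gradients.

```python
import numpy as np

def gradcam_attribution(feature_maps, gradients):
    """Grad-CAM style attribution for one class.

    feature_maps: array of shape (K, H, W), activations of the last conv layer.
    gradients:    array of shape (K, H, W), gradient of the class score with
                  respect to each activation element.
    Returns an (H, W) non-negative attribution map.
    """
    k = feature_maps.shape[0]
    # Channel importance: spatial average of the gradients (assumed scaling).
    alpha = gradients.reshape(k, -1).mean(axis=1)
    # Weighted sum of the feature maps over channels.
    cam = np.tensordot(alpha, feature_maps, axes=1)
    # ReLU keeps only features that support the class.
    return np.maximum(cam, 0.0)

# Toy example with two channels of size 2x2.
fm = np.array([[[1.0, 2.0], [3.0, 4.0]],
               [[1.0, 0.0], [0.0, 5.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
cam = gradcam_attribution(fm, grads)
```

In the toy example the second channel has negative importance, so pixels where it dominates are clipped to zero by the ReLU.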
Full-precision networks achieve high performance by paying more attention to important parts of the image, while the attribution of quantized models deviates from that of the full-precision networks due to the limited capacity. Figure 2 shows the attribution of networks with the optimal quantization policy at different complexities, where the attribution of networks with lower capacity is more concentrated due to the limited information they carry. As the network capacity gap between the quantized networks and their full-precision counterparts is huge, directly enforcing attribution consistency fails to remove the redundant attention in the compressed model, which causes capacity insufficiency with performance degradation. Therefore, we preserve the attribution rank consistency between the quantized networks and their full-precision counterparts for generalizable mixed-precision quantization policy search. The attribution rank describes the importance order of different pixels for model predictions. Constraining attribution rank consistency enables the quantized networks to focus on important regions, while adaptively adjusting the attribution distribution without capacity insufficiency.
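To make the distinction between attribution consistency and attribution rank consistency concrete, the sketch below compares a toy full-precision attribution map with a more concentrated one. The ranking convention (rank 1 for the largest attribution), the helper name `attribution_rank`, and the squared-and-renormalized "quantized" map are all assumptions chosen for illustration.

```python
import numpy as np

def attribution_rank(attr):
    """Rank of every pixel: 1 for the largest attribution, 2 for the next, ..."""
    flat = attr.ravel()
    order = np.argsort(-flat, kind="stable")   # indices from largest to smallest
    ranks = np.empty(flat.size, dtype=int)
    ranks[order] = np.arange(1, flat.size + 1)
    return ranks.reshape(attr.shape)

# Toy full-precision attribution map.
full = np.array([[0.1, 0.4],
                 [0.2, 0.3]])
# Hypothetical lower-capacity map: same ordering, but more concentrated.
quant = full ** 2 / (full ** 2).sum()

same_rank = np.array_equal(attribution_rank(full), attribution_rank(quant))
euclid = float(np.linalg.norm(full - quant))
```

Here `same_rank` is true while `euclid` is strictly positive: a Euclidean penalty would push the concentrated map back toward the full-precision one, whereas a rank constraint leaves it free to concentrate.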
3.3 Generalizable Mixed-Precision Quantization via Attribution Rank Preservation
Our GMPQ can be leveraged as a plug-and-play module for both non-differentiable and differentiable search methods. Since differentiable methods achieve a competitive accuracy-complexity trade-off compared with non-differentiable approaches, we employ the differentiable search framework [3, 56, 55] to select the optimal mixed-precision quantization policy. We design a hypernet with and parallel branches for convolution filters and feature maps in the layer. and represent the size of the search space for weight and activation bitwidths. The parallel branches are assigned various bitwidths, and their outputs are summed with the importance and for weight and activation respectively to form the intermediate feature maps. Figure 3 depicts the pipeline of our GMPQ. The feedforward propagation for each layer in the layer hypernet is written as follows:
(5) 
where means the output intermediate feature maps of the layer. represents the output of the activation quantization branch in the layer, and is the convolution operation in the filter branch of the layer. and stand for the importance weight for the quantized activation and filter branch in the layer.
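The branch mixing described above can be illustrated with a small NumPy sketch. Several simplifications are assumed here and are not from the text: a matrix product stands in for the convolution, a plain uniform quantizer stands in for the actual quantizers, and the branch importances are obtained by a softmax over learnable logits. All function names are hypothetical.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer over the range of x (illustrative stand-in)."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo) + lo

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def hypernet_layer(x, weight, act_bits, w_bits, act_logits, w_logits):
    """One hypernet layer: importance-weighted sum over quantized branches.

    x: input of shape (in,), weight: array of shape (out, in).
    act_bits / w_bits: candidate bitwidths for activations / weights.
    act_logits / w_logits: one learnable logit per branch (assumed softmax).
    """
    pi_a = softmax(np.asarray(act_logits, dtype=float))
    pi_w = softmax(np.asarray(w_logits, dtype=float))
    # Mix the quantized activation branches.
    x_mix = sum(p * quantize(x, b) for p, b in zip(pi_a, act_bits))
    # Mix the outputs of the quantized filter branches.
    return sum(p * (quantize(weight, b) @ x_mix) for p, b in zip(pi_w, w_bits))

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=(3, 4))
y = hypernet_layer(x, w, [2, 4, 8], [2, 4, 8], np.zeros(3), np.zeros(3))
```

With equal logits every branch contributes equally; as training sharpens the importances, the output approaches that of a single bitwidth per layer.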
As we observe that the attribution rank consistency between quantized networks and their full-precision counterparts enables the compressed models to possess the discriminative power of the vanilla model regardless of the data distribution, we impose the attribution rank consistency constraint in optimal quantization policy search in addition to the accuracy and efficiency objectives. In order to obtain the optimal accuracy-complexity trade-off for generalizable mixed-precision quantization, the learning objective is formulated in the Lagrangian form:
(6) 
where , and respectively mean the classification, complexity and generalization risks for the networks with weight and quantization policy for the input . and are the hyperparameters that balance the importance of the complexity risk and the generalization risk in the overall learning objective. In differentiable policy search, is represented by the objective of vision tasks, and is defined as the expected bit-operations (BOPs) [53, 1, 3]:
(7)
where and stand for the bitwidth of the branch of weights and activations in the layer, and means the BOPs of the layer in the full-precision network. represents the number of layers of the quantized model. As the attribution rank consistency between the full-precision networks and their quantized counterparts enhances the generalizability of the mixed-precision quantization policy, we define the generalization risk in the following form:
where represents the pixel attribution in the row and column of the feature maps with respect to the class in the quantized networks, and denotes the corresponding variable in full-precision models. means the label of the input , and is the element-wise norm. stands for the attribution rank, which is equal to if the element is the largest in the attribution map. We only preserve the attribution rank consistency for the top-k pixels with the highest attribution in the full-precision networks, as low attribution is usually caused by noise without clear information. Since minimizing the generalization risk is NP-hard, we present the capacity-aware attribution imitation to differentiably optimize the objective.
We enforce the attribution of the mixed-precision networks to approach the norm of that in full-precision models, because the norm preserves the rank consistency while adaptively selecting the attribution distribution according to the network capacity. The generalization risk is rewritten as follows for efficient optimization:
Large leads to concentrated attribution and vice versa, and we assign with a larger value for hypernets of lower capacity, with hyperparameters and for L-layer networks:
(8) 
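The idea of imitating a rank-preserving but more concentrated version of the full-precision attribution can be sketched as follows. The concrete choices are assumptions for illustration only: the target is the element-wise power of the (non-negative) full-precision attribution with exponent `p`, renormalized to keep the total mass, and the risk is a mean squared gap over the top-k full-precision pixels. The names `imitation_target` and `generalization_risk` are hypothetical, and the capacity-dependent rule for choosing `p` is not reproduced here.

```python
import numpy as np

def imitation_target(attr_full, p):
    """Element-wise p-th power of a non-negative attribution map,
    rescaled so that the total attribution mass is unchanged."""
    powered = np.power(attr_full, p)
    total = powered.sum()
    if total == 0:
        return powered
    return powered * (attr_full.sum() / total)

def generalization_risk(attr_quant, attr_full, p, k):
    """Mean squared gap to the target on the top-k full-precision pixels."""
    target = imitation_target(attr_full, p)
    topk = np.argsort(-attr_full.ravel())[:k]
    diff = attr_quant.ravel()[topk] - target.ravel()[topk]
    return float(np.mean(diff ** 2))

full = np.array([[0.1, 0.4],
                 [0.2, 0.3]])
target_p1 = imitation_target(full, 1.0)   # identical to the input
target_p3 = imitation_target(full, 3.0)   # same ordering, more concentrated
```

Because a power with a positive exponent is monotone on non-negative values, the target keeps the pixel ordering of the full-precision map, while a larger `p` shifts more mass onto the top pixels.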
Since the classification, complexity and generalization risks are all differentiable, we optimize the hypernet weight and the branch importance weight iteratively in an end-to-end manner. When the hypernet converges or reaches the maximum training epoch, the bitwidth represented by the branch with the largest importance weight is selected to form the final quantization policy. We finetune the quantized networks with the data in deployment to acquire the final model applied in realistic applications. GMPQ searches quantization policies on small datasets with the generalization constraint, which leads to high performance on large-scale datasets in deployment with significantly reduced search cost.
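The expected-BOPs complexity term and the final selection of the branch with the largest importance weight can be sketched as follows. This is a minimal illustration under stated assumptions: the expected bitwidth of a layer is the importance-weighted average of the candidate bitwidths, the full-precision network is assumed to use 32-bit weights and activations (`FP_BITS`), and the function names are hypothetical. The exact form of (7) may differ.

```python
import numpy as np

FP_BITS = 32  # assumed bitwidth of the full-precision reference

def expected_bops(pi_w, pi_a, w_bits, a_bits, fp_bops):
    """Expected BOPs of the hypernet.

    pi_w, pi_a: per-layer branch importances (each row sums to 1).
    w_bits, a_bits: candidate bitwidths for weights and activations.
    fp_bops: BOPs of each layer in the full-precision network.
    """
    total = 0.0
    for pw, pa, bops in zip(pi_w, pi_a, fp_bops):
        exp_w = float(np.dot(pw, w_bits))   # expected weight bitwidth
        exp_a = float(np.dot(pa, a_bits))   # expected activation bitwidth
        total += exp_w * exp_a / FP_BITS ** 2 * bops
    return total

def select_policy(pi_w, pi_a, w_bits, a_bits):
    """Pick, per layer, the bitwidths of the most important branches."""
    return [(w_bits[int(np.argmax(pw))], a_bits[int(np.argmax(pa))])
            for pw, pa in zip(pi_w, pi_a)]

bits = [2, 4, 8]
pi_w = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
pi_a = [[0.2, 0.2, 0.6], [0.5, 0.25, 0.25]]
fp_bops = [1024.0, 2048.0]

cost = expected_bops(pi_w, pi_a, bits, bits, fp_bops)
policy = select_policy(pi_w, pi_a, bits, bits)
```

Because the expected cost is a smooth function of the importances, it can be penalized during search, while the argmax is only applied once at the end to read off the discrete policy.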
4 Experiments
In this section, we conduct extensive experiments on image classification and object detection. We first introduce the implementation details of our GMPQ. In the following ablation study, we evaluate the influence of the value assignment strategy for in the capacity-aware attribution imitation, investigate the effects of different terms in the risk function and examine the impact of the dataset used for quantization policy search. Finally, we compare our method with the state-of-the-art mixed-precision networks on image classification and object detection with respect to accuracy, model complexity and search cost.
4.1 Datasets and Implementation Details
We first introduce the datasets that we carried out experiments on. For quantization policy search, we employed small datasets including CIFAR-10 [23], Cars [22], Flowers [34], Aircraft [32], Pets [35] and Food [2]. CIFAR-10 contains images divided into categories with an equal number of samples, and Flowers has 8,189 images spread over 102 flower categories. Cars includes images with types at the level of maker, model and year, and Aircraft contains images with samples for each of the aircraft model variants. Pets was created with dog and cat categories with images for each class, and Food contains high-resolution food photos of menu items from the restaurants.
For mixed-precision network deployment, we evaluated the quantized networks on ImageNet for image classification and on PASCAL VOC for object detection. ImageNet [5] approximately contains million and k images for training and validation from categories. For training, random region crops were applied from the resized image whose shorter side was . During the inference stage, we utilized the center crop. The PASCAL VOC dataset [11] collects images from categories, where we finetuned our mixed-precision networks on the VOC 2007 and VOC 2012 trainval sets containing about k images and tested our GMPQ on the VOC 2007 test set consisting of k samples. Following [11], we used the mean average precision (mAP) as the evaluation metric.
We trained our GMPQ with MobileNetV2 [41], ResNet18 and ResNet50 [15] architectures for image classification, and applied VGG16 [44] with the SSD framework [28] and ResNet18 with Faster R-CNN [40] for object detection. The bitwidth in the search space for network weights and activations is  bit for MobileNetV2 and  bit for other architectures. Inspired by [3], we utilized compositional convolution whose filters were a weighted sum of the quantized filters in different bitwidths, so that complex parallel convolution was avoided. We updated the importance weights of different branches and the network parameters simultaneously. The hyperparameters and in capacity-aware attribution imitation were set to and respectively. Meanwhile, we only minimized the distance between the attribution in quantized networks and the norm of that in the full-precision model for the top pixels with the highest attribution in the real-valued model. For evaluation on ImageNet, we finetuned the mixed-precision networks with the Adam [21] optimizer. The learning rate started from and decayed twice by multiplying at the and epoch out of the total epochs. For object detection, the backbone was pretrained on ImageNet and then finetuned on PASCAL VOC with the same hyperparameter settings as in image classification. The batch size was set to be in all experiments. By adjusting the hyperparameters and in (6), we obtained the mixed-precision networks at different accuracy-complexity trade-offs.
4.2 Ablation Study
In order to investigate the effectiveness of attribution rank preservation, we assign the value of in the capacity-aware attribution imitation with different strategies. By varying the hyperparameters and in the overall risk (6), we evaluated the influence of the classification, complexity and generalization risks on the model accuracy and efficiency. We conducted the ablation study on ImageNet with the ResNet18 architecture, and searched the mixed-precision quantization policy on CIFAR-10 for the above investigation. Moreover, we searched the generalizable mixed-precision quantization policy on different small datasets to examine the effects on the accuracy-complexity trade-off and search cost.
Effectiveness of different value assignment strategies for : To investigate the influence of value assignment strategies for on the accuracy-complexity trade-off, we searched the mixed-precision quantization policy with fixed and capacity-aware values. For fixed , we set the value as , , and that constrains the attribution of quantized networks with various concentrations. The capacity-aware strategy assigns with the strategy shown in (8), where the product of and was varied in the ablation study. Figures 5(a) and 5(b) show the accuracy-complexity trade-off for fixed and capacity-aware value assignment strategies for respectively with different hyperparameters. The optimal accuracy-complexity curve of the capacity-aware strategy outperforms that of the fixed strategy, which indicates the importance of attribution variation with respect to network capacity. For the fixed strategy, medium outperforms other values. Small causes attention redundancy for quantized networks with limited capacity and large leads to information loss that fails to utilize the network capacity. For the capacity-aware strategy, setting the product of and to results in the optimal accuracy-complexity trade-off. For hypernetworks whose product of weight and activation bitwidths is , the network capacity is comparable with their full-precision counterparts since they mimic the attribution of real-valued models without extra concentration.
Table 1. Columns: Methods, Param., BOPs, Comp., Top-1, Top-5, Cost.
ResNet18: Full-precision, ALQ, HAWQ, GMPQ, APoT, GMPQ, ALQ, EdMIPS, EdMIPS-C, GMPQ
ResNet50: Full-precision, HAWQ, HAQ, BP-NAS, GMPQ, HMQ, BP-NAS, GMPQ, EdMIPS, EdMIPS-C, GMPQ
MobileNetV2: Full-precision, RQ, GMPQ, HAQ, HAQ-C, DJPQ, GMPQ, HMQ, DQ, GMPQ
Influence of hyperparameters in overall risk (6): In order to verify the effectiveness of the generalization risk, we report the performance with different . Meanwhile, we also varied the hyperparameter to obtain different accuracy-complexity trade-offs. Figure 6(a) illustrates the results, where medium achieves the best trade-off curve. Large fails to leverage the supervision from annotated labels, and small ignores the attribution rank consistency which enhances the generalization ability of the mixed-precision quantization policy. With the increase of , the resulting policy prefers lightweight architectures and vice versa. For different , the same assignment of selects similar BOPs in the accuracy-complexity trade-off.
Effects of datasets for quantization policy search: We searched the mixed-precision quantization policy on different small datasets including CIFAR-10, Cars, Flowers, Aircraft, Pets and Food to examine the effects on model accuracy and efficiency. Figure 6(b) shows the top-1 accuracy and the BOPs for the optimal mixed-precision networks obtained on different small datasets. We also show the average search cost across all computation cost constraints in the legend, where GH means GPU hours, the unit of search cost. The mixed-precision networks searched on CIFAR-10 achieve the best accuracy-efficiency trade-off, because CIFAR-10 is the largest dataset with the most sufficient visual information. Moreover, the gap in object categories between CIFAR-10 and ImageNet is the smallest compared with other datasets. Searching the quantization policy on Aircraft requires the highest search cost due to the large image size .
Table 2. Columns: Methods, Param., BOPs, Comp., mAP, Cost.
SSD & VGG16: Full-precision, HAQ, HAQ-C, EdMIPS, EdMIPS-C, GMPQ, HAQ, HAQ-C, EdMIPS, EdMIPS-C, GMPQ
Faster R-CNN & ResNet18: Full-precision, HAQ, HAQ-C, EdMIPS, EdMIPS-C, GMPQ, HAQ, HAQ-C, EdMIPS, EdMIPS-C, GMPQ
4.3 Comparison with State-of-the-art Methods
In this section, we compare our GMPQ with the state-of-the-art fixed-precision models including APoT [26] and RQ [31] and mixed-precision networks including ALQ [38], HAWQ [9], EdMIPS [3], HAQ [50], BP-NAS [56], HMQ [13] and DQ [47] on ImageNet for image classification and on PASCAL VOC for object detection. We also provide the performance of full-precision models for reference. The accuracy-complexity trade-offs of baselines are copied from their original papers or obtained by our implementation with the officially released code, and the search cost was evaluated by rerunning the compared methods. We searched the optimal quantization policy on CIFAR-10 for deployment on ImageNet and PASCAL VOC.
Results on ImageNet: Table 1 compares the storage and computational cost, the compression ratio of BOPs, the top-1 and top-5 accuracy and the search cost across different architectures and mixed-precision quantization methods. HAQ-C and EdMIPS-C denote that we leveraged HAQ and EdMIPS to search the quantization policy on CIFAR-10 and evaluated the obtained quantization policy on ImageNet. By comparing the accuracy-complexity trade-off with the baseline methods for different architectures, we conclude that our GMPQ achieves a competitive accuracy-complexity trade-off under various resource constraints with significantly reduced search cost. Meanwhile, we also searched the quantization policy on CIFAR-10 directly using HAQ and EdMIPS. Although the search cost is reduced sizably, the accuracy-complexity trade-off is far from optimal across various resource constraints, which indicates the lack of generalization ability of the quantization policy obtained by conventional methods. Our GMPQ preserves the attribution rank consistency during the quantization policy search with acceptable computational overhead, and enables the mixed-precision quantization policy searched on small datasets to generalize to large-scale datasets. For the mixed-precision quantization method EdMIPS, the search cost reduction is more obvious for ResNet50 since the heavy architecture requires more training epochs to converge when trained on large-scale datasets.
Results on PASCAL VOC: We employed the SSD detection framework with the VGG16 architecture and the Faster R-CNN detector with the ResNet18 backbone to evaluate our GMPQ on object detection. Table 2 shows the results of various mixed-precision networks. Compared with the accuracy-complexity trade-off searched on PASCAL VOC by the state-of-the-art methods, our GMPQ acquired competitive results with significantly reduced search cost on both detection frameworks and backbones. Moreover, directly compressing the networks with the quantization policy searched by HAQ and EdMIPS on CIFAR-10 degrades the performance significantly. Since the mixed-precision networks are required to be pretrained on ImageNet, the search cost decrease on PASCAL VOC is more sizable than that on ImageNet. Because the two-stage detector Faster R-CNN has stronger discriminative power for accurate attribution generation, its accuracy-complexity trade-off is better than that of the one-stage detector.
5 Conclusion
In this paper, we have proposed a generalizable mixed-precision quantization method called GMPQ for efficient inference. The presented GMPQ searches the quantization policy on small datasets with attribution rank preservation, so that the acquired quantization strategy can be generalized to achieve the optimal accuracy-complexity trade-off on large-scale datasets with significant search cost reduction. Extensive experiments demonstrate the superiority of GMPQ compared with the state-of-the-art methods.
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, and Grant U1713214, in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI), and in part by a grant from the Institute for Guo Qiang, Tsinghua University.
References
 [1] (2020) MeliusNet: can binary neural networks achieve mobilenetlevel accuracy?. arXiv preprint arXiv:2001.05936. Cited by: §3.3.

[2]
(2014)
Food101–mining discriminative components with random forests
. In ECCV, pp. 446–461. Cited by: §4.1.  [3] (2020) Rethinking differentiable search for mixedprecision neural networks. In CVPR, pp. 2349–2358. Cited by: §1, §2, §3.3, §3.3, §4.1, §4.3.
 [4] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to+ 1 or1. arXiv preprint arXiv:1602.02830. Cited by: §2.
 [5] (2009) Imagenet: a largescale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §4.1.
 [6] (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR, pp. 4690–4699. Cited by: §1.
 [7] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. arXiv preprint arXiv:1404.0736. Cited by: §1.
 [8] (2019) Hawqv2: hessian aware traceweighted quantization of neural networks. arXiv preprint arXiv:1911.03852. Cited by: §2.
 [9] (2019) Hawq: hessian aware quantization of neural networks with mixedprecision. In ICCV, pp. 293–302. Cited by: §1, §2, §4.3.
 [10] (2009) Visualizing higherlayer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §2.
 [11] (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §4.1.
 [12] (2019) Differentiable soft quantization: bridging fullprecision and lowbit neural networks. In ICCV, pp. 4852–4861. Cited by: §2.
 [13] (2020) HMQ: hardware friendly mixed precision quantization block for cnns. arXiv preprint arXiv:2007.09952. Cited by: §2, §4.3.
 [14] (2017) Mask rcnn. In ICCV, pp. 2961–2969. Cited by: §1.
 [15] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §4.1.
 [16] (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397. Cited by: §1.

 [17] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
 [18] (2017) Densely connected convolutional networks. In CVPR, pp. 4700–4708. Cited by: §1.
 [19] (2016) Binarized neural networks. In NIPS, pp. 4114–4122. Cited by: §2.
 [20] (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360. Cited by: §1.
 [21] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
 [22] (2013) 3d object representations for fine-grained categorization. In ICCVW, pp. 554–561. Cited by: §4.1.
 [23] (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.
 [24] (2019) Fully quantized network for object detection. In CVPR, pp. 2810–2819. Cited by: §2.
 [25] (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In CVPR, pp. 8018–8027. Cited by: §1.
 [26] (2020) Additive powers-of-two quantization: a non-uniform discretization for neural networks. ICLR. Cited by: §4.3.
 [27] (2017) Runtime neural pruning. In NIPS, pp. 2178–2188. Cited by: §1.
 [28] (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §1, §4.1.
 [29] (2017) Sphereface: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §1.
 [30] (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, pp. 722–737. Cited by: §1, §2.
 [31] (2018) Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875. Cited by: §4.3.
 [32] (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: §4.1.
 [33] (2019) Importance estimation for neural network pruning. In CVPR, pp. 11264–11272. Cited by: §1.
 [34] (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §4.1.
 [35] (2012) Cats and dogs. In CVPR, pp. 3498–3505. Cited by: §4.1.
 [36] (2020) Forward and backward information retention for accurate binary neural networks. In CVPR, pp. 2250–2259. Cited by: §2.
 [37] (2019) Thundernet: towards real-time generic object detection on mobile devices. In ICCV, pp. 6718–6727. Cited by: §1.
 [38] (2020) Adaptive loss-aware quantization for multi-bit networks. In CVPR, pp. 7988–7997. Cited by: §4.3.
 [39] (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, pp. 525–542. Cited by: §2.
 [40] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §1, §4.1.
 [41] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520. Cited by: §4.1.
 [42] (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626. Cited by: §1, §2, §3.2.
 [43] (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §2.
 [44] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §4.1.
 [45] (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2.
 [46] (2017) Axiomatic attribution for deep networks. In ICML, pp. 3319–3328. Cited by: §2.
 [47] (2019) Differentiable quantization of deep neural networks. arXiv preprint arXiv:1905.11452. Cited by: §4.3.
 [48] (2020) Bayesian bits: unifying quantization and pruning. arXiv preprint arXiv:2005.07093. Cited by: §2.
 [49] (2018) Cosface: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274. Cited by: §1.
 [50] (2019) Haq: hardware-aware automated quantization with mixed precision. In CVPR, pp. 8612–8620. Cited by: §1, §2, §4.3.
 [51] (2020) Towards accurate post-training network quantization via bit-split and stitching. In ICML, pp. 9847–9856. Cited by: §1.
 [52] (2020) Apq: joint search for network architecture, pruning and quantization policy. In CVPR, pp. 2078–2087. Cited by: §2.
 [53] (2020) Differentiable joint pruning and quantization for hardware efficiency. In ECCV, pp. 259–277. Cited by: §2, §3.3.
 [54] (2019) Learning channel-wise interactions for binary convolutional neural networks. In CVPR, pp. 568–577. Cited by: §2.
 [55] (2020) Automatic neural network compression by sparsity-quantization joint learning: a constrained optimization-based approach. In CVPR, pp. 2178–2188. Cited by: §2, §3.3.
 [56] (2020) Search what you want: barrier panelty nas for mixed precision quantization. In ECCV, pp. 1–16. Cited by: §2, §3.3, §4.3.
 [57] (2017) On compressing deep models by low rank and sparse decomposition. In CVPR, pp. 7370–7379. Cited by: §1.
 [58] (2018) Top-down neural attention by excitation backprop. IJCV 126 (10), pp. 1084–1102. Cited by: §2.
 [59] (2019) Improving neural network quantization without retraining using outlier channel splitting. In ICML, pp. 7543–7552. Cited by: §1.

 [60] (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929. Cited by: §2.
 [61] (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.