ZeroQ
Quantization is a promising approach for reducing the inference time and memory footprint of neural networks. However, most existing quantization methods require access to the original training dataset for retraining during quantization. This is often not possible for applications with sensitive or proprietary data, e.g., due to privacy and security concerns. Existing zero-shot quantization methods use different heuristics to address this, but they result in poor performance, especially when quantizing to ultra-low precision. Here, we propose ZeroQ, a novel zero-shot quantization framework to address this. ZeroQ enables mixed-precision quantization without any access to the training or validation data. This is achieved by optimizing for a Distilled Dataset, which is engineered to match the statistics of batch normalization across different layers of the network. ZeroQ supports both uniform and mixed-precision quantization. For the latter, we introduce a novel Pareto frontier based method to automatically determine the mixed-precision bit setting for all layers, with no manual search involved. We extensively test our proposed method on a diverse set of models, including ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3 on ImageNet, as well as RetinaNet-ResNet50 on the Microsoft COCO dataset. In particular, we show that ZeroQ can achieve 1.71% higher accuracy on MobileNetV2, as compared to the recently proposed DFQ method. Importantly, ZeroQ has a very low computational overhead, and it can finish the entire quantization process in less than 30s (0.5% of one epoch training time of ResNet50 on ImageNet). We have open-sourced the ZeroQ framework.
Despite the great success of deep Neural Network (NN) models in various domains, the deployment of modern NN models at the edge has been challenging due to their prohibitive memory footprint, inference time, and/or energy consumption. With the current hardware support for low-precision computations, quantization has become a popular procedure to address these challenges. By quantizing the floating point values of weights and/or activations in a NN to integers, the model size can be shrunk significantly, without any modification to the architecture. This also allows one to use reduced-precision Arithmetic Logic Units (ALUs) which are faster and more power-efficient, as compared to floating point ALUs. More importantly, quantization reduces memory traffic volume, which is a significant source of energy consumption [14].
However, quantizing a model from single precision to low precision often results in significant accuracy degradation. One way to alleviate this is to perform the so-called quantization-aware fine-tuning [46, 34, 4, 42, 45, 18] to reduce the performance gap between the original model and the quantized model. Basically, this is a retraining procedure that is performed for a few epochs to adjust the NN parameters to reduce accuracy drop. However, quantization-aware fine-tuning can be computationally expensive and time-consuming. For example, in online learning situations, where a model needs to be constantly updated on new data and deployed every few hours, there may not be enough time for the fine-tuning procedure to finish. More importantly, in many real-world scenarios, the training dataset is sensitive or proprietary, meaning that it is not possible to access the dataset that was used to train the model. Good examples are medical data, biometric data, or user data used in recommendation systems.
To address this, recent work has proposed post-training quantization [19, 32, 44, 2], which directly quantizes NN models without fine-tuning. However, as mentioned above, these methods result in non-trivial performance degradation, especially for low-precision quantization. Furthermore, previous post-training quantization methods usually require limited (unlabeled) data to assist the post-training quantization. However, for cases such as MLaaS (e.g., Amazon AWS and Google Cloud), it may not be possible to access any of the training data from users. An example application case is health care information which cannot be uploaded to the cloud due to various privacy issues and/or regulatory constraints. Another shortcoming is that post-quantization methods [30, 44, 2] often only focus on standard NNs such as ResNet [12] and InceptionV3 [38] for image classification, and they do not consider more demanding tasks such as object detection.
In this work, we propose ZeroQ, a novel zero-shot quantization scheme to overcome the issues mentioned above. In particular, ZeroQ allows quantization of NN models without any access to training/validation data. It uses a novel approach to automatically compute a mixed-precision configuration without any expensive search. In particular, our contributions are as follows.
We propose an optimization formulation to generate Distilled Data, i.e., synthetic data engineered to match the statistics of batch normalization layers. This reconstruction has a small computational overhead. For example, it only takes 3s (0.05% of one epoch training time) to generate 32 images for ResNet50 on ImageNet on an 8-V100 system.
We use the above reconstruction framework to perform sensitivity analysis between the quantized and the original model. We show that the Distilled Data matches the sensitivity of the original training data (see Figure 1 and Table IV for details). We then use the Distilled Data, instead of original/real data, to perform post-training quantization. The entire sensitivity computation here only costs 12s (0.2% of one epoch training time) in total for ResNet50. Importantly, we never use any training/validation data for the entire process.
Our framework supports both uniform and mixed-precision quantization. For the latter, we propose a novel automatic precision selection method based on a Pareto frontier optimization (see Figure 4 for illustration). This is achieved by computing the quantization sensitivity based on the Distilled Data with small computational overhead. For example, we are able to determine automatically the mixed-precision setting in under 14s for ResNet50.
We extensively test our proposed ZeroQ framework on a wide range of NNs for image classification and object detection tasks, achieving state-of-the-art quantization results in all tests. In particular, we present quantization results for both standard models (e.g., ResNet18/50/152 and InceptionV3) and efficient/compact models (e.g., MobileNetV2, ShuffleNet, and SqueezeNext) for the image classification task. Importantly, we also test ZeroQ for object detection on the Microsoft COCO dataset [28] with RetinaNet [27]. Among other things, we show that ZeroQ achieves 1.71% higher accuracy on MobileNetV2 as compared to the recently proposed DFQ [32] method.
Here we provide a brief (and by no means extensive) review of the related work in the literature. There is a wide range of methods besides quantization which have been proposed to address the prohibitive memory footprint and inference latency/power of modern NN architectures. These methods are typically orthogonal to quantization, and they include efficient neural architecture design [17, 8, 15, 36, 43], knowledge distillation [13, 35], model pruning [10, 29, 24], and hardware and NN co-design [8, 21]. Here we focus on quantization [1, 5, 34, 41, 23, 48, 45, 46, 4, 7, 42], which compresses the model by reducing the bit precision used to represent parameters and/or activations. An important challenge with quantization is that it can lead to significant performance degradation, especially in ultra-low bit precision settings. To address this, existing methods propose quantization-aware fine-tuning to recover lost performance [20, 18, 3]. Importantly, this requires access to the full dataset that was used to train the original model. Not only can this be very time-consuming, but often access to training data is not possible.
To address this, several papers focused on developing post-training quantization methods (also referred to as post-quantization), without any fine-tuning/training. In particular, [19] proposes the OMSE method to optimize the $L_2$ distance between the quantized tensor and the original tensor. Moreover, [2] proposed the so-called ACIQ method to analytically compute the clipping range, as well as the per-channel bit allocation for NNs, and it achieves relatively good testing performance. However, they use per-channel quantization for activations, which is difficult for efficient hardware implementation in practice. In addition, [44] proposes an outlier channel splitting (OCS) method to solve the outlier channel problem. However, these methods require access to limited data to reduce the performance drop [19, 2, 44, 30, 22].
The recent work of [32] proposed Data Free Quantization (DFQ). It further pushes post-quantization to zero-shot scenarios, where neither training nor testing data are accessible during quantization. The work of [32] uses a weight equalization scheme [30] to remove outliers in both weights and activations, and they achieve similar results with layer-wise quantization, as compared to previous post-quantization work with channel-wise quantization [20]. However, the performance of [32] degrades significantly when NNs are quantized to 6-bit or lower.
A recent concurrent paper to ours independently proposed to use Batch Normalization statistics to reconstruct input data [11]. They propose a knowledge-distillation based method to boost the accuracy further, by generating input data that is similar to the original training dataset, using the so-called Inceptionism [31]. However, it is not clear how the latter approach can be used for tasks such as object detection or image segmentation. Furthermore, this knowledge-distillation process adds to the computational time required for zero-shot quantization. As we will show in our work, it is possible to use batch norm statistics combined with mixed-precision quantization to achieve state-of-the-art accuracy, and importantly this approach is not limited to the image classification task. In particular, we will present results on object detection using RetinaNet-ResNet50, besides testing ZeroQ on a wide range of models for image classification (using ResNet18/50/152, MobileNetV2, ShuffleNet, SqueezeNext, and InceptionV3). We show that for all of these cases ZeroQ exceeds state-of-the-art quantization performance. Importantly, our approach has a very small computational overhead. For example, we can finish ResNet50 quantization in under 30 seconds on an 8-V100 system (corresponding to 0.5% of one epoch training time of ResNet50 on ImageNet).
Directly quantizing all NN layers to low precision can lead to significant accuracy degradation. A promising approach to address this is to perform mixed-precision quantization [7, 6, 40, 47, 39], where different bit-precision is used for different layers. The key idea behind mixed-precision quantization is that not all layers of a convolutional network are equally “sensitive” to quantization. A naïve mixed-precision quantization method can be computationally expensive, as the search space for determining the precision of each layer is exponential in the number of layers. To address this, [39] uses a NAS/RL-based search algorithm to explore the configuration space. However, these searching methods can be expensive and are often sensitive to the hyper-parameters and the initialization of the RL-based algorithm. Alternatively, the recent work of [7, 37, 6] introduces a Hessian-based method, where the bit precision setting is based on the second-order sensitivity of each layer. However, this approach does require access to the original training set, a limitation which we address in ZeroQ.
For a typical supervised computer vision task, we seek to minimize the empirical risk loss, i.e.,
$$\min_{\theta} \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} f\big(\mathcal{M}(\theta; x_i), y_i\big), \tag{1}$$
where $\theta$ is the learnable parameter, $f(\cdot, \cdot)$ is the loss function (typically cross-entropy loss), $(x_i, y_i)$ is the training input/label pair, $\mathcal{M}$ is the NN model with $L$ layers, and $N$ is the total number of training data points. Here, we assume that the input data goes through standard preprocessing normalization of zero mean ($\mu_0 = 0$) and unit variance ($\sigma_0 = 1$). Moreover, we assume that the model has $L$ BN layers denoted as $BN_1$, $BN_2$, …, $BN_L$. We denote the activations before the $i$-th BN layer with $z_i$ (in other words, $z_i$ is the output of the $i$-th convolutional layer). During inference, $z_i$ is normalized by the running mean ($\mu_i$) and variance ($\sigma_i^2$) of parameters in the $i$-th BN layer ($BN_i$), which is precomputed during the training process. Typically BN layers also include scaling and bias correction, which we denote as $\gamma_i$ and $\beta_i$, respectively.
We assume that before quantization, all the NN parameters and activations are stored in 32-bit precision and that we have no access to the training/validation datasets. To quantize a tensor (either weights or activations), we clip the parameters to a range of $[a, b]$ ($a, b \in \mathbb{R}$), and we uniformly discretize the space to $2^k - 1$ even intervals using asymmetric quantization. That is, the length of each interval will be $\Delta = \frac{b - a}{2^k - 1}$. As a result, the original 32-bit single-precision values are mapped to unsigned integers within the range of $[0, 2^k - 1]$. Some work has proposed non-uniform quantization schemes which can capture finer details of weight/activation distribution [33, 9, 42]. However, we only use asymmetric uniform quantization, as the non-uniform methods are typically not suitable for efficient hardware execution.
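The clip-and-discretize step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released ZeroQ code: the function names `quantize` and `dequantize` are hypothetical, and the clipping range defaults to the min/max of the tensor when no range is supplied.

```python
def quantize(values, k, a=None, b=None):
    """Asymmetric uniform k-bit quantization of a list of floats.

    Returns (codes, delta, a): unsigned integer codes in [0, 2**k - 1],
    the interval length delta, and the lower clipping bound a.
    Using min/max as the default range is an assumption of this sketch.
    """
    a = min(values) if a is None else a
    b = max(values) if b is None else b
    levels = 2 ** k - 1                       # number of even intervals
    delta = (b - a) / levels if b > a else 1.0
    codes = []
    for v in values:
        clipped = min(max(v, a), b)           # clip to [a, b]
        codes.append(int(round((clipped - a) / delta)))
    return codes, delta, a


def dequantize(codes, delta, a):
    """Map unsigned integer codes back to floating point values."""
    return [a + q * delta for q in codes]


weights = [-1.0, -0.2, 0.0, 0.5, 1.0]
codes, delta, low = quantize(weights, k=4)
restored = dequantize(codes, delta, low)
```

For values inside the clipping range, the reconstruction error of each element is at most half an interval, which is why fewer bits (larger intervals) perturb the model more.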
The ZeroQ framework supports both fixed-precision and mixed-precision quantization. In the latter scheme, different layers of the model could have different bit precisions (different $k$). The main idea behind mixed-precision quantization is to keep more sensitive layers at higher precision, and more aggressively quantize less sensitive layers, without increasing overall model size. As we will show later, this mixed-precision quantization is key to achieving high accuracy for ultra-low precision settings such as 4-bit quantization. Typical choices for $k$ for each layer are $\{2, 4, 8\}$ bit. Note that this mixed-precision quantization leads to an exponentially large search space, as every layer could have one of these bit precision settings. It is possible to avoid this prohibitive search space if we could measure the sensitivity of the model to the quantization of each layer [7, 37, 6]. For the case of post-training quantization (i.e., without fine-tuning), a good sensitivity metric is to use Kullback–Leibler (KL) divergence between the original model and the quantized model, defined as:
$$\Omega_i(k) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{KL}\big(\mathcal{M}(\theta; x_j),\ \mathcal{M}(\tilde{\theta}_i(k\text{-bit}); x_j)\big), \tag{2}$$
where $\Omega_i(k)$ measures how sensitive the $i$-th layer is when quantized to $k$-bit, and $\tilde{\theta}_i(k\text{-bit})$ refers to quantized model parameters in the $i$-th layer with $k$-bit precision. If $\Omega_i(k)$ is small, the output of the quantized model will not significantly deviate from the output of the full precision model when quantizing the $i$-th layer to $k$ bits, and thus the $i$-th layer is relatively insensitive to $k$-bit quantization, and vice versa. This process is schematically shown in Figure 1 for ResNet18. However, an important problem is that for zero-shot quantization we do not have access to the original training dataset in Eq. 2. We address this by “distilling” a synthetic input data to match the statistics of the original training dataset, which we refer to as Distilled Data. We obtain the Distilled Data by solely analyzing the trained model itself, as described below.
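The sensitivity metric of Eq. 2 can be illustrated with a small, dependency-free sketch. It assumes that each model is a Python callable returning a list of logits, and that the KL divergence is taken between softmax outputs; the helper names (`softmax`, `kl_divergence`, `layer_sensitivity`) and the toy linear models are hypothetical and only serve the illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def layer_sensitivity(full_model, quantized_model, inputs):
    """Average KL between the outputs of the full-precision model and a copy
    with one layer quantized (Eq. 2 for a single layer/bit-width pair)."""
    total = 0.0
    for x in inputs:
        total += kl_divergence(softmax(full_model(x)), softmax(quantized_model(x)))
    return total / len(inputs)


def toy_model(weights):
    """Hypothetical one-layer 'network': logits are weight * input."""
    return lambda x: [w * x for w in weights]


inputs = [0.5, 1.0, 2.0]
full = toy_model([0.37, -1.42, 0.88])
coarse = toy_model([0.11, -1.42, 0.88])   # stand-in for a low-bit layer
omega = layer_sensitivity(full, coarse, inputs)
```

In the zero-shot setting, `inputs` would be the Distilled Data rather than training samples.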
For zero-shot quantization, we do not have access to any of the training/validation data. This poses two challenges. First, we need to know the range of values for activations of each layer so that we can clip the range for quantization (the $[a, b]$ range mentioned above). However, we cannot determine this range without access to the training dataset. This is a problem for both uniform and mixed-precision quantization. Second, for mixed-precision quantization, we need to compute $\Omega_i(k)$ in Eq. 2, but we do not have access to training data $x_j$. A very naïve method to address these challenges is to create random input data drawn from a Gaussian distribution with zero mean and unit variance and feed it into the model. However, this approach cannot capture the correct statistics of the activation data corresponding to the original training dataset. This is illustrated in Figure 2 (left), where we plot the sensitivity of each layer of ResNet50 on ImageNet measured with the original training dataset (shown in black) and Gaussian based input data (shown in red). As one can see, the Gaussian data clearly does not capture the correct sensitivity of the model. For instance, for the first three layers, the sensitivity order of the red line is actually the opposite of the original training data.
To address this problem, we propose a novel method to “distill” input data from the NN model itself, i.e., to generate synthetic data carefully engineered based on the properties of the NN. In particular, we solve a distillation optimization problem, in order to learn an input data distribution that best matches the statistics encoded in the BN layers of the model. In more detail, we solve the following optimization problem:
$$\min_{x^r} \sum_{i=0}^{L} \left\| \tilde{\mu}_i^r - \mu_i \right\|_2^2 + \left\| \tilde{\sigma}_i^r - \sigma_i \right\|_2^2, \tag{3}$$
where $x^r$ is the reconstructed (distilled) input data, $\tilde{\mu}_i^r$/$\tilde{\sigma}_i^r$ are the mean/standard deviation of the Distilled Data’s distribution at layer $i$, and $\mu_i$/$\sigma_i$ are the corresponding mean/standard deviation parameters stored in the BN layer at layer $i$. In other words, after solving this optimization problem, we can distill an input data which, when fed into the network, can have a statistical distribution that closely matches the original model. Please see Algorithm 1 for a description. This Distilled Data can then be used to address the two challenges described earlier. First, we can use the Distilled Data’s activation range to determine quantization clipping parameters (the $[a, b]$ range mentioned above). Note that some prior work [2, 22, 44] address this by using limited (unlabeled) data to determine the activation range. However, this contradicts the assumptions of zero-shot quantization, and may not be applicable for certain applications. Second, we can use the Distilled Data and feed it in Eq. 2 to determine the quantization sensitivity ($\Omega_i(k)$). The latter is plotted for ResNet50 in Figure 2 (left) shown in solid blue color. As one can see, the Distilled Data closely matches the sensitivity of the model as compared to using Gaussian input data (shown in red). We show a visualization of the random Gaussian data as well as the Distilled Data for ResNet50 in Figure 3. We can see that the Distilled Data can capture fine-grained local structures.
As mentioned before, the main challenge for mixed-precision quantization is to determine the exact bit precision configuration for the entire NN. For an $L$-layer model with $m$ possible precision options, the mixed-precision search space, denoted as $\mathcal{S}$, has an exponential size of $m^L$. For example, for ResNet50 with just three bit precisions of $\{2, 4, 8\}$ (i.e., $m = 3$), the search space contains $7.2 \times 10^{23}$ configurations. However, we can use the sensitivity metric in Eq. 2 to reduce this search space. The main idea is to use higher bit precision for layers that are more sensitive, and lower bit precision for layers that are less sensitive. This gives us a relative ordering on the number of bits.
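Returning to the distillation objective in Eq. 3, the idea can be sketched on a toy problem. The example below is an illustration under strong simplifying assumptions and not the procedure of Algorithm 1: the "network" is a hypothetical chain of scalar affine layers followed by ReLU, the stored BN statistics are made-up numbers, and gradients are estimated by finite differences with a step that is only accepted when the loss decreases (a real implementation would backpropagate through the actual model).

```python
import random


def mean_std(values):
    """Mean and population standard deviation of a list of floats."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var ** 0.5


def bn_matching_loss(x, layers):
    """Eq. 3 for a toy chain of scalar layers.

    layers: list of (weight, bias, bn_mean, bn_std) tuples; this format is
    an assumption of the sketch, not an interface from the paper.
    """
    mu, sd = mean_std(x)
    loss = (mu - 0.0) ** 2 + (sd - 1.0) ** 2     # layer 0: normalized input
    z = x
    for w, b, bn_mu, bn_sd in layers:
        z = [w * v + b for v in z]               # pre-BN activations z_i
        mu, sd = mean_std(z)
        loss += (mu - bn_mu) ** 2 + (sd - bn_sd) ** 2
        z = [max(v, 0.0) for v in z]             # ReLU before the next layer
    return loss


def distill(layers, batch_size=16, steps=300, lr=0.5, eps=1e-5, seed=0):
    """Start from Gaussian noise and descend on the BN-matching loss."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
    loss = bn_matching_loss(x, layers)
    initial = loss
    for _ in range(steps):
        grad = []
        for j in range(batch_size):              # finite-difference gradient
            bumped = x[:j] + [x[j] + eps] + x[j + 1:]
            grad.append((bn_matching_loss(bumped, layers) - loss) / eps)
        candidate = [v - lr * g for v, g in zip(x, grad)]
        cand_loss = bn_matching_loss(candidate, layers)
        if cand_loss < loss:                     # accept only improving steps
            x, loss = candidate, cand_loss
        else:
            lr *= 0.5                            # otherwise shrink the step
    return x, initial, loss


toy_layers = [(2.0, 1.0, 1.5, 1.0), (-1.0, 0.5, 0.2, 0.8)]
distilled, loss_before, loss_after = distill(toy_layers)
```

The returned batch plays the role of the Distilled Data: it can be fed through the model to read off activation ranges and to evaluate the sensitivity in Eq. 2.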
To compute the precise bit precision setting, we propose a Pareto frontier approach similar to the method used in [6].
The Pareto frontier method works as follows. For a target quantized model size of $S_{target}$, we measure the overall sensitivity of the model for each bit precision configuration that results in the $S_{target}$ model size. We choose the bit-precision setting that corresponds to the minimum overall sensitivity. In more detail, we solve the following optimization problem:
$$\min_{\{k_i\}_{i=1}^{L}} \Omega_{sum} = \sum_{i=1}^{L} \Omega_i(k_i) \quad \text{s.t.} \quad \sum_{i=1}^{L} P_i \cdot k_i \le S_{target}, \tag{4}$$
where $k_i$ is the quantization precision of the $i$-th layer, and $P_i$ is the parameter size for the $i$-th layer. Note that here we make the simplifying assumption that the sensitivity of different layers are independent of the choice of bits for other layers (hence $\Omega_i$ only depends on the bit precision for the $i$-th layer). (Please see Section A, where we describe how we relax this assumption without having to perform an exponentially large computation for the sensitivity for each bit precision setting.) Using a dynamic programming method we can solve the best setting with different $S_{target}$ together, and then we plot the Pareto frontier. An example is shown in Figure 4 for the ResNet50 model, where the x-axis is the model size for each bit precision configuration, and the y-axis is the overall model perturbation/sensitivity. Each blue dot in the figure represents a mixed-precision configuration. In ZeroQ, we choose the bit precision setting that has the smallest perturbation with a specific model size constraint.
Importantly, note that the computational overhead of computing the Pareto frontier is $\mathcal{O}(mL)$. This is because we compute the sensitivity of each layer separately from other layers. That is, we compute sensitivity ($\Omega_i$) with respect to all $m$ different precision options, which leads to the $\mathcal{O}(mL)$ computational complexity. We should note that this Pareto frontier approach (including the dynamic programming optimizer) is not theoretically guaranteed to result in the best possible configuration, out of all possibilities in the exponentially large search space. However, our results show that the final mixed-precision configuration achieves state-of-the-art accuracy with small performance loss, as compared to the original model in single precision.
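Under the layer-independence assumption, Eq. 4 is a small knapsack-style problem, and one possible dynamic program for it is sketched below. This is an illustrative formulation, not necessarily the optimizer used in the released code: the function names, the dictionary-based state table, and the toy parameter counts and sensitivities are all assumptions.

```python
def pareto_frontier(param_counts, sensitivity, bit_options=(2, 4, 8)):
    """Return Pareto-optimal (size_in_bits, total_sensitivity, bits_per_layer).

    param_counts[i] is P_i; sensitivity[i][k] is Omega_i(k), measured once
    per layer and bit-width (m * L measurements in total).
    """
    # states: model size in bits -> (lowest total sensitivity, bit assignment)
    states = {0: (0.0, ())}
    for layer, count in enumerate(param_counts):
        next_states = {}
        for size, (omega, bits) in states.items():
            for k in bit_options:
                new_size = size + count * k
                new_omega = omega + sensitivity[layer][k]
                best = next_states.get(new_size)
                if best is None or new_omega < best[0]:
                    next_states[new_size] = (new_omega, bits + (k,))
        states = next_states
    # Keep a point only if every smaller model has strictly higher sensitivity.
    frontier, best_omega = [], float("inf")
    for size in sorted(states):
        omega, bits = states[size]
        if omega < best_omega:
            frontier.append((size, omega, bits))
            best_omega = omega
    return frontier


def best_config(frontier, size_budget_bits):
    """Lowest-sensitivity frontier point that fits the size budget, or None."""
    feasible = [point for point in frontier if point[0] <= size_budget_bits]
    return min(feasible, key=lambda point: point[1]) if feasible else None


# Toy three-layer example with made-up numbers.
counts = [100, 200, 50]
omegas = [{2: 0.9, 4: 0.3, 8: 0.0},
          {2: 0.2, 4: 0.05, 8: 0.0},
          {2: 2.0, 4: 0.5, 8: 0.0}]
frontier = pareto_frontier(counts, omegas)
choice = best_config(frontier, size_budget_bits=4 * sum(counts))
```

Because the frontier is computed once, configurations for several size budgets can be read off the same list, which mirrors solving for different $S_{target}$ together.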


In this section, we extensively test ZeroQ on a wide range of models and datasets. We first start by discussing the zero-shot quantization of ResNet18/50, MobileNetV2, and ShuffleNet on ImageNet in Section IV-A. Additional results for quantizing ResNet152, InceptionV3, and SqueezeNext on ImageNet, as well as ResNet20 on CIFAR-10, are provided in Appendix C. We also present results for object detection using RetinaNet tested on the Microsoft COCO dataset in Section IV-B. We emphasize that all of the results achieved by ZeroQ are 100% zero-shot without any need for fine-tuning.
We also emphasize that we used exactly the same hyperparameters (e.g., the number of iterations to generate Distilled Data) for all experiments, including the results on Microsoft COCO dataset.
We start by discussing the results on the ImageNet dataset. For each model, after generating Distilled Data based on Eq. 3, we compute the sensitivity of each layer using Eq. 2 for different bit precisions. Next, we use Eq. 4 and the Pareto frontier introduced in Section III-B to get the best bit-precision configuration based on the overall sensitivity for a given model size constraint. We denote the quantized results as WwAh, where w and h denote the bit precision used for weights and activations of the NN model.
We present zero-shot quantization results for ResNet50 in Table I(a). As one can see, for W8A8 (i.e., 8-bit quantization for both weights and activations), ZeroQ results in only 0.05% accuracy degradation. Further quantizing the model to W6A6, ZeroQ achieves 77.43% accuracy, which is 2.63% higher than OCS [44], even though our model is slightly smaller (18.27MB as compared to 18.46MB for OCS). (Importantly, note that OCS requires access to the training data, while ZeroQ does not use any training/validation data.) We show that we can further quantize ResNet50 down to just 12.17MB with mixed-precision quantization, and we obtain 75.80% accuracy. Note that this is 0.82% higher than OMSE [19] with access to training data and 5.74% higher than the zero-shot version of OMSE. Importantly, note that OMSE keeps activation bits at 32-bits, while for this comparison our results use 8-bits for the activation (i.e., a smaller activation memory footprint than OMSE). For comparison, we include results for PACT [4], a standard quantization method that requires access to training data and also requires fine-tuning.
An important feature of the ZeroQ framework is that it can perform the quantization with very low computational overhead. For example, the end-to-end quantization of ResNet50 takes less than 30 seconds on a system with 8 Tesla V100 GPUs (one epoch training time on this system takes 100 minutes). In terms of timing breakdown, it takes 3s to generate the Distilled Data, 12s to compute the sensitivity for all layers of ResNet50, and 14s to perform Pareto frontier optimization.
We also show ZeroQ results on MobileNetV2 and compare it with both DFQ [32] and fine-tuning based methods [33, 18], as shown in Table I(b). For W8A8, ZeroQ has less than 0.12% accuracy drop as compared to baseline, and it achieves 1.71% higher accuracy as compared to the DFQ method.
Further compressing the model to W6A6 with mixed-precision quantization for weights, ZeroQ can still outperform Integer-Only [18] by 1.95% accuracy, even though ZeroQ does not use any data or fine-tuning. ZeroQ can achieve 68.83% accuracy even when the weight compression is 8×, which corresponds to using 4-bit quantization for weights on average.
We also experimented with percentile-based clipping to determine the quantization range [25] (please see Section D for details). The results corresponding to percentile-based clipping are denoted as ZeroQ† and reported in Table I. We found that using percentile-based clipping is helpful for low-precision quantization. Other choices for clipping methods have been proposed in the literature. Here we note that our approach is orthogonal to these improvements and that ZeroQ could be combined with these methods.
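As a rough illustration of percentile-based clipping, the sketch below picks the clipping range from two percentiles of the observed activations instead of the raw minimum and maximum, so that a few outliers do not stretch the quantization intervals. The function name, the nearest-rank rule, and the default percentile values are assumptions of this example; the settings actually used are those described in Section D.

```python
def percentile_range(values, lower_pct=0.1, upper_pct=99.9):
    """Return a clipping range (a, b) taken at the given percentiles.

    Uses a simple nearest-rank rule on the sorted values; the default
    percentiles are placeholders, not values taken from the paper.
    """
    ordered = sorted(values)
    n = len(ordered)

    def pick(pct):
        idx = int(round(pct / 100.0 * (n - 1)))
        return ordered[min(max(idx, 0), n - 1)]

    return pick(lower_pct), pick(upper_pct)


# Activations with one large outlier: min/max clipping would give b = 1e6.
activations = [float(v) for v in range(1000)] + [1e6]
a, b = percentile_range(activations, lower_pct=1.0, upper_pct=99.0)
```

The resulting `(a, b)` would then be used as the clipping range of the uniform quantizer.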
We also apply ZeroQ to quantize efficient and highly compact models such as ShuffleNet, whose model size is only 5.94MB. To the best of our knowledge, there exist no prior zero-shot quantization results for this model. ZeroQ achieves a small accuracy drop of 0.13% for W8A8. We can further quantize the model down to an average of 4-bits for weights, which achieves a model size of only 0.73MB, with an accuracy of 58.96%.
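Table II: Object detection with RetinaNet on Microsoft COCO. “No data” and “No FT” mark methods that require no training data and no fine-tuning, respectively; MP denotes mixed precision.
Method  No data  No FT  W-bit  A-bit  Size (MB)  mAP
Baseline  ✓  ✓  32  32  145.10  36.4
FQN [25]  ✗  ✗  4  4  18.13  32.5
ZeroQ  ✓  ✓  MP  8  18.13  33.7
ZeroQ  ✓  ✓  MP  6  24.17  35.9
ZeroQ  ✓  ✓  8  8  36.25  36.4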
We also compare with the recent Data-Free Compression (DFC) [11] method. There are two main differences between ZeroQ and DFC. First, DFC proposes a fine-tuning method to recover accuracy for ultra-low precision cases. This can be time-consuming and, as we show, it is not necessary. In particular, we show that with mixed-precision quantization one can actually achieve higher accuracy without any need for fine-tuning. This is shown in Table III for ResNet18 quantization on ImageNet. In particular, note the results for W4A4, where the DFC method without fine-tuning results in more than 15% accuracy drop with a final accuracy of 55.49%. For this reason, the authors propose a method with post-quantization training, which can boost the accuracy to 68.05% using W4A4 for intermediate layers, and 8-bits for the first and last layers. In contrast, ZeroQ achieves a higher accuracy of 69.05% without any need for fine-tuning. Furthermore, the end-to-end zero-shot quantization of ResNet18 takes only 12s on an 8-V100 system (equivalent to about 0.4% of the 45-minute time for one epoch training of ResNet18 on ImageNet). Second, the DFC method uses Inceptionism [31] to facilitate the generation of data with random labels, but it is hard to extend this to object detection and image segmentation tasks.
Table III: Quantization of ResNet18 on ImageNet. “No data” and “No FT” mark methods that require no training data and no fine-tuning, respectively; MP denotes mixed precision.
Method  No data  No FT  W-bit  A-bit  Size (MB)  Top-1
Baseline  –  –  32  32  44.59  71.47
PACT [4]  ✗  ✗  4  4  5.57  69.20
DFC [11]  ✓  ✓  4  4  5.58  55.49
DFC [11]  ✓  ✗  4  4  5.58  68.06
ZeroQ  ✓  ✓  MP  4  5.57  –
ZeroQ†  ✓  ✓  MP  4  5.57  69.05
Integer-Only [18]  ✗  ✗  6  6  8.36  67.30
DFQ [32]  ✓  ✓  6  6  8.36  66.30
ZeroQ  ✓  ✓  MP  6  8.35  71.30
RVQuant [33]  ✗  ✗  8  8  11.15  70.01
DFQ [32]  ✓  ✓  8  8  11.15  69.70
DFC [11]  ✓  ✗  8  8  11.15  69.57
ZeroQ  ✓  ✓  8  8  11.15  71.43
We include additional results of quantized ResNet152, InceptionV3, and SqueezeNext on ImageNet, as well as ResNet20 on CIFAR-10, in Appendix C.
Object detection is often much more complicated than ImageNet classification. To demonstrate the flexibility of our approach we also test ZeroQ on an object detection task on the Microsoft COCO dataset. RetinaNet [27] is a state-of-the-art single-stage detector, and we use the pretrained model with ResNet50 as the backbone, which can achieve 36.4 mAP. (Here we use the standard mAP 0.5:0.05:0.95 metric on the COCO dataset.)
One of the main differences between RetinaNet and the NNs we tested previously on ImageNet is that some convolutional layers in RetinaNet are not followed by BN layers. This is because of the presence of a feature pyramid network (FPN) [26], and it means that the number of BN layers is slightly smaller than that of convolutional layers. However, this is not a limitation and the ZeroQ framework still works well. Specifically, we extract the backbone of RetinaNet and create Distilled Data. Afterwards, we feed the Distilled Data into RetinaNet to measure the sensitivity as well as to determine the activation range for the entire NN. This is followed by optimizing for the Pareto frontier, discussed earlier.
The results are presented in Table II. We can see that for W8A8 ZeroQ has no performance degradation. For W6A6, ZeroQ achieves 35.9 mAP. Further quantizing the model to an average of 4-bits for the weights, ZeroQ achieves 33.7 mAP. Our results are comparable to the recent results of FQN [25], even though FQN is not a zero-shot quantization method (i.e., it uses the full training dataset and requires fine-tuning). However, it should be mentioned that ZeroQ keeps the activations at 8-bits, while FQN uses 4-bit activations.
Here, we present an ablation study for the two components of ZeroQ: (i) the Distilled Data generated by Eq. 3 to help sensitivity analysis and determine the activation clipping range; and (ii) the Pareto frontier method for automatic bit-precision assignment. Below we discuss the ablation study for each part separately.
In this work, all the sensitivity analysis and the activation range are computed on the Distilled Data. Here, we perform an ablation study on the effectiveness of Distilled Data as compared to using just Gaussian data. We use three different types of data sources, (i) Gaussian data with mean “0” and variance “1”, (ii) data from the training dataset, (iii) our Distilled Data, as the input data to measure the sensitivity and to determine the activation range. We quantize ResNet50 and MobileNetV2 to an average of 4-bit for weights and 8-bit for activations, and we report results in Table IV.
For ResNet50, using training data results in 75.95% testing accuracy. With Gaussian data, the performance degrades to 75.44%. ZeroQ can alleviate the gap between Gaussian data and training data and achieves 75.80%. For more compact/efficient models such as MobileNetV2, the gap between using Gaussian data and using training data increases to 2.33%. ZeroQ can still achieve 68.83%, which is only 0.23% lower than using training data. Additional results for ResNet18, ShuffleNet and SqueezeNext are shown in Table VIII.
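Table IV: Top-1 accuracy on ImageNet when the sensitivity and activation range are computed with different input data; MP denotes mixed precision.
Method  W-bit  A-bit  ResNet50  MobileNetV2
Baseline  32  32  77.72  73.03
Gaussian  MP  8  75.44  66.73
Training Data  MP  8  75.95  69.06
Distilled Data  MP  8  75.80  68.83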
Here, we perform an ablation study to show that the bit precision selected by the Pareto frontier method works well. To test this, we compare ZeroQ with two cases, one where we choose a bit configuration that corresponds to maximizing $\Omega_{sum}$ (which is the opposite of the minimization that we do in ZeroQ), and one case where we use random bit precision for different layers. We denote these two methods as Inverse and Random. The results for quantizing weights to an average of 4-bit and activations to 8-bit are shown in Table V. We report the best and worst testing accuracy as well as the mean and variance in the results out of 20 tests. It can be seen that ZeroQ results in significantly better testing performance as compared to Inverse and Random. Another noticeable point is that the best configuration (i.e., minimum $\Omega_{sum}$) outperforms the worst case among the top-20 configurations from ZeroQ by 0.18%, which reflects the advantage of the Pareto frontier method. Also, notice the small variance of all configurations generated by ZeroQ.
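Table V: Top-1 accuracy of different bit-precision selection strategies (weights at an average of 4-bit, activations at 8-bit).
Method  Top-1 Accuracy
Baseline  77.72
Uniform  66.59
Method  Best  Worst  Mean  Var
Random  38.98  0.10  6.86  105.8
Inverse  0.11  0.06  0.07  3.0×10^-4
ZeroQ  75.80  75.62  75.73  2.4×10^-3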
We have introduced ZeroQ, a novel post-training quantization method that does not require any access to the training/validation data. Our approach uses a novel method to distill an input data distribution to match the statistics in the batch normalization layers of the model. We show that this Distilled Data is very effective in capturing the sensitivity of different layers of the network. Furthermore, we present a Pareto frontier method to select automatically the bit-precision configuration for mixed-precision settings. An important aspect of ZeroQ is its low computational overhead. For example, the end-to-end zero-shot quantization time of ResNet50 is less than 30 seconds on an 8-V100 GPU system. We extensively test ZeroQ on various datasets and models. This includes various ResNets, InceptionV3, MobileNetV2, ShuffleNet, and SqueezeNext on ImageNet, ResNet20 on CIFAR-10, and even RetinaNet for object detection on the Microsoft COCO dataset. We consistently achieve higher accuracy with the same or smaller model size compared to previous post-training quantization methods. All results show that ZeroQ could exceed previous zero-shot quantization methods. We have open-sourced the ZeroQ framework [16].
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Proceedings of the 36th International Conference on Machine Learning, pp. 7543–7552.
Thirty-Second AAAI Conference on Artificial Intelligence.