ClusterQ: Semantic Feature Distribution Alignment for Data-Free Quantization

Network quantization has emerged as a promising method for model compression and inference acceleration. However, traditional quantization methods (such as quantization-aware training and post-training quantization) require the original data to fine-tune or calibrate the quantized model, which makes them inapplicable when the original data cannot be accessed due to privacy or security concerns. This gives rise to data-free quantization (DFQ) with synthetic data generation. However, current DFQ methods still suffer from severe performance degradation when quantizing a model to lower bit-widths, caused by the low inter-class separability of semantic features. To this end, we propose a new and effective data-free quantization method termed ClusterQ, which utilizes semantic feature distribution alignment for synthetic data generation. To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics to imitate the distribution of real data, so that the performance degradation is alleviated. Moreover, we incorporate intra-class variance to solve class-wise mode collapse. We also employ an exponential moving average to update the centroid of each cluster for further feature distribution improvement. Extensive experiments across various deep models (e.g., ResNet-18 and MobileNet-V2) on the ImageNet dataset demonstrate that our ClusterQ obtains state-of-the-art performance.


I Introduction

Deep neural network (DNN)-based models have obtained remarkable progress on computer vision tasks due to their strong representation ability [krizhevsky2012imagenet, he2016deep, szegedy2015going, redmon2016you, wei2021deraincyclegan]. However, DNN models usually suffer from high computational complexity and massive parameters, and large DNN models require frequent memory access, which leads to much more energy consumption and inference latency [han2016eie]. Moreover, it is still challenging to deploy them on edge devices due to the limited memory bandwidth, inference ability and energy budget.

To solve the aforementioned issues, numerous model compression methods have emerged to improve the efficiency of DNN models, e.g., pruning [liu2018rethinking, esposito1997comparative, markel1971fft, ruan2020edp, chen2020dynamical, ioannidis2020efficient], quantization [jacob2018quantization, banner2018post, choi2018pact, courbariaux2015binaryconnect, esser2019learned, kim2020exploiting, cai2020zeroq, choi2021qimera, xu2020generative, zhong2021fine, nahshan2021loss, zhu2021autorecon, zhang2021diversifying], light-weight architecture design [sandler2018mobilenetv2, howard2019searching, ma2018shufflenet, liu2018darts], low-rank factorization [denton2014exploiting, jaderberg2014speeding, lebedev2014speeding, yu2017compressing, 9679094] and knowledge distillation [yim2017gift, cheng2020explaining, lopes2017data, wang2020real]. Different from other model compression methods, model quantization can be implemented in real-scenario model deployment, with low-precision computation supported on general hardware. Briefly, the model quantization paradigm converts floating-point values into low-bit integers for model compression [jacob2018quantization]. As such, less memory access is needed and computation latency is reduced during model inference, which makes it possible to deploy large DNN models on edge devices for real-time applications.

Due to the limited representation ability of low-bit values, model quantization usually introduces noise, which potentially results in performance degradation in practice. To recover the quantized model performance, Quantization Aware Training (QAT) performs backward propagation to retrain the quantized model [courbariaux2015binaryconnect, choi2018pact, esser2019learned, kim2020exploiting]. However, QAT is usually time-consuming and hard to implement, so Post Training Quantization (PTQ), as an alternative, aims at adjusting the weights of the quantized model without training [banner2018post, nahshan2021loss, zhong2021fine]. Note that both QAT and PTQ need the original training data for quantization, whereas access to training data may be severely prohibited due to privacy or proprietary rules in real scenarios, e.g., user data, military information, or medical images. As a result, real-world applications of QAT and PTQ may be restricted.

Fig. 1: t-SNE visualization comparison of deep-layer features of ResNet-20 [he2016deep] inferring on the CIFAR10 dataset [krizhevsky2009learning] (a), and on the synthetic data generated by ZeroQ [cai2020zeroq] (b).
Fig. 2: Overview of the proposed ClusterQ scheme. Based on the Conditional Generative Adversarial Network (CGAN) [mirza2014conditional] mechanism, we perform clustering and alignment on the batch normalization statistics of semantic features to obtain high inter-class separability.

Recently, Data-Free Quantization (DFQ) has emerged as a more promising method for practical applications without access to any training data; it aims at restoring the performance of the quantized model by generating synthetic data, similar to data-free knowledge distillation [lopes2017data]. Current DFQ methods can be roughly divided into two categories, i.e., without fine-tuning and with fine-tuning. Pioneering work on DFQ without fine-tuning, like ZeroQ [cai2020zeroq], generates calibration data that match the batch normalization (BN) statistics of the model to clip the range of activation values. However, models compressed in this way often suffer a significant reduction in accuracy when quantized to lower precision. In contrast, DFQ with fine-tuning applies a generator to produce synthetic data and adjusts the parameters of the quantized model to retain higher performance. For example, GDFQ [xu2020generative] learns a classification boundary and generates data with a Conditional Generative Adversarial Network (CGAN) mechanism [mirza2014conditional].

Although recent studies have devoted considerable effort to DFQ, the obtained improvements are still limited compared with PTQ, due to the gap between synthetic data and real-world data. As such, how to make the generated synthetic data closer to real-world data for fine-tuning is a crucial issue to be solved. To close the gap, we explore the pre-trained model information at a fine-grained level. According to [li2016revisiting, wan2018rethinking], when a DNN model infers on real data, the distributions of semantic features can be clustered for classification, i.e., the inter-class separability property of semantic features. This property has also been widely used in domain adaptation to align the distributions of different domains. However, synthetic data generated by current DFQ methods (such as ZeroQ [cai2020zeroq]) cannot produce semantic features with high inter-class separability in the quantized model, as shown in Figure 1. Based on this phenomenon, we hypothesize that high inter-class separability will reduce the gap between synthetic data and real-world data. Note that this property has also been explored by FDDA [zhong2021fine], which augments the calibration dataset of real data for PTQ. However, no data-free quantization method yet imitates the real-data distribution with inter-class separability.

From this perspective, we propose effective strategies to generate synthetic data that yield features with high inter-class separability while maintaining the generalization performance of the quantized model in the data-free case. In summary, the major contributions of this paper are described as follows:

  1. Technically, we propose a new and effective data-free quantization scheme, termed ClusterQ, via feature distribution clustering and alignment, as shown in Figure 2. As can be seen, ClusterQ formulates the DFQ problem as a data-free domain adaptation task to imitate the distribution of the original data. To the best of our knowledge, ClusterQ is the first DFQ scheme to utilize feature distribution alignment with clusters.

  2. This study also reveals that high inter-class separability of the semantic features is critical for synthetic data generation, which directly impacts the quantized model performance. We quantize and fine-tune the DNN model with a novel synthetic data generation approach without any access to the original data. To achieve high inter-class separability, we propose a Semantic Feature Distribution Alignment (SFDA) method, which clusters and aligns the feature distribution to the centroids for close-to-reality data generation. For further performance improvement, we introduce the intra-class variance [wang2020intra] to enhance data diversity and the exponential moving average (EMA) to update the cluster centroids.

  3. Based on the clustered and aligned semantic feature distributions, our ClusterQ can effectively alleviate the performance degradation, and obtain state-of-the-art results on a variety of popular deep models.

The rest of this paper is organized as follows. In Section II, we review the related work. The details of our method are elaborated in Section III. In Sections IV and V, we present experimental results and analysis. The conclusion and perspectives on future work are finally discussed in Section VI.

II Related Work

We briefly review the low-bit quantization methods that are closely related to our study. For a comprehensive overview of model quantization, we refer readers to [gholami2021survey].

II-A Quantization Aware Training (QAT)

To avoid performance degradation of the quantized model, QAT was first proposed to retrain the quantized model [courbariaux2015binaryconnect, choi2018pact, esser2019learned, kim2020exploiting]. With the full training dataset, QAT performs floating-point forward and backward propagation on DNN models and quantizes them into low-bit after each training epoch. Thus, QAT can quantize models to extremely low precision while retaining the performance. In particular, PACT [choi2018pact] optimizes the clipping ranges of activations during model retraining, LSQ [esser2019learned] learns the step size as a model parameter, and MPQ [kim2020exploiting] exploits retraining-based mixed-precision quantization. However, the high computational complexity of QAT restricts its practical implementation.

II-B Post Training Quantization (PTQ)

PTQ is proposed for efficient quantization [banner2018post, nahshan2021loss, zhong2021fine]. Requiring only a small amount of training data and less computation, PTQ methods are able to quantize models into low-bit precision with little performance degradation. In particular, [banner2018post] proposes a clipping-range optimization method with bias correction and channel-wise bit allocation for 4-bit quantization, [nahshan2021loss] explores the interactions between layers and proposes layer-wise 4-bit quantization, and [zhong2021fine] explores a calibration dataset with synthetic data for PTQ. However, the above methods all require some original training data, and they are inapplicable when the original data cannot be accessed.

II-C Data-Free Quantization (DFQ)

For the case without original data, recent studies have made great efforts on DFQ to generate close-to-reality data for model fine-tuning or calibration [cai2020zeroq, xu2020generative, choi2021qimera, zhang2021diversifying, zhu2021autorecon]. Current DFQ methods can be roughly divided into two categories, i.e., without fine-tuning and with fine-tuning. Pioneering work on DFQ without fine-tuning, like ZeroQ [cai2020zeroq], generates calibration data that match the batch normalization (BN) statistics. DSG [zhang2021diversifying] identifies the homogenization of synthetic data and enhances the diversity of the generated data. However, these methods lead to a significant reduction in accuracy when quantizing to lower precision. In contrast, DFQ with fine-tuning applies a generator to produce synthetic data and adjusts the parameters of the quantized model to retain higher performance. For example, GDFQ [xu2020generative] employs a Conditional Generative Adversarial Network (CGAN) [mirza2014conditional] mechanism and generates a dataset for fine-tuning. AutoReCon [zhu2021autorecon] enhances the generator architecture via neural architecture search. Qimera [choi2021qimera] exploits boundary-supporting samples to enhance the classification boundary, whereas it tends to cause mode collapse and reduce the generalization ability of the quantized model.

III ClusterQ: Semantic Feature Distribution Alignment for DFQ

For easy implementation on hardware, our ClusterQ scheme employs symmetric uniform quantization, which maps and rounds the floating-point values of the full-precision model to low-bit integers. Given a floating-point value $x$ in a tensor to be quantized, the quantization can be defined as follows:

$x_q = \mathrm{round}\left(\mathrm{clip}(x, -a, a)/s\right), \quad s = a/(2^{b-1}-1),$   (1)

where $x_q$ is the quantized value, $b$ is the quantization bit width, $a$ denotes the clipping range, $s$ is the scaling factor that maps a floating-point value within the clipping range into the range $[-(2^{b-1}-1), 2^{b-1}-1]$, and $\mathrm{round}(\cdot)$ represents the rounding operation. For most symmetric uniform quantization, $a$ is defined by the maximum of the absolute values, i.e., $a = \max(|x|)$, so that all of the values can be represented. Then, we can easily obtain the dequantized value $\hat{x}$ as follows:

$\hat{x} = s \cdot x_q.$   (2)

Due to the poor representation ability of the limited bit width, there exists a quantization error between the dequantized value $\hat{x}$ and the original floating-point value $x$, which introduces quantization noise and may lead to accuracy loss.
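To make Eqs. (1)-(2) concrete, the following PyTorch sketch quantizes and dequantizes a tensor with symmetric uniform quantization; the function names and the default bit width are illustrative choices of ours and not taken from a released implementation.

import torch

def quantize(x: torch.Tensor, bit: int = 4):
    """Symmetric uniform quantization (a minimal sketch of Eq. (1))."""
    a = x.abs().max()                        # clipping range a = max|x|
    s = a / (2 ** (bit - 1) - 1)             # scaling factor s
    q_min, q_max = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
    x_q = torch.clamp(torch.round(x / s), q_min, q_max)
    return x_q, s

def dequantize(x_q: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Recover the floating-point approximation (a sketch of Eq. (2))."""
    return s * x_q

# The quantization error grows as the bit width shrinks.
w = torch.randn(64, 64)
for bit in (8, 4):
    w_q, s = quantize(w, bit)
    err = (dequantize(w_q, s) - w).abs().mean().item()
    print(f"{bit}-bit mean abs error: {err:.4f}")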

Fig. 3: t-SNE visualization results of the deep-layer features of the ResNet-20 model inferring on CIFAR-10. Panels (a) to (f) correspond to the features from the 14th layer to the 19th layer. The inter-class separability is enhanced as the layer gets deeper.

To recover the quantized model performance, there exist two challenges for DFQ methods: (1) for static activation quantization, the clipping range of activation values should be determined without access to the training data; (2) to recover the degraded performance, fine-tuning is used to adjust the weights of the quantized model without training data. To solve these challenges, current DFQ methods try to generate synthetic data that are similar to the original training data. For example, GDFQ [xu2020generative] employs a CGAN-based mechanism for fake sample generation. Given a fixed original full-precision model as the discriminator, a generator is trained to produce synthetic data that are close to the original training data. We refer to [xu2020generative] for more details.

However, without clustering and alignment of the semantic feature distributions, the generated synthetic data used for fine-tuning the quantized model lead to limited performance recovery. According to [li2016revisiting], the traits of a data domain are contained in the semantic feature distributions. The knowledge of the full-precision pre-trained model can thus be further exploited for synthetic data generation by clustering the semantic feature distributions. From our perspective, this is the most critical factor for the performance recovery of the quantized model.

To utilize the distribution of semantic features, we further exploit the Batch Normalization (BN) statistics [ioffe2015batch] to imitate the original distribution. Next, we briefly review the BN layer in DNN models, which is designed to alleviate internal covariate shift. Formally, given a mini-batch input $x$, the BN layer transforms the input into the following expression:

$\hat{x} = \gamma \cdot \frac{x - \mu}{\sigma} + \beta,$   (3)

where $x$ and $\hat{x}$ denote the input and output of the BN layer respectively, $\mu$ and $\sigma$ are the mean and standard deviation of the mini-batch, and $\gamma$ and $\beta$ denote the parameters learned during training. After training, the distribution of the input to each layer is stable across the training data.
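As a concrete illustration of how these stored statistics can be read from a pre-trained model, the sketch below collects the running mean, standard deviation and affine parameters of every BN layer; it uses a torchvision ResNet-18 purely as an example, not the pytorchcv models used in our experiments.

import torch.nn as nn
from torchvision.models import resnet18

# Gather the BN statistics stored in a pre-trained full-precision model
# (newer torchvision versions prefer the `weights=` argument over `pretrained=True`).
model = resnet18(pretrained=True).eval()

bn_stats = []
for name, m in model.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        bn_stats.append({
            "layer": name,
            "mean": m.running_mean.clone(),        # mu
            "std": m.running_var.sqrt().clone(),   # sigma
            "gamma": m.weight.clone(),             # learned scale
            "beta": m.bias.clone(),                # learned shift
        })
print(len(bn_stats), "BN layers collected")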

Fig. 4: The structure of the SFDA method. BN statistics in each semantic layer are class-wisely extracted, clustered and aligned to the corresponding centroids. The SFDA loss is computed to update the generator. The pseudo labels of centroids, statistics and synthetic data are represented by different colors.

III-A Proposed Framework

The overview of our proposed ClusterQ is presented in Figure 2, which is based on the CGAN mechanism. Specifically, ClusterQ employs the fixed full-precision model as a discriminator. The generator is trained with the loss $\mathcal{L}_G$ to produce fake data, which are then used to fine-tune the quantized model by computing the loss $\mathcal{L}_Q$.

The loss $\mathcal{L}_G$ contains $\mathcal{L}_{CE}$ for classification and $\mathcal{L}_{BNS}$ for global distribution information matching. More importantly, $\mathcal{L}_G$ introduces the SFDA loss $\mathcal{L}_{SFDA}$ for distribution clustering and alignment to achieve inter-class separability in the semantic layers. Thus, the synthetic data can imitate the distributions of real data at the feature level of the pre-trained model. To adapt to the distribution change during generator training, we implement dynamic centroid updating via EMA. Moreover, to avoid mode collapse, we introduce the intra-class variance loss $\mathcal{L}_{ICV}$ to improve the diversity of the synthetic data.

To highlight our motivation on the inter-class separability of semantic features, we conduct pilot experiments on the DNN features to observe how this separability evolves over different layers, as illustrated in Figure 3. As the layers get deeper, the feature distributions become more separable and can be easily clustered or grouped. Specifically, we can easily distinguish the features of the 18th and 19th layers (see Figures 3(e) and 3(f)), while the boundaries of the clusters become blurred in the 16th and 17th layers (see Figures 3(c) and 3(d)). For the shallower layers (see Figures 3(a) and 3(b)), almost no boundary exists.
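The following sketch shows one way such t-SNE plots can be produced: a forward hook collects the activations of a chosen layer and scikit-learn's TSNE projects them to two dimensions. The helper name and the choice of layer are illustrative only.

import torch
import torch.nn as nn
from sklearn.manifold import TSNE

def tsne_embed(model: nn.Module, layer: nn.Module, images: torch.Tensor):
    """Project the activations of `layer` for a batch of images to 2-D with t-SNE.
    Use a batch larger than the t-SNE perplexity (default 30)."""
    feats = []
    handle = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).detach().cpu()))
    model.eval()
    with torch.no_grad():
        model(images)              # activations are captured by the hook
    handle.remove()
    return TSNE(n_components=2).fit_transform(torch.cat(feats).numpy())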

Based on the high inter-class separability of semantic features, we can model the semantic feature distribution as a Gaussian distribution [banner2018post]. That is, the semantic feature statistics for different classes will also be clustered into groups. As such, we directly utilize the Batch Normalization statistics, which store the running statistics, for feature clustering and alignment.

The structure of SFDA is shown in Figure 4. In the fine-tuning process of the quantized model, the running BN statistics corresponding to the given pseudo labels are extracted and aligned to the centroids in each layer. The distance between the running statistics and the centroids is computed to update the generator $G$. The SFDA process is elaborated below.

  1. First, after the generator warms up, given the pre-trained full-precision model, we initialize the centroids for each class in each semantic layer. Note that the warm-up process is a prerequisite for centroid initialization in order to generate synthetic data with diversity. To initialize the centroids, we pass the pseudo label of each class to the generator, run the full-precision model on the synthetic data and extract the corresponding BN statistics in each semantic layer.

  2. Then, we formulate the problem as a domain adaptation task, and treat the centroids and the running BN statistics as the target distribution and the source distribution, respectively. As such, we perform distribution alignment in each semantic layer. The Euclidean distance between the running BN statistics and the centroids is calculated by the following SFDA loss function to align them:

     $\mathcal{L}_{SFDA} = \sum_{c=1}^{C} \sum_{l=L_s}^{L} \left( \left\| \mu_l^c - \tilde{\mu}_l^c \right\|_2^2 + \left\| \sigma_l^c - \tilde{\sigma}_l^c \right\|_2^2 \right),$   (4)

     where $\mu_l^c$ and $\sigma_l^c$ are the mean and standard deviation for class $c$ at the $l$-th layer of the full-precision model, computed in the process of generator training, and $\tilde{\mu}_l^c$ and $\tilde{\sigma}_l^c$ represent the corresponding mean and standard deviation of the centroids, respectively. $L_s$ denotes the starting layer that contains semantic features, $L$ the last layer, and $C$ denotes the number of classes. To avoid imbalance among categories caused by random labels, we traverse all categories by employing the pseudo labels and compute the SFDA loss independently.

According to our experimental results, the SFDA process significantly encourages the generator to produce synthetic data whose semantic features have high inter-class separability. During the fine-tuning process, the learned classification boundary will be further enhanced. In addition, to avoid misclassification caused by the pre-trained model, or by the gap between synthetic and real data, we discard the BN statistics obtained from misclassified synthetic data during generator training.
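A minimal sketch of the class-wise alignment in Eq. (4) is given below; the nested-dictionary layout of the statistics and centroids is an assumption made for illustration, not the data structure of our implementation.

def sfda_loss(batch_stats, centroids):
    """SFDA loss: squared Euclidean distance between the class-wise BN statistics of
    the synthetic batch and the cluster centroids, summed over the semantic layers.

    batch_stats[c][l] and centroids[c][l] map a class c and a semantic layer l
    to a (mean, std) pair of tensors; misclassified samples are filtered upstream."""
    loss = 0.0
    for c, layers in batch_stats.items():
        for l, (mu, sigma) in layers.items():
            mu_c, sigma_c = centroids[c][l]
            loss = loss + (mu - mu_c).pow(2).sum() + (sigma - sigma_c).pow(2).sum()
    return loss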

III-B Centroids Updating

The initialization of the centroids may be unstable for SFDA. First, centroid initialization is based on the assumption that the semantic feature distributions obtained from synthetic data and real data are close. However, due to the intrinsic limitation of the generator, even after warm-up there still remains a gap to the real data, which may lead to centroid mismatch and limit further distribution alignment. Moreover, the inter-class separability becomes more pronounced as generator training proceeds, so the original centroids will no longer be appropriate.

For these reasons, we need to update the centroids during generator training to alleviate these negative effects. Thus, we update the centroids with the running BN statistics during generator training. Viewing SFDA as a clustering method, we apply an exponential moving average (EMA) to update the centroids as follows:

$\tilde{\mu}_l^c \leftarrow (1-\eta)\,\tilde{\mu}_l^c + \eta\,\mu_l^c, \qquad \tilde{\sigma}_l^c \leftarrow (1-\eta)\,\tilde{\sigma}_l^c + \eta\,\sigma_l^c,$   (5)

where $\mu_l^c$ and $\sigma_l^c$ denote the running mean and standard deviation corresponding to class $c$, respectively, and $\eta$ is the decay rate of the EMA, which trades off the importance of the previous and current BN statistics. In this way, the BN centroids make the SFDA process a grouping method with a decentralization property. We provide experimental results to demonstrate the performance gain obtained from centroid updating.
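The update itself is a one-line moving average per class and layer; the sketch below uses the same assumed data layout as the SFDA sketch above, and the default decay rate is a placeholder rather than the tuned value used in our experiments.

def ema_update(centroids, batch_stats, decay=0.5):
    """Update each class centroid with the current class-wise BN statistics via EMA
    (a sketch of Eq. (5)); here `decay` weights the statistics of the current batch."""
    for c, layers in batch_stats.items():
        for l, (mu, sigma) in layers.items():
            mu_c, sigma_c = centroids[c][l]
            centroids[c][l] = ((1 - decay) * mu_c + decay * mu,
                               (1 - decay) * sigma_c + decay * sigma)
    return centroids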

III-C Intra-Class Variance

Fig. 5: The effect of intra-class variance. With the introduction of the intra-class variance loss $\mathcal{L}_{ICV}$, the BN statistic distribution is allowed to shift around the centroids and follow a Gaussian distribution. As a result, mode collapse is mitigated in data generation.

Although our proposed ClusterQ can obtain high inter-class separability of semantic features, the distribution alignment may also cause vulnerability to mode collapse, which would degrade the generalization performance of the quantized model. That is, the distribution of real data cannot be fully covered by the synthetic data. For example, given Gaussian input, some generators produce data in a fixed mode.

To expand the mode coverage, we employ a simple method following [zhong2021fine] to shift the BN statistic distribution around the cluster. Specifically, since the semantic feature distribution approximately follows a Gaussian distribution, we introduce Gaussian noise to increase the intra-class discrepancy within clusters and define the intra-class variance loss as

$\mathcal{L}_{ICV} = \sum_{c=1}^{C} \sum_{l=L_s}^{L} \left( \left\| \mu_l^c - (\tilde{\mu}_l^c + \epsilon_{\mu}) \right\|_2^2 + \left\| \sigma_l^c - (\tilde{\sigma}_l^c + \epsilon_{\sigma}) \right\|_2^2 \right), \quad \epsilon_{\mu} \sim \mathcal{N}(0, \alpha_{\mu}), \ \epsilon_{\sigma} \sim \mathcal{N}(0, \alpha_{\sigma}),$   (6)

where $\epsilon_{\mu}$ and $\epsilon_{\sigma}$ denote Gaussian noise, and $\alpha_{\mu}$ and $\alpha_{\sigma}$ denote the distortion levels that control the intra-class variance. In this way, the running mean and standard deviation for each class are allowed to shift within a dynamic range around the cluster centroids $\tilde{\mu}_l^c$ and $\tilde{\sigma}_l^c$, respectively. As shown in Figure 5, the semantic feature distribution space cannot be covered without intra-class variance, so the generated data encounter mode collapse and lead to poor performance. In contrast, diverse images can be produced with the introduction of the intra-class variance loss $\mathcal{L}_{ICV}$. Experiments verify the effect of the intra-class variance loss in mitigating mode collapse during synthetic data generation.
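A sketch of this intra-class variance term, under the same assumed data layout as the SFDA sketch above, is shown below; the default distortion levels are placeholders rather than the values used in our experiments.

import torch

def icv_loss(batch_stats, centroids, alpha_mu=0.1, alpha_sigma=0.1):
    """Intra-class variance loss (a sketch of Eq. (6)): align the class-wise statistics
    to Gaussian-perturbed centroids, so the statistics may drift inside each cluster."""
    loss = 0.0
    for c, layers in batch_stats.items():
        for l, (mu, sigma) in layers.items():
            mu_c, sigma_c = centroids[c][l]
            noisy_mu = mu_c + alpha_mu * torch.randn_like(mu_c)             # epsilon_mu
            noisy_sigma = sigma_c + alpha_sigma * torch.randn_like(sigma_c) # epsilon_sigma
            loss = loss + (mu - noisy_mu).pow(2).sum() + (sigma - noisy_sigma).pow(2).sum()
    return loss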

III-D Training Process

For a better understanding of our quantization scheme, we summarize the whole training process in Algorithm 1. With the low-bit model quantized by Eq. (1) and the full-precision model as the discriminator, our ClusterQ scheme trains the generator to produce synthetic data and updates the parameters of the quantized model alternately. Note that our implementation is based on the framework of GDFQ [xu2020generative].

At the beginning of the generator training, i.e., the warm-up process, we fix the weights and BN statistics of the quantized model to avoid updating them, because the generated synthetic data still lack diversity and textures. The loss function $\mathcal{L}_G^{warm}$ is denoted as follows:

$\mathcal{L}_G^{warm} = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{BNS},$   (7)

where $\lambda_1$ is a trade-off parameter. The term $\mathcal{L}_{CE}$ utilizes the cross-entropy loss with the given Gaussian noise $z$ and pseudo labels $y$ to update the generator $G$:

$\mathcal{L}_{CE} = \mathbb{E}_{z,y}\left[\mathrm{CE}\big(M(G(z\,|\,y)),\ y\big)\right],$   (8)

where $M$ denotes the full-precision model. The term $\mathcal{L}_{BNS}$ denotes the loss that matches the BN statistics in each layer, defined as follows:

$\mathcal{L}_{BNS} = \sum_{l=1}^{L} \left( \left\| \mu_l - \mu_l^{BN} \right\|_2^2 + \left\| \sigma_l - \sigma_l^{BN} \right\|_2^2 \right),$   (9)

where $\mu_l$ and $\sigma_l$ are the running mean and standard deviation induced by the synthetic data at the $l$-th layer, while $\mu_l^{BN}$ and $\sigma_l^{BN}$ are the original mean and standard deviation stored in the BN layer at the $l$-th layer of the full-precision model $M$. Note that $\mathcal{L}_{BNS}$ is totally different from the SFDA loss $\mathcal{L}_{SFDA}$, even if they look somewhat similar: $\mathcal{L}_{BNS}$ matches global statistics over all layers, whereas $\mathcal{L}_{SFDA}$ aligns the class-wise statistics of the semantic layers to their cluster centroids.
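For contrast with the class-wise SFDA sketch, the snippet below computes this global matching term; gathering the per-layer batch statistics `current_stats` with forward hooks is assumed and not shown, and the helper name is ours.

import torch.nn as nn

def bns_loss(model: nn.Module, current_stats):
    """Match the mean/std induced by the synthetic batch in every BN layer to the
    statistics stored in the full-precision model (a sketch of Eq. (9)).

    current_stats[name] = (batch_mean, batch_std) for each BN layer name."""
    loss = 0.0
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and name in current_stats:
            mu, sigma = current_stats[name]
            loss = loss + (mu - m.running_mean).pow(2).sum() \
                        + (sigma - m.running_var.sqrt()).pow(2).sum()
    return loss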

After finishing the warm-up process, we utilize the synthetic data to fine-tune the quantized model and initialize the BN statistic centroids. Then, the SFDA loss $\mathcal{L}_{SFDA}$ and the intra-class variance loss $\mathcal{L}_{ICV}$ are added to the loss function for generator training, formulated as

$\mathcal{L}_G = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{BNS} + \lambda_2 \mathcal{L}_{SFDA} + \lambda_3 \mathcal{L}_{ICV},$   (10)

where $\lambda_2$ and $\lambda_3$ are trade-off parameters. After that, the centroids are updated by EMA.

To fine-tune the quantized model $Q$, we use the following loss function $\mathcal{L}_Q$:

$\mathcal{L}_Q = \mathcal{L}_{CE}^{Q} + \lambda_4 \mathcal{L}_{KD},$   (11)

where $\lambda_4$ is a trade-off parameter. With the synthetic data $\tilde{x}$ and the corresponding pseudo labels $y$, the term $\mathcal{L}_{CE}^{Q}$ utilizes the cross-entropy loss function to update the parameters of the quantized model $Q$ as follows:

$\mathcal{L}_{CE}^{Q} = \mathbb{E}_{\tilde{x},y}\left[\mathrm{CE}\big(Q(\tilde{x}),\ y\big)\right],$   (12)

and the knowledge distillation loss function $\mathcal{L}_{KD}$, based on the Kullback-Leibler (KL) divergence, is employed to compare the outputs of the quantized model $Q$ and the full-precision model $M$, formulated as follows:

$\mathcal{L}_{KD} = \mathbb{E}_{\tilde{x}}\left[\mathrm{KL}\big(M(\tilde{x})\,\|\,Q(\tilde{x})\big)\right].$   (13)
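A common PyTorch realization of such a distillation term is sketched below; the temperature argument is an assumption of ours, as the text does not specify one.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between the softened outputs of the quantized (student) model
    and the full-precision (teacher) model, in the spirit of Eq. (13)."""
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=1),
                    F.softmax(teacher_logits / t, dim=1),
                    reduction="batchmean") * (t * t)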

Input: Generator $G$ with random initialization, pre-trained full-precision model $M$.
Parameter: Number of training epochs $T$, number of warm-up epochs $T_w$ and number of fine-tuning steps $S$.
Output: Trained generator $G$ and quantized model $Q$.

1:  Quantize $M$ and obtain the quantized model $Q$.
2:  Fix the BN statistics of the quantized model $Q$.
3:  for epoch = 1 to $T$ do
4:     if epoch ≤ $T_w$ then
5:        Train the generator $G$ with $\mathcal{L}_G^{warm}$ in Eq. (7).
6:     else
7:        if epoch = $T_w + 1$ then
8:           Initialize the centroids.
9:        else
10:           for step = 1 to $S$ do
11:              Generate synthetic data $\tilde{x}$ with Gaussian noise $z$ and pseudo labels $y$.
12:              Train the generator $G$ with $\mathcal{L}_G$ in Eq. (10).
13:              Update the centroids with EMA in Eq. (5).
14:              Fine-tune $Q$ with $\mathcal{L}_Q$ in Eq. (11).
15:           end for
16:        end if
17:     end if
18:  end for
Algorithm 1 ClusterQ Training

Note that the parameters of the full-precision model $M$ are fixed during the whole training process.

IV Experiments

IV-A Experimental Setting

We compare each method on several popular datasets, including CIFAR10, CIFAR100 [krizhevsky2009learning] and ImageNet (ILSVRC12) [deng2009imagenet]. With 60 thousand images of 32×32 pixels, the CIFAR10 and CIFAR100 datasets contain 10 and 100 categories for classification, respectively. ImageNet has 1000 categories for classification, with about 1.2 million training images and 50 thousand images for validation.

For the experiments, we perform quantization on ResNet-18 [he2016deep] and MobileNet-V2 [sandler2018mobilenetv2] on ImageNet, and on ResNet-20 on CIFAR10 and CIFAR100. All experiments are conducted on an NVIDIA RTX 2080Ti GPU with PyTorch [paszke2017automatic]. Note that all of the pre-trained model implementations and weights are provided by Pytorchcv (Computer vision models on PyTorch: https://pypi.org/project/pytorchcv/).

For the implementation, we follow some hyperparameter settings of GDFQ [xu2020generative], including the total number of training epochs and the number of fine-tuning iterations. The first epochs are used for the warm-up process and the remaining epochs update the generator and the quantized model alternately. For the trade-off parameters in Eqs. (10) and (11), we set a separate weight for $\mathcal{L}_{BNS}$, $\mathcal{L}_{SFDA}$, $\mathcal{L}_{ICV}$ and $\mathcal{L}_{KD}$; we also fix the EMA decay rate $\eta$ and the distortion levels $\alpha_{\mu}$ and $\alpha_{\sigma}$ of the Gaussian noise in $\mathcal{L}_{ICV}$. For the sake of implementation on hardware, we choose fixed-precision quantization for the experiments.

IV-B Comparison Results

DNN Model Precision Quantization Method Top1 Accuracy
ResNet-18 W32A32 Baseline 71.470%
W4A4 ZeroQ 20.770%
GDFQ 60.704%
DSG 34.530%
Qimera 63.840%
AutoReCon 61.600%
Ours 64.390%
W4A8 ZeroQ 51.176%
GDFQ 64.810%
Qimera 65.784%
Ours 67.826%
W8A8 GDFQ 70.788%
Qimera 70.664%
Ours 70.838%
MobileNet-V2 W32A32 Baseline 73.084%
W4A4 ZeroQ 10.990%
GDFQ 59.404%
Qimera 61.620%
AutoReCon 60.020%
Ours 63.328%
W4A8 ZeroQ 13.955%
GDFQ 64.402%
Qimera 66.486%
Ours 68.200%
W8A8 GDFQ 72.814%
Qimera 72.772%
Ours 72.82%
TABLE I: Comparison results of each method on the ImageNet dataset.
DNN Model Precision Quantization Method Top1 Accuracy
ResNet-20 W32A32 Baseline 70.33%
W4A4 ZeroQ 45.20%
GDFQ 63.91%
Qimera 65.10%
Ours 67.09%
W4A8 ZeroQ 58.606%
GDFQ 67.33%
Qimera 68.89%
Ours 69.68%
W8A8 ZeroQ 70.128%
GDFQ 70.39%
Qimera 70.40%
Ours 70.43%
TABLE II: Comparison results on CIFAR100 dataset.
DNN Model Precision Quantization Method Top1 Accuracy
ResNet-20 W32A32 Baseline 93.89%
W4A4 ZeroQ 73.53%
GDFQ 86.23%
Qimera 91.23%
Ours 92.06%
W4A8 ZeroQ 90.845%
GDFQ 93.74%
Qimera 93.63%
Ours 93.84%
W8A8 ZeroQ 93.94%
GDFQ 93.98%
Qimera 93.93%
Ours 94.07%
TABLE III: Comparison results on CIFAR10 dataset.

To demonstrate the performance of our ClusterQ, we compare it with several closely related methods, i.e., ZeroQ [cai2020zeroq], GDFQ [xu2020generative], Qimera [choi2021qimera], DSG [zhang2021diversifying] and AutoReCon [zhu2021autorecon]. The comparison results on ImageNet, CIFAR100 and CIFAR10 are reported in Tables I, II and III, respectively. Note that WnAm stands for quantization with n-bit weights and m-bit activations, and the baseline with W32A32 denotes the full-precision model accuracy. Some results are obtained by ourselves. Considering practical applications, we also conduct quantization experiments with different precision settings. Moreover, we choose bit numbers that are powers of two in all experiments to facilitate deployment.

IV-B1 Results on ImageNet

As can be seen in Table I, under the same precision settings based on ResNet-18 and MobileNet-V2, our method performs better than its competitors. Specifically, our method outperforms the most closely related GDFQ method by a large margin, especially at lower precision. Compared with the current state-of-the-art method Qimera, our method still performs better on MobileNet-V2, which is in fact harder to compress due to its smaller weights. One can also note that, as the precision is reduced, the representation ability of the quantized values becomes more limited and leads to more performance degradation. In this case, our ClusterQ retains the performance of the quantized model better than the other competitors.

IV-B2 Results on CIFAR10 and CIFAR100

From the results in Tables II and III based on ResNet-20, similar conclusions can be drawn. That is, our method surpasses the current state-of-the-art methods in terms of accuracy loss in the investigated cases. In other words, the generalization performance of our method across different models and datasets is verified.

IV-C Visual Analysis

Fig. 6: Synthetic data generated by the pre-trained ResNet-20 model on the CIFAR10 dataset. Each row denotes a different class, except for ZeroQ, which generates data without labels.
Fig. 7: Randomly selected synthetic data (label="ship") with the pre-trained ResNet-20 model on the CIFAR10 dataset. Note that "w/o $\mathcal{L}_{ICV}$" denotes the results from ClusterQ without the intra-class variance loss $\mathcal{L}_{ICV}$.

In addition to the above numerical results, we also perform a visual analysis of the generated synthetic data, which directly impacts the performance recovery of each quantized model. In Figure 6, we visualize the labeled synthetic data generated by existing methods (i.e., ZeroQ, GDFQ and Qimera) based on ResNet-20 over CIFAR10. We select the synthetic data with the label "ship" as an example and show the results in Figure 7.

As shown in Figure 6, due to the lack of label information, the data generated by ZeroQ have little class-wise discrepancy. For GDFQ, the generated data can be distinguished into different classes but contain few detailed textures. Based on SFDA, our ClusterQ can produce synthetic data with more useful information. With abundant color and texture, the data generated by Qimera are similar to ours. However, as shown in Figure 7, the little variance of the images within each class indicates that they encounter class-wise mode collapse. In contrast, by simultaneously considering the contribution of intra-class variance, the synthetic data of the same class generated by ClusterQ maintain variety in color, texture and structure. To illustrate the effect of intra-class variance, in Figure 7 we also visualize the synthetic data produced by ClusterQ without $\mathcal{L}_{ICV}$, which leads to class-wise mode collapse.

Model EMA Top1
ResNet-18 64.390%
- 63.646%
- 63.590%
- - 63.068%
TABLE IV: Ablation study results of ResNet-18 on the ImageNet dataset with the precision of W4A4.
Fig. 8: Sensitivity analysis of the decay rate $\eta$ of EMA for centroid updating. We conduct the experiments by quantizing ResNet-18 on the ImageNet dataset. The quantized model performs best at an intermediate value of $\eta$.
Fig. 9: Sensitivity analysis of the trade-off parameter $\lambda_3$ for the intra-class variance. We conduct the experiments by quantizing ResNet-18 on the ImageNet dataset. The performance of the quantized model first increases as $\lambda_3$ grows, and then falls when $\lambda_3$ becomes too large.

IV-D Ablation Studies

We first evaluate the effectiveness of each component in our ClusterQ, i.e., the intra-class variance and the EMA centroid updating. We quantize ResNet-18 to W4A4 on the ImageNet dataset and report the results in Table IV. We can see that without the intra-class variance or EMA, the performance improvement of the quantized model is limited. That is, both intra-class variance and EMA are important for our method.

Then, we analyze the sensitivity of our method to the decay rate $\eta$ in Figure 8. As described in Section III-B, the decay rate $\eta$ controls the centroid updating and trades off the previous and current BN statistics. The quantized model achieves the best result at an intermediate value of $\eta$. The performance is reduced when the decay rate is too low, since the centroids cannot adapt to the changing distribution, while an overly large $\eta$ makes the centroids fluctuate. Both situations lead to performance degradation.

In addition, to explore the effect of the trade-off parameter $\lambda_3$ for the intra-class variance loss, we conduct a series of experiments with different settings of $\lambda_3$. As shown in Figure 9, the performance of the quantized model first increases as $\lambda_3$ grows, which demonstrates that intra-class variance can improve the quality of the synthetic data and lead to performance promotion. However, the performance of the quantized model drops when $\lambda_3$ becomes too large, since an overly strong $\mathcal{L}_{ICV}$ breaks the classification boundary. In summary, $\lambda_3$ should be set with consideration of the model representation ability and the distribution of the original dataset.

V Discussion

V-A On Prior Information

It may be easy to misunderstand that our proposed ClusterQ method depends on prior information provided by the pseudo labels. We therefore clarify that the classification labels are represented as one-hot vectors and described by class indices during the whole quantization process. Thus, the only thing our framework needs is the number of classes rather than the specific classes. In fact, the number of classes can be obtained from the dimension of the weights in the last layer, even if we have no knowledge of the class information of the dataset. Then, we can create the pseudo labels with class indices and compute the loss function with the output.
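For instance, with a PyTorch classifier the class count can be read off the output dimension of the final fully-connected layer, as in the short sketch below (the helper name is ours).

import torch.nn as nn
from torchvision.models import resnet18

def infer_num_classes(model: nn.Module) -> int:
    """Read the number of classes from the last fully-connected layer."""
    last_fc = [m for m in model.modules() if isinstance(m, nn.Linear)][-1]
    return last_fc.out_features

print(infer_num_classes(resnet18()))  # 1000 for an ImageNet classifier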

V-B About Privacy and Secrecy

Prohibition of access to the original data is one of the most important motivations for DFQ methods. One may worry that the generator-based mechanism or the synthetic data generation will violate privacy. In fact, due to the black-box computing manner of deep learning and the limitations of current intelligent technologies, the synthetic images generated by our method still cannot be interpreted by human beings, as shown in Figures 6 and 7.

V-C Limitations of our ClusterQ

The proposed scheme utilizes the class-wise separability of the feature distribution and performs class-wise statistic alignment via a CGAN-like mechanism to improve the diversity of synthetic data. However, compared with methods without fine-tuning, such as ZeroQ, generator-based methods always require additional time and computation resources to train the generator. Moreover, for different computer vision tasks, we have to design a new generator with the embedding capability for the corresponding label format. For deep models without BN layers, e.g., ZeroDCE [guo2020zero], generative DFQ methods cannot distill the statistics directly from the pre-trained model.

VI Conclusion

We have investigated the problem of alleviating the performance degradation of model quantization by enhancing the inter-class separability of semantic features. Technically, a new and effective data-free quantization method referred to as ClusterQ is proposed. ClusterQ presents a new semantic feature distribution alignment for synthetic data generation, which can obtain high class-wise separability and enhance the diversity of the generated synthetic data. To further improve the feature distribution and the performance of data-free quantization, we also incorporate intra-class variance and an exponential moving average, so that the feature distributions are more accurate. Extensive experiments on different DNN models and datasets demonstrate that our method achieves state-of-the-art performance among current data-free quantization methods, especially for smaller network architectures. In future work, we will focus on extending our ClusterQ to other vision tasks. The deployment of our proposed data-free quantization method on edge devices will also be investigated.

References