
# SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation

Quantization of deep neural networks (DNN) has been proven effective for compressing and accelerating DNN models. Data-free quantization (DFQ) is a promising approach that needs no original datasets, suiting privacy-sensitive and confidential scenarios. However, current DFQ solutions degrade accuracy, need synthetic data to calibrate networks, and are time-consuming and costly. This paper proposes an on-the-fly DFQ framework with sub-second quantization time, called SQuant, which can quantize networks on inference-only devices with low computation and memory requirements. With a theoretical analysis of the second-order information of the DNN task loss, we decompose and approximate the Hessian-based optimization objective into three diagonal sub-items, which have different granularities corresponding to three dimensions of the weight tensor: element-wise, kernel-wise, and output channel-wise. Then, we progressively compose the sub-items and propose a novel data-free optimization objective in the discrete domain, minimizing the Constrained Absolute Sum of Error (or CASE in short), which surprisingly needs no dataset and is not even aware of the network architecture. We also design an efficient algorithm without back-propagation to further reduce the computation complexity of the objective solver. Finally, without fine-tuning or synthetic datasets, SQuant accelerates the data-free quantization process to a sub-second level with a >30% accuracy improvement over the existing data-free post-training quantization works on the evaluated models under 4-bit quantization. We have open-sourced the SQuant framework at https://github.com/clevercool/SQuant.


## 1 Introduction

With the widespread application of DNNs, more and more DNN models are deployed in computation- and memory-constrained environments, e.g., smartphones, IoT devices, and self-driving cars. The desire for lightweight and energy-efficient DNN deployment solutions is increasing. Quantization is one of the most promising techniques to convert weights and activations to lower-bit formats, simultaneously reducing computation time and memory consumption. There are two kinds of quantization: post-training quantization (PTQ) (banner2018post; choukroun2019low; zhao2019improving; nagel2020up) and quantization-aware training (QAT) (gupta2015deep; jacob2018quantization; wang2019learning; zhuang2021effective). QAT requires simulating quantization in the training process, which invokes time-consuming retraining and hyper-parameter tuning. In contrast, PTQ directly quantizes well-trained models without retraining. However, PTQ still needs training datasets to calibrate (nagel2020up) quantized models, and these datasets are often unavailable due to privacy and security issues, e.g., in medical and confidential scenarios.

In contrast, data-free quantization (DFQ) has recently been presented as a promising way to quantize models without the original datasets (nagel2019data; cai2020zeroq; zhang2021diversifying; xu2020generative; liu2021zero; qin2021diverse; choi2020data). From a deployment perspective, DFQ is the most attractive quantization method since we can apply it to any trained model as a black-box post-processing step. However, current DFQ methods cannot achieve high accuracy and fast processing time simultaneously. Traditionally, DFQ (nagel2019data) adopts the rounding-to-nearest strategy, which causes significant accuracy loss, especially in low-bit settings. To bridge the accuracy gap between data-free and data-driven quantization, researchers have proposed a series of data-generative DFQ methods, which use gradient-based methods to generate fake datasets for trained models. With the synthetic data, they can employ a data-driven calibration and fine-tuning strategy to improve accuracy. However, data generation typically adopts time-consuming gradient-based methods that require multiple iterations to generate each input. For example, prior works often spend hours generating a calibration dataset and fine-tuning the network (xu2020generative; liu2021zero; zhang2021diversifying).

To solve this dilemma, we propose SQuant, a fast and accurate data-free quantization framework for convolutional neural networks, employing the constrained absolute sum of error (CASE) of weights as the rounding metric. By leveraging the Hessian information of the network loss due to quantization, we propose a novel diagonal Hessian approximation, which decomposes the optimization objective into three data-free sub-items: element-wise, kernel-wise, and output channel-wise, each of which corresponds to a single or a set of dimensions of the weight tensor. We progressively compose and optimize these three sub-items in the discrete space. The final approximate objective eliminates the requirement of data generation. We propose a progressive algorithm with linear complexity to solve the optimization objective, further accelerating DFQ time to a sub-second level. For example, SQuant needs an average of only 4 ms to quantize a layer of ResNet18 and 84 ms for the overall network. As it requires neither back-propagation nor fine-tuning, SQuant can run on inference-only devices with limited computation and memory resources on the fly, which opens up new opportunities and scenarios for adopting quantization.

Compared with state-of-the-art DFQ methods, SQuant achieves higher accuracy on all evaluated models under the 4/6/8-bit settings. SQuant introduces only 0.1% accuracy loss on average under the 8-bit setting, and its advantage expands at lower bit-widths: only 1.8% accuracy loss on average under the 6-bit setting, and more than 30% accuracy improvement over data-free PTQ methods under the 4-bit setting. In short, SQuant pushes the accuracy and processing time of DFQ to a new frontier.

## 2 Preliminaries

### 2.1 Notations

We specifically use $X$, $Y$, and $W$ to denote the input, output, and weight variables, respectively. Constants and scalars are denoted by italic letters, e.g., $s$. Column vectors and flattened matrices are denoted by bold lowercase letters, e.g., $\mathbf{w}$, and matrices (or tensors) are represented by uppercase letters, e.g., $W$. The subscript and superscript can further represent the element indices and the layer of a network, respectively, e.g., $W^{\ell}_{m,n,i}$. $\mathbb{E}[\cdot]$ denotes the expectation operator, and the network loss function is represented by $\mathcal{L}$. For convenience in this paper, we call a row of an FC (fully connected layer) weight an output channel and a column an input channel, which are the counterparts to Conv (convolution layer) weights. We use $M$, $N$, and $K$ to denote the output channel size, the input channel size, and kernel height $\times$ kernel width, respectively. Specifically, an FC weight has the shape of $M \times N$, i.e., $K = 1$.

### 2.2 Quantization

Most previous works adopt the rounding-to-nearest approach for quantizing deep neural networks, rounding elements to the nearest quantization grid values with a fixed-point data type. The quantization and dequantization of an element $w$ can be described as $\hat{w} = s \cdot \mathrm{clip}\big(\lfloor w/s \rceil,\, l,\, u\big)$, where $s$ denotes the quantization scale parameter, and $l$ and $u$ are the lower and upper thresholds of the clipping function $\mathrm{clip}(\cdot)$. The operator $\lfloor \cdot \rceil$ represents rounding-to-nearest, i.e., minimizing the mean squared error (MSE) between the quantized and the original value.
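As a concrete illustration, here is a minimal NumPy sketch of this quantize/dequantize round trip; the function names and the signed 4-bit grid are our own illustrative choices, not part of the SQuant implementation.

```python
import numpy as np

def quantize(w, s, l, u):
    """Scale, round-to-nearest, and clip onto the integer grid [l, u]."""
    return np.clip(np.round(w / s), l, u)

def dequantize(q, s):
    """Map integer grid values back to the real domain."""
    return q * s

# Signed 4-bit example: integer grid [-8, 7] with scale s = 0.1
w = np.array([0.30, -1.20, 0.74])
q = quantize(w, 0.1, -8, 7)      # -1.20 saturates at the lower threshold -8
w_hat = dequantize(q, 0.1)       # quantization error is |w_hat - w| per element
```

Note that the clipped element incurs a larger error than the rounding bound of $0.5s$, which is why the choice of scale and thresholds matters.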

### 2.3 Hessian-Based Optimization for Neural Networks

The Hessian-based approach is one of the most promising optimizations to further improve the quantization (dong2019hawq; dong2019hawq2; nagel2020up; shen2020q; qian2020channel; wu2020dissecting; hubara2020improving; li2021brecq; yao2021hawq) and pruning (yu2021hessian) performance of DNN models. Some of those works exploit the Hessian matrix to approximate the loss degradation due to the quantization perturbation of weight, $\Delta W$, by

$$\mathbb{E}\big[\mathcal{L}(X,Y,W+\Delta W)-\mathcal{L}(X,Y,W)\big] \approx \mathbb{E}\Big[\Delta W \cdot g^{W} + \tfrac{1}{2}\,\Delta W \cdot H^{W} \cdot \Delta W^{T}\Big], \tag{1}$$

where the equation comes from the second-order Taylor series expansion, $g^{W}$ is the gradient, and $H^{W}$ is the full network Hessian matrix w.r.t. the original weight $W$. Since a well-trained model has already converged, the gradient term is close to $0$ and thus can be safely ignored. However, computing $H^{W}$ is infeasible because of its large memory overhead and computation complexity. To tackle this problem, we approximate $H^{W}$ as a layer-wise Hessian matrix under the assumption of cross-layer independence (dong2017learning; nagel2020up), i.e., $H^{W^{\ell}} \approx \mathbb{E}\big[x^{\ell} x^{\ell T}\big] \otimes \nabla^{2}_{y^{\ell}}\mathcal{L}$, where $\otimes$ denotes the Kronecker product of two matrices and $\nabla^{2}_{y^{\ell}}\mathcal{L}$ is the Hessian of the task loss w.r.t. the layer output $y^{\ell}$.

For the $m$-th output channel of Conv or FC, $H^{W^{\ell}}$ can be approximately simplified into an output channel-wise form (nagel2020up; yu2021hessian; wu2020dissecting; qian2020channel),

$$H^{W^{\ell}_{m}} \approx \nabla^{2}_{y^{\ell}}\mathcal{L}_{m,m} \cdot x^{\ell} x^{\ell T} = l_{m} \cdot x^{\ell} x^{\ell T}, \tag{2}$$

where $\nabla^{2}_{y^{\ell}}\mathcal{L}$ is approximately a diagonal matrix and $l_{m}$ denotes its $m$-th diagonal element. Then the final optimization objective is

$$\Delta\hat{W}^{\ell}_{m,:} = \arg\min_{\Delta W^{\ell}_{m,:}} \Delta W^{\ell}_{m,:}\, \mathbb{E}\big[H^{W^{\ell}_{m}}\big]\, \Delta W^{\ell\,T}_{m,:} \tag{3}$$
$$= \arg\min_{\Delta W^{\ell}_{m,:}} \Delta W^{\ell}_{m,:}\, \mathbb{E}\big[x^{\ell} x^{\ell T}\big]\, \Delta W^{\ell\,T}_{m,:} = \arg\min_{\Delta W^{\ell}_{m,:}} \mathbb{E}\big[(\Delta W^{\ell}_{m,:}\, x^{\ell})^{2}\big], \tag{4}$$

which is the MSE between the output activations produced from the original and the quantized weights. Each sub-problem deals with a single output channel $m$. We will further approximate Eq. (4) to remove any input data dependency from the optimization objective in Sec. 3.2.
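This identity can be checked numerically. The sketch below (our own illustrative code with random data, not the paper's) verifies that the quadratic form with the empirical second moment $\mathbb{E}[x x^{T}]$ equals the mean squared output-activation error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # flattened weight length of one output channel
dw = rng.normal(size=(1, n))            # weight perturbation ΔW of that channel
xs = rng.normal(size=(1000, n))         # sampled input activations x

H = xs.T @ xs / len(xs)                 # empirical E[x x^T], channel-wise proxy Hessian
quad = float(dw @ H @ dw.T)             # ΔW E[x x^T] ΔW^T, the quadratic form
mse = float(np.mean((xs @ dw.T) ** 2))  # E[(ΔW x)^2], the output MSE

assert np.isclose(quad, mse)            # Eq. (4): the two objectives coincide
```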

## 3 Methodology

### 3.1 Overview

Although we can obtain a good quantization strategy by minimizing the MSE for each output channel, it is an NP-hard combinatorial optimization problem. Even approaching an acceptable local minimum requires significant effort and involves input activations, which breaks the data-free promise.

To avoid the combinatorial optimization problem and eliminate the requirement of data, we propose the SQuant framework. First, SQuant approximates Eq. (4) with three diagonal Hessian matrices corresponding to the dimensions of the weight tensor, in Sec. 3.2. Because quantization uses a fixed-point data type, SQuant transforms the problem into a data-free optimization problem in the discrete domain. SQuant optimizes each layer's weights using a flipping approach (nagel2020up) without any input activations. To achieve our proposed optimization objective, minimizing CASE (Constrained Absolute Sum of Error), SQuant progressively composes the three approximate sub-items under constraint relaxation, introduced in Sec. 3.3. Finally, SQuant works out a flipping set to minimize the CASE of each kernel and output channel. We design an efficient algorithm with linear computation complexity to find a proper flipping set based on Eq. (8), in Sec. 3.4.

### 3.2 Diagonal Hessian Approximation

In this work, we propose a new approximation of the Hessian matrix that covers non-diagonal elements, and we decompose Eq. (4) into three sub-items corresponding to the three dimensions of the weight tensor, as illustrated in Fig. 1: SQuant-E for element-wise optimization covers the diagonal elements of $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$ (H-E); SQuant-K for kernel-wise optimization covers its diagonal blocks (H-K); SQuant-C for output channel-wise optimization covers the whole matrix (H-C).

The expectation $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$ can be approximated by the following equation:

$$\mathbb{E}\big[x^{\ell}x^{\ell T}\big] \approx E + K + C, \tag{5}$$

where $C = c_{m} J_{NK}$,

$$K=\begin{bmatrix} k_{1} J_{K} & & \\ & \ddots & \\ & & k_{N} J_{K} \end{bmatrix}, \qquad E=\begin{bmatrix} e_{1,1} & & \\ & \ddots & \\ & & e_{N,K} \end{bmatrix}.$$

In the above equations, $J_{NK}$ is an all-one matrix with the dimension of $NK \times NK$, and $c_{m}$ is a constant value for the $m$-th output channel. $K$ is a diagonal block matrix, where $J_{K}$ represents an all-one matrix with the dimension of $K \times K$. The $n$-th diagonal block corresponds to the $n$-th kernel in convolution and has its own constant value $k_{n}$. $E$ is a diagonal matrix whose diagonal elements $e_{n,i}$ are constant values, each corresponding to the $i$-th element of the $n$-th kernel.

Eq. (5) provides an approximation that preserves as much information from three different levels of $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$ as possible, which we explain in Appendix A.1. The matrix $C$ catches the common component of the Hessian matrix, while the matrix $E$ reserves the individual components on the diagonal line of the Hessian matrix. In addition, we consider a kernel-wise approximation for convolution layers by using the matrix $K$. For each inference, the weights of a kernel, $W^{\ell}_{m,n,:}$, scan the same feature map. As a result, the corresponding $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$ has nearly the same expectation values in the center area, with a small perturbation in the marginal area due to padding. Therefore, $K$ as a kernel-wise approximation achieves a low approximation error for convolution. For any $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$, we can always find a decomposition that satisfies Eq. (5), for which we present the decomposition method in Appendix A.2. Substituting Eq. (5) into Eq. (4) yields the following equation.

$$\Delta W^{\ell}_{m,:}\,\mathbb{E}\big[x^{\ell}x^{\ell T}\big]\,\Delta W^{\ell\,T}_{m,:} \approx \sum_{n,i} e_{n,i}\,\Delta W^{\ell\;2}_{m,n,i} + \sum_{n} k_{n}\,\Delta W^{\ell}_{m,n,:} J_{K}\, \Delta W^{\ell\,T}_{m,n,:} + c_{m}\,\Delta W^{\ell}_{m,:} J_{NK}\, \Delta W^{\ell\,T}_{m,:}. \tag{6}$$
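The decomposition can be checked numerically. The sketch below (our own illustrative code, with arbitrary positive constants) builds $E$, $K$, and $C$ for one output channel and verifies that the quadratic form with $E+K+C$ equals the three sub-items on the right-hand side of Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 3, 9                              # input channels, kernel height * width (3x3)
e = rng.uniform(0.1, 1.0, size=N * K)    # element-wise constants e_{n,i}
k = rng.uniform(0.1, 1.0, size=N)        # kernel-wise constants k_n
c = 0.5                                  # channel-wise constant c_m

# The three diagonal approximations of E[x x^T] in Eq. (5)
E = np.diag(e)                                 # diagonal of e_{n,i}
Kmat = np.kron(np.diag(k), np.ones((K, K)))    # block diagonal of k_n * J_K
C = c * np.ones((N * K, N * K))                # c_m * J_{NK}

dw = rng.normal(size=N * K)              # perturbation ΔW of one output channel
lhs = dw @ (E + Kmat + C) @ dw           # quadratic form with the approximation

# Eq. (6): the same value, written as three diagonal sub-items
dwk = dw.reshape(N, K)
rhs = (np.sum(e * dw ** 2)
       + np.sum(k * dwk.sum(axis=1) ** 2)
       + c * dw.sum() ** 2)

assert np.isclose(lhs, rhs)
```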

### 3.3 Data-free Optimization

To achieve the data-free optimization objective, we omit the coefficients ($e_{n,i}$, $k_{n}$, and $c_{m}$) in Eq. (6), which leads to the approximate objective in Eq. (8) optimized by our fast SQuant framework. We present the omitting process and empirically verify that the approximation almost does not influence performance in Appendix A.2 and Appendix A.3. It is easy to see that no training samples are needed to minimize

$$\arg\min_{\Delta W^{\ell}_{m,:}} \sum_{n,i}\Delta W^{\ell\;2}_{m,n,i} + \sum_{n}\Delta W^{\ell}_{m,n,:} J_{K}\, \Delta W^{\ell\,T}_{m,n,:} + \Delta W^{\ell}_{m,:} J_{NK}\, \Delta W^{\ell\,T}_{m,:} \tag{7}$$
$$= \arg\min_{\Delta W^{\ell}_{m,:}} \sum_{n,i}\Delta W^{\ell\;2}_{m,n,i} + \sum_{n}\Big(\sum_{i}\Delta W^{\ell}_{m,n,i}\Big)^{2} + \Big(\sum_{n,i}\Delta W^{\ell}_{m,n,i}\Big)^{2}. \tag{8}$$

Next, we transform the overall objective Eq. (8) into the discrete space and explain how to compose and optimize the three approximated sub-items in order. Without loss of generality, we assume all weights have been scaled with the scale parameter $s$ for quantization.

##### Sub-item Analysis

For the element-wise item, i.e., the first item in Eq. (8), the problem is reduced to the following objective, which we call SQuant-E.

$$\Delta\hat{W}^{\ell}_{m,:} = \arg\min_{\Delta W^{\ell}_{m,:}} \sum_{n,i}\Delta W^{\ell\;2}_{m,n,i} = \arg\min_{\Delta W^{\ell}_{m,n,i}} \big|\Delta W^{\ell}_{m,n,i}\big| \;\Leftrightarrow\; \forall\, \Delta\hat{W}^{\ell}_{m,n,i},\ \big|\Delta\hat{W}^{\ell}_{m,n,i}\big| \le r_{e} = 0.5, \tag{9}$$

SQuant-E is essentially the rounding-to-nearest method when $r_{e} = 0.5$. Rounding does not introduce any approximation error and has $O(1)$ complexity for each weight element. However, as many previous works have pointed out (nagel2020up), rounding-to-nearest is not optimal because it only considers the diagonal elements of the matrix while ignoring the remaining majority of elements.

For the kernel-wise item (the second item in Eq. (8)), we have the following objective, called SQuant-K,

$$\Delta\hat{W}^{\ell}_{m,:} = \arg\min_{\Delta W^{\ell}_{m,:}} \sum_{n}\Big(\sum_{i}\Delta W^{\ell}_{m,n,i}\Big)^{2} = \arg\min_{\Delta W^{\ell}_{m,n,:}} \Big|\sum_{i}\Delta W^{\ell}_{m,n,i}\Big| \;\Leftrightarrow\; \forall\, \Delta\hat{W}^{\ell}_{m,n,:},\ \Big|\sum_{i}\Delta\hat{W}^{\ell}_{m,n,i}\Big| \le r_{k} = 0.5, \tag{10}$$

where $\big|\sum_{i}\Delta W^{\ell}_{m,n,i}\big|$ is the Absolute Sum of Error (ASE) of each kernel-wise weight matrix in the convolution, and $r_{k}$ equals 0.5 because of the discrete quantization. In other words, SQuant is based on the insight of the sum of (signed) error instead of the accumulation of absolute (unsigned) error.
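A small numeric example of this insight (with illustrative values of our own): signed rounding errors within a kernel partly cancel, and a single flip can pull an out-of-range ASE back under $r_{k}$.

```python
import numpy as np

# Rounded perturbations of one 3x3 kernel, each within [-0.5, 0.5] (Eq. 9)
dw = np.array([0.4, 0.3, 0.2, -0.1, 0.45, -0.2, 0.35, 0.1, -0.3])

abs_sum = np.abs(dw).sum()   # accumulated unsigned error: 2.4
ase = abs(dw.sum())          # Absolute Sum of (signed) Error: 1.2 > r_k = 0.5

# Flip the element whose perturbation is largest and shares the sign of the
# sum; rounding up becomes rounding down, an integer mutation of -1.
j = int(np.argmax(dw))       # the element with perturbation 0.45
dw[j] -= 1.0                 # its perturbation becomes -0.55, still below 1.0
assert abs(dw.sum()) <= 0.5  # kernel ASE now satisfies r_k = 0.5
```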

Similarly, for the output channel-wise item (the third item in Eq. (8)), we have SQuant-C,

$$\Delta\hat{W}^{\ell}_{m,:} = \arg\min_{\Delta W^{\ell}_{m,:}} \Big(\sum_{n,i}\Delta W^{\ell}_{m,n,i}\Big)^{2} \;\Leftrightarrow\; \forall\, \Delta\hat{W}^{\ell}_{m,:},\ \Big|\sum_{n,i}\Delta\hat{W}^{\ell}_{m,n,i}\Big| \le r_{c} = 0.5. \tag{11}$$
##### Relaxation

Obviously, $r_{k} = 0.5$ conflicts with $r_{e} = 0.5$ because rounding ($r_{e} = 0.5$) only guarantees the upper bound $\big|\sum_{i}\Delta\hat{W}^{\ell}_{m,n,i}\big| \le 0.5K$ for SQuant-K. Some elements need to relax the constraint $r_{e}$ to a larger number, such as $1.0$, to satisfy $r_{k} = 0.5$. Similarly, SQuant-C also needs to relax $r_{k}$.

##### CASE Flipping

We adopt the flipping approach (nagel2020up) to minimize the ASE. Due to the discrete quantization, rounded elements can be flipped (from rounding up to rounding down and vice versa) with an integer mutation. Formally, we need to work out a flipping set $f$ to satisfy the overall objective Eq. (8) by composing the three sub-items in order (SQuant-E → SQuant-K → SQuant-C) with constraint relaxation. After optimization, the perturbation will be

$$\forall\, (m,n,i) \notin f_{m},\ \big|\Delta W^{\ell}_{m,n,i}\big| \le 0.5; \qquad \forall\, (m,n,j) \in f_{m},\ 0.5 \le \big|\Delta W^{\ell}_{m,n,j}\big| < 1.0, \tag{12}$$

where $f_{m}$ is the index set of flipped elements for the $m$-th output channel. Specifically, we need to flip the elements whose perturbation has the same sign as the accumulated perturbation $\sum_{n,i}\Delta W^{\ell}_{m,n,i}$. We prove the equivalence for Eq. (10) and Eq. (11) by illustrating the transformation process to a discrete problem in Appendix B.1.

However, any $k$ flipped elements can satisfy Eq. (10), leading to a large search space. Fortunately, based on Eq. (9), SQuant-K can select the specific elements with the top-$k$ largest perturbations because they will have the smallest perturbations after flipping under the constraint of SQuant-E. Therefore, we adopt the Constrained ASE (CASE) to optimize the SQuant-E&K composition via the top-$k$ perturbation algorithm, which is the only solution for minimizing the CASE, as proven in Appendix B.2. Obviously, SQuant-E&K&C needs to "flip" the "SQuanted" kernels after SQuant-E&K. Notice that we can only flip one element in a kernel to satisfy the constraint $r_{k}$.

The following section will introduce an efficient SQuant algorithm with a linear computation complexity for CASE flipping.

### 3.4 On-the-Fly SQuant

##### Progressive Algorithm

We design a progressive algorithm illustrated in algorithm LABEL:alg:overall to meet our stated optimization objective, i.e., minimizing the CASE of weight perturbation. The critical insight of the progressive algorithm is to gradually calibrate the deviation from the optimal global solution introduced by the fine-grained diagonal sub-item. To calibrate the SQuant-E, SQuant-K flips certain rounded elements. After the SQuant-K calibration, SQuant-C then further flips SQuanted kernels.

We start by rounding the weight and updating its perturbation to satisfy $r_{e} = 0.5$ (Lines 4-5). Then we run SQuant-K (Line 6) to flip specific elements under the relaxed $r_{e} = 1.0$, satisfy $r_{k} = 0.5$, and update the kernel perturbation (Line 7). The follow-up SQuant-C (Line 8) further flips specific kernels under the relaxed $r_{k} = 1.0$ and satisfies $r_{c} = 0.5$. Finally, we derive the quantized weights (Line 9).

algocf[h]

##### Flip Algorithm

SQuant-K and SQuant-C can utilize the same flip function. The goal of the flip algorithm, depicted in algorithm LABEL:alg:squant, is to find a proper element set whose flipping minimizes the CASE. First, we compute the accumulated perturbation of the kernel (Line 2). We select weights with positive perturbation to decrease a positive accumulated perturbation, and vice versa for a negative one; therefore, we disable the elements whose perturbation has a different sign (Line 3). Obviously, we need only as many elements as the rounded accumulated perturbation to reduce it below 0.5 (Line 4). Finally, we flip the weights with the largest perturbations (Lines 5-6). For now, we have SQuanted the kernel and tuned the kernel CASE to within 0.5. Specifically, for FC and Conv with a kernel size of $1 \times 1$, we can skip SQuant-K. As mentioned in Section 3.3, SQuant-C flips only one element in each kernel. Therefore, we update the kernel perturbation (Line 7 of algorithm LABEL:alg:overall) for SQuant-C to flip kernels, as illustrated in Appendix B.3. As a result, SQuant successfully identifies the optimal combination with a low computation complexity, which we analyze in Appendix B.4.
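To make the flip step concrete, here is a simplified, self-contained NumPy sketch of a kernel-level flip function. It mirrors the steps above (accumulate, mask the other sign, pick the top-k, flip), but it is our own illustration under those assumptions, not the released SQuant code, and it omits the SQuant-C kernel-flip stage.

```python
import numpy as np

def flip_kernel(w, dw):
    """Flip rounded elements of one kernel until its ASE drops below 0.5.

    w:  rounded weights of one kernel (flattened integer grid values)
    dw: signed rounding perturbations w - w_original, each in [-0.5, 0.5]
    """
    p = dw.sum()                      # accumulated (signed) perturbation
    k = int(np.rint(abs(p)))          # number of flips that brings |p| <= 0.5
    if k == 0:
        return w, dw
    sign = np.sign(p)
    # Disable elements whose perturbation has a different sign than p,
    # then pick the top-k largest same-sign perturbations.
    cand = np.where(np.sign(dw) == sign, np.abs(dw), -np.inf)
    idx = np.argsort(cand)[-k:]
    w, dw = w.copy(), dw.copy()
    w[idx] -= sign                    # integer mutation: round the other way
    dw[idx] -= sign                   # flipped perturbation stays below 1.0
    return w, dw

# SQuant-E (rounding-to-nearest) followed by the kernel-level flip
orig = np.array([0.4, 1.3, -0.8, 2.45, 0.35, -1.1, 0.2, 0.3, -0.4])
w = np.round(orig)
dw = w - orig                         # ASE = |sum| = 1.7 before flipping
w, dw = flip_kernel(w, dw)
assert abs(dw.sum()) <= 0.5           # the kernel now satisfies r_k = 0.5
```

Each of the $M \times N$ kernels can be processed independently, which is what makes the procedure easy to parallelize on an accelerator.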

algocf[h]

##### On-the-Fly Framework

From the overall perspective of the optimization, SQuant-K has $M \times N$ sub-problems, while SQuant-C has $M$ sub-problems. Because the sub-problems are independent, SQuant is friendly to DNN accelerators, e.g., GPUs, allowing each sub-problem to be solved in parallel. Without the requirement of back-propagation or fine-tuning, SQuant can run on inference-only devices with constrained computation and memory resources on the fly. That provides new opportunities for optimizing weight quantization. In the next section, we demonstrate the impressive efficiency and accuracy of SQuant.

## 4 Experiments

To demonstrate the strength of SQuant, we evaluate SQuant as well as four SOTA methods, DFQ (nagel2019data), ZeroQ (cai2020zeroq), DSG (zhang2021diversifying; qin2021diverse), and GDFQ (xu2020generative), with 5 different CNN models, including ResNet-18 & 50 (he2016deep), Inception V3 (szegedy2016rethinking), SqueezeNext (gholami2018squeezenext), and ShuffleNet (zhang2018shufflenet), on the gold-standard ImageNet dataset (krizhevsky2012imagenet).

In our experiments, SQuant is dedicated to weight quantization, including setting the quantization range and selecting the grid points with per-channel quantization, which is friendly for hardware accelerators. With the BN-based approach, we adopt a simple rounding method and a wide quantization range for activations, as suggested by DFQ (nagel2019data), without breaking the data-free premise. We clip activation tensors in a layer-wise manner (per-tensor) and utilize a uniform distribution as the initialization for the activation quantization range. All DFQ algorithms are implemented with PyTorch (paszke2019pytorch) and evaluated on an Nvidia A100-40GB GPU. Unless otherwise stated, we employ both weight and activation quantization in all experiments. Also, uniform quantization grids are used in all experiments, and the hyper-parameters for all SQuant experiments are the same.

### 4.1 Comparison to SOTA Methods

Table 1 and Table 2 show the results on the ImageNet dataset for various bit-width choices, comparing our SQuant against other data-free methods. Among these methods, ZeroQ, DSG, and GDFQ are data-generative approaches with back-propagation. The former two are PTQ methods, while the last is a QAT method that retrains the network with synthetic data. DFQ is the only truly data-free method, relying on weight equalization and bias correction.

Experiments show that SQuant significantly outperforms all the other SOTA DFQ methods, even those that calibrate their networks with synthetic datasets. The 8-bit quantization preserves network accuracy better than lower-bit quantization does because of its higher precision, and the benefit of SQuant becomes more prominent as the bit-width decreases. SQuant outperforms the PTQ methods, i.e., DFQ, ZeroQ, and DSG, by more than 30% on all models with 4-bit quantization. It is noteworthy that SQuant surpasses GDFQ in all cases, by more than 15% on ResNet50 under 4-bit quantization, even though GDFQ is a quantization-aware training method.

Table 1 and Table 2 also show that GDFQ significantly outperforms ZeroQ and DSG under lower-bit settings (e.g., 4-bit). Since we use the same activation quantization method when evaluating these methods, the results indicate that weight quantization plays a critical role in overall model quantization. However, GDFQ requires fine-tuning (FT) with back-propagation (BP). In contrast, SQuant adopts a direct optimization objective on the weight perturbation, requires neither fine-tuning nor BP, and still outperforms GDFQ in the 4-bit setting. These results clearly illustrate the advantage of SQuant, a CASE-based optimization framework that minimizes the CASE of the weight perturbation.

### 4.2 SQuant Efficiency

The trade-off between efficiency and accuracy is challenging for previous DFQ methods. Before SQuant, DFQ was the fastest method since it does not require back-propagation or fine-tuning, but it performs poorly, especially in low-bit cases. GDFQ performs relatively well but takes hours to complete the 400 epochs that produce synthetic data from the weights and fine-tune the network. SQuant employs the direct optimization objective of minimizing the CASE of the weight perturbation, pushing the quantization procedure to a sub-second level. Table 3 shows the 4-bit quantization time of the five models using SQuant, ZeroQ, and GDFQ. The efficient algorithm design also contributes to the surprising results. Note that the SQuant results in Table 3 are the sum of the quantization time of all layers; quantization would be even faster if we quantized layers in parallel. A single layer takes SQuant just 3 milliseconds on average because SQuant does not involve complex algorithms such as back-propagation and fine-tuning. That means we can implement the SQuant algorithm on inference-only devices such as smartphones and IoT devices and quantize networks on the fly.

### 4.3 Ablation study

##### SQuant Granularity

We decouple the effect of SQuant-K and SQuant-C, which have different granularities to optimize CASE. As shown in Table 5, their accuracies both outperform SQuant-E (i.e., rounding), and combining them leads to higher accuracy for ResNet18. SQuant-E&C has a lower accuracy than SQuant-E&K because SQuant-C has a more significant approximation error than SQuant-K. On the other hand, SQuant-E alone is not optimal because it uses a smaller granularity and ignores a large amount of Hessian information as we analyze in Section 3. This ablation study shows that SQuant-E&K&C achieves the best accuracy by exploiting the most Hessian information (H-C), and SQuant-E&K also achieves a higher accuracy with H-K than SQuant-E with H-E.

## 5 Related Work

Compression is a promising method to reduce the DNN model's memory and computation cost. Pruning (han2015learning; han2015deep) is one of the effective approaches to exploit the inherent redundancy of DNNs. However, pruning causes sparse, irregular memory accesses and therefore needs software (gale2020sparse; guan2020far; qiu2019adversarial; guo2020accelerating; guan2021block; fedus2021switch) and hardware (gondimalla2019sparten; guo2020balancing; zhang2020sparch; wang202100088) optimizations for acceleration.

Quantization is more practical because it can be supported directly by existing accelerators. Quantization-aware training (QAT) (gupta2015deep; jacob2018quantization; wang2019learning; zhuang2021effective) is one of the most promising techniques to retrain networks and mitigate the accuracy drop introduced by quantization. However, the training procedure is time-consuming and costly. Therefore, post-training quantization (PTQ) (banner2018post; choukroun2019low; zhao2019improving; nagel2020up) has earned lots of attention due to the absence of any fine-tuning or retraining process, at the expense of accuracy.

Recently, several methods for CNN quantization without the original training datasets have been proposed. These methods are known as data-free quantization (DFQ), including PTQ (nagel2019data; cai2020zeroq; zhang2021diversifying) and QAT (xu2020generative; liu2021zero; qin2021diverse; choi2020data) approaches. DFQ (nagel2019data) and ACIQ (banner2018post) rely on weight equalization or bias correction without requiring synthetic data. Other works synthesize data to calibrate or fine-tune the network based on batch normalization statistics (cai2020zeroq) or adversarial knowledge distillation techniques (liu2021zero; choi2020data).

## 6 Conclusion

This paper approximates and decomposes the original Hessian-based optimization objective into the CASE of the weight perturbation under a data-free premise. Surprisingly, CASE only involves the weight perturbation and requires no knowledge of any datasets or network architecture. Based on that, we proposed the on-the-fly SQuant framework, which uses a progressive algorithm to minimize CASE directly and significantly improves accuracy over other DFQ methods. SQuant considerably reduces the optimization complexity and accelerates the data-free quantization procedure, which previously required back-propagation with massive computation and memory resource consumption. In summary, SQuant outperforms other data-free quantization approaches in terms of accuracy and pushes the quantization processing time to a sub-second level.

#### Acknowledgments

We would like to thank the anonymous reviewers for their constructive feedback. This work was supported by the National Key R&D Program of China under Grant 2021ZD0110104, and the National Natural Science Foundation of China (NSFC) grant (U21B2017, 62072297, and 61832006).

## Appendix A Approximation and Decomposition

### A.1 Approximated Hessian Matrix for Data-Free Quantization

The quantization loss function for the entire network is

$$\mathcal{L}(\Delta W) = \Delta W\, \mathbb{E}[H]\, \Delta W^{T}. \tag{13}$$

Consider a convolution layer defined as

$$Y_{m,h,w} = \sum_{n,i,j} W_{m,n,i,j}\, X_{n,h-i,w-j}. \tag{14}$$

Here, $Y$ has three dimensions, output channel, output feature map height, and output feature map width, i.e., indexing by $(m, h, w)$; $W$ has four dimensions, output channel, input channel, kernel height, and kernel width, i.e., indexing by $(m, n, i, j)$; and $X$ has three dimensions, input channel, input feature map height, and input feature map width, i.e., indexing by $(n, h, w)$. Ignoring the interaction between layers and output channels following nagel2020up, for a specific convolution layer $\ell$ and output channel $m$, the elements of the corresponding output channel-wise Hessian are

$$\begin{aligned} H^{W^{\ell}_{m}}_{n,i,j,n',i',j'} &= \frac{\partial^{2}\mathcal{L}}{\partial W_{m,n,i,j}\,\partial W_{m,n',i',j'}} &\text{(15)}\\ &= \frac{\partial}{\partial W_{m,n',i',j'}} \sum_{h,w} \frac{\partial \mathcal{L}}{\partial Y_{m,h,w}}\, \frac{\partial Y_{m,h,w}}{\partial W_{m,n,i,j}} &\text{(16)}\\ &= \frac{\partial}{\partial W_{m,n',i',j'}} \sum_{h,w} \frac{\partial \mathcal{L}}{\partial Y_{m,h,w}}\, X_{n,h-i,w-j} &\text{(17)}\\ &= \sum_{h,w} \left( \frac{\partial}{\partial W_{m,n',i',j'}}\, \frac{\partial \mathcal{L}}{\partial Y_{m,h,w}} \right) X_{n,h-i,w-j} &\text{(18)}\\ &= \sum_{h,w} \left( \sum_{h',w'} \frac{\partial^{2}\mathcal{L}}{\partial Y_{m,h,w}\,\partial Y_{m,h',w'}}\, X_{n',h'-i',w'-j'} \right) X_{n,h-i,w-j} &\text{(19)}\\ &= \sum_{h,w} \sum_{h',w'} \frac{\partial^{2}\mathcal{L}}{\partial Y_{m,h,w}\,\partial Y_{m,h',w'}}\, X_{n,h-i,w-j}\, X_{n',h'-i',w'-j'}. &\text{(20)} \end{aligned}$$

Assuming $\partial^{2}\mathcal{L} / (\partial Y_{m,h,w}\,\partial Y_{m,h',w'})$ is diagonal, i.e., zero whenever $(h,w) \neq (h',w')$, yields Eq. (30) in (nagel2020up):

$$H^{W^{\ell}_{m}}_{n,i,j,n',i',j'} \approx \sum_{h,w} \frac{\partial^{2}\mathcal{L}}{\partial Y^{2}_{m,h,w}}\, X_{n,h-i,w-j}\, X_{n',h-i',w-j'}. \tag{21}$$

To make Eq. (13) irrelevant to the training samples, we assume that the input feature maps auto-correlate with each other in a similar way, resulting in

$$\mathbb{E}\Big[H^{W^{\ell}_{m}}_{n,i,j,n',i',j'}\Big] = \mathbb{E}\left[\sum_{h,w} \frac{\partial^{2}\mathcal{L}}{\partial Y^{2}_{m,h,w}}\, X_{n,h-i,w-j}\, X_{n',h-i',w-j'}\right] \approx c_{m} \tag{22}$$

for all $n$, $i$, $j$, $n'$, $i'$, and $j'$, where $c_{m}$ is a constant. It should be noted that Eq. (22) is a strong assumption. For a more accurate approximation, we further look into each input channel (i.e., $n = n'$) and find

$$\mathbb{E}\Big[H^{W^{\ell}_{m,n}}_{i,j,i',j'}\Big] = \mathbb{E}\left[\sum_{h,w} \frac{\partial^{2}\mathcal{L}}{\partial Y^{2}_{m,h,w}}\, X_{n,h-i,w-j}\, X_{n,h-i',w-j'}\right] \approx k_{m,n}, \tag{23}$$

for all $i$, $j$, $i'$, and $j'$, where $k_{m,n}$ is a constant. This generally holds because the kernel size is usually much smaller than the size of the feature maps, so the shift introduced by different $i$, $j$, $i'$, and $j'$ is a small perturbation compared with the summation over the entire feature map. Finally, we focus on each diagonal element of the Hessian matrix (i.e., $n = n'$, $i = i'$, and $j = j'$) and denote

$$\mathbb{E}\Big[H^{W^{\ell}_{m,n,i,j}}\Big] = \mathbb{E}\left[\sum_{h,w} \frac{\partial^{2}\mathcal{L}}{\partial Y^{2}_{m,h,w}}\, X^{2}_{n,h-i,w-j}\right] = e_{m,n,i,j}, \tag{24}$$

where $e_{m,n,i,j}$ is a constant. Please note that the output channel-wise expected Hessian matrix $\mathbb{E}\big[H^{W^{\ell}_{m}}\big]$ is a principal submatrix of $\mathbb{E}\big[H^{W^{\ell}}\big]$, so it must be positive semi-definite. Therefore, we require $e_{m,n,i,j} \ge 0$ to ensure the approximation to $\mathbb{E}\big[H^{W^{\ell}_{m}}\big]$ is also positive semi-definite and nontrivial. Considering Eq. (22), Eq. (23), and Eq. (24) at the same time, we can get the approximation to the expected Hessian shown in Eq. (5). Extending the discussion to fully connected layers is straightforward and thus omitted here.

### A.2 Decomposition

In this section, we present the decomposition method for $\mathbb{E}\big[x^{\ell}x^{\ell T}\big]$, illustrated in algorithm LABEL:alg:Hessian. First, we construct three matrices with the shape of $NK \times NK$: $C' = J_{NK}$,

$$K'=\begin{bmatrix} J_{K} & & \\ & \ddots & \\ & & J_{K} \end{bmatrix}, \qquad E'=\begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}.$$

Here, $J_{NK}$ is an all-one matrix with dimension $NK \times NK$, and $J_{K}$ represents an all-one matrix with dimension $K \times K$. The $n$-th diagonal block of $K'$ corresponds to the $n$-th kernel in convolution. $E'$ is the identity, i.e., a diagonal matrix whose diagonal elements are 1.

algocf[h]     In algorithm LABEL:alg:Hessian, $H = \mathbb{E}\big[x^{\ell}x^{\ell T}\big]$, and we can get the matrices $C$, $K$, and $E$. Evidently, algorithm LABEL:alg:Hessian can make $c_{m} \ge 0$, $k_{n} \ge 0$, and $e_{n,i} \ge 0$ for any $H$.

### A.3 Approximation Error Analysis

To achieve the data-free optimization objective, we omit the coefficients ($e_{n,i}$, $k_{n}$, and $c_{m}$) in Eq. (6), which leads to the approximate objective in Eq. (8) optimized by our fast SQuant framework. The approximation error is insignificant, as our comprehensive results have shown with the high accuracy of the final quantized models in Table 1 and Table 2 of the manuscript. The intuition behind the approximation is that we use an iterative process that progressively reduces each term of Eq. (6). Because each term's coefficient ($e_{n,i}$, $k_{n}$, or $c_{m}$) is positive, reducing each term generally leads to a reduction of the precise objective in Eq. (6). In this section, we provide an empirical analysis of the approximation error between Eq. (6) and Eq. (8).

In this empirical experiment, we use the real dataset to generate the precise coefficients in Eq. (6). To quantify the approximation error in our SQuant framework, we evaluate a metric called approximation precision and show that we achieve nearly 95% approximation precision.

Since SQuant uses the flipping-based iterative optimization framework to minimize Eq. (8), we define an element as correctly flipped if its flipping decreases both the precise objective Eq. (6) and the approximate objective Eq. (8). The approximation precision (AP) is the ratio of elements flipped under the data-free Eq. (8) that are also correct under the data-driven Eq. (6), i.e.,

$$\mathrm{AP}=\frac{\text{Number of correctly flipped elements}}{\text{Number of flipped elements}}.$$

We perform the above approximation error analysis on ResNet18 with ImageNet under 4-bit weight-only quantization. We evaluate the SQuant-quantized weights on the inference datasets and compute the precise coefficients with 1000 samples. Table 6 shows the results, which clearly indicate that SQuant-E&K&C achieves nearly 100% approximation precision. In other words, nearly all flipped elements indeed reduce the precise objective in Eq. (6). Based on this empirical study, we conclude that the approximation from Eq. (6) to Eq. (8) is effective for our data-free quantization.
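The AP metric itself is a few lines of code. The sketch below (a hypothetical helper, not the released evaluation code) takes the per-flip change of each objective, where a negative value means the flip reduced that objective:

```python
import numpy as np

def approximation_precision(d_precise, d_approx):
    """AP = fraction of flipped elements (chosen by the approximate
    objective) whose flip also decreases the precise objective.

    d_precise / d_approx: per-element objective changes caused by a flip;
    negative means the flip reduces that objective."""
    d_precise = np.asarray(d_precise)
    d_approx = np.asarray(d_approx)
    flipped = d_approx < 0                 # flips chosen by the data-free objective
    correct = flipped & (d_precise < 0)    # flips that also help the precise one
    return correct.sum() / flipped.sum()

# Toy example: 5 flips are chosen by the approximate objective;
# 4 of them also reduce the precise objective, so AP = 0.8.
dp = [-0.3, -0.1, 0.2, -0.4, -0.2, 0.1]
da = [-0.2, -0.1, 0.05, -0.3, -0.2, -0.05]
print(approximation_precision(dp, da))
```

In the paper's experiment, the analogous ratio is computed over all elements flipped by SQuant on real coefficient estimates.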

## Appendix B SQuant Algorithm

### b.1 Discrete Optimization Problem

We introduce the transformation of the discrete optimization problem. Quantization rounds the scaled elements to the integer grid. Each quantization step has two rounding directions, rounding up and rounding down, with a step size of 1. We show a quantization example in Fig. 2.

Clearly, an element $w$ can be rounded up (down) to $\lceil w \rceil$ ($\lfloor w \rfloor$) with an original perturbation of $\lceil w \rceil - w$ ($\lfloor w \rfloor - w$), and flipped to $\lfloor w \rfloor$ ($\lceil w \rceil$) with a flipped perturbation of $\lfloor w \rfloor - w$ ($\lceil w \rceil - w$), i.e., a mutation of $-1$ ($+1$). Each flipping operation leads to an integer mutation of $\pm 1$ and increases the element perturbation to at most $1$. We prove that we can always find elements to flip and reduce the kernel-wise accumulated perturbation $|\sum_i \Delta W^\ell_{m,n,i}|$ to at most $0.5$.

###### Proof.

Assume the $n$-th kernel has $a + b$ rounded elements: $a$ rounded-up elements with index set $f_a$ have positive perturbation, and $b$ rounded-down elements with index set $f_b$ have negative perturbation. Then, we have

$$\Big|\sum_i \Delta W^\ell_{m,n,i}\Big| = \Big|\sum_{t\in f_a}\Delta W^\ell_{m,n,t} - \sum_{j\in f_b}\big|\Delta W^\ell_{m,n,j}\big|\Big| \quad (25)$$
$$\le \max\Big(\sum_{t\in f_a}\Delta W^\ell_{m,n,t},\ \sum_{j\in f_b}\big|\Delta W^\ell_{m,n,j}\big|\Big). \quad (26)$$

Without loss of generality, let $\sum_{t\in f_a}\Delta W^\ell_{m,n,t} \geq \sum_{j\in f_b}\big|\Delta W^\ell_{m,n,j}\big|$; then

$$\max\Big(\sum_{t\in f_a}\Delta W^\ell_{m,n,t},\ \sum_{j\in f_b}\big|\Delta W^\ell_{m,n,j}\big|\Big) = \sum_{t\in f_a}\Delta W^\ell_{m,n,t} \quad (27)$$
$$\le 0.5 \cdot a. \quad (28)$$

Therefore, we can always find $k$ elements in $f_a$, with $k = \mathrm{round}\big(|\sum_i \Delta W^\ell_{m,n,i}|\big)$, to flip down and make $|\sum_i \Delta \hat{W}^\ell_{m,n,i}| \leq 0.5$. For example, if $\sum_i \Delta W^\ell_{m,n,i} = 2.8$, we need to flip 3 elements with positive perturbation (rounding up) to negative (rounding down) with a $-1$ mutation. Then, we have

$$\Delta\hat{W}^\ell_{m,n,:}=\mathop{\arg\min}_{\Delta W^\ell_{m,n,:}}\Big(\sum_i \Delta W^\ell_{m,n,i}\Big)^2 \;\Rightarrow\; \Big|\sum_i \Delta\hat{W}^\ell_{m,n,i}\Big| = \Big|k-\big|\sum_i \Delta W^\ell_{m,n,i}\big|\Big|. \quad (29)$$

For each kernel $(m, n)$, $|\sum_i \Delta\hat{W}^\ell_{m,n,i}|$ then attains its minimum value, which is at most $0.5$. The sufficiency of Eq. (10) has been proven:

$$\Delta\hat{W}^\ell_{m,:}=\mathop{\arg\min}_{\Delta W^\ell_{m,n,:}}\Big(\sum_i \Delta W^\ell_{m,n,i}\Big)^2 \;\Rightarrow\; \forall\, \Delta\hat{W}^\ell_{m,n,:},\ \Big|\sum_i \Delta\hat{W}^\ell_{m,n,i}\Big| \leq r_k = 0.5. \quad (30)$$

When no element is flipped and all element perturbations lie in $[-0.5, 0.5]$, the original $|\sum_i \Delta W^\ell_{m,n,i}|$ has the upper bound $0.5 \cdot a$. ∎
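The bound can be checked numerically. This sketch (ours, not the released implementation) rounds random kernels and flips $\mathrm{round}(|\sum_i \Delta W^\ell_{m,n,i}|)$ same-sign elements by one step, after which the kernel-wise accumulated perturbation never exceeds $0.5$:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    w = rng.uniform(-4, 4, size=9)           # one 3x3 kernel, flattened
    q = np.round(w)                          # rounding: each |dW| <= 0.5
    d = q - w                                # original perturbations
    s = d.sum()                              # accumulated kernel perturbation
    k = int(round(abs(s)))                   # number of elements to flip
    if k > 0:
        # flip the k largest same-sign perturbations by one integer step
        idx = np.argsort(-d * np.sign(s))[:k]
        q[idx] -= np.sign(s)
    assert abs((q - w).sum()) <= 0.5 + 1e-9  # kernel CASE bounded by 0.5
print("all kernels within the 0.5 bound")
```

Note that each flip changes the kernel sum by exactly $\mp 1$ regardless of which element is flipped, so only the *number* of flips determines the final accumulated perturbation; which elements to flip is settled by the top-$k$ argument in Appendix B.2.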

###### Proof.

If we flip $k - 1$ or $k + 1$ elements for the $n$-th kernel, where $k = \mathrm{round}\big(|\sum_i \Delta W^\ell_{m,n,i}|\big)$, the accumulated perturbation reduces to $\big||\sum_i \Delta W^\ell_{m,n,i}| - (k-1)\big|$ and $\big||\sum_i \Delta W^\ell_{m,n,i}| - (k+1)\big|$, respectively. Obviously,

$$\Big|\big|\textstyle\sum_i \Delta W^\ell_{m,n,i}\big| - (k-1)\Big| \geq 0.5, \quad (31)$$
$$\Big|\big|\textstyle\sum_i \Delta W^\ell_{m,n,i}\big| - (k+1)\Big| \geq 0.5. \quad (32)$$

With other numbers of flipped elements $k' \neq k$, we can draw the same conclusion. Therefore, when $\big||\sum_i \Delta W^\ell_{m,n,i}| - k\big| < 0.5$, there is only one value, i.e., the minimum value, achieved with $k$ flipped elements satisfying Eq. (10). The necessity of Eq. (10) has been proven:

$$\mathop{\arg\min}_{\Delta W^\ell_{m,n,:}}\Big(\sum_i \Delta W^\ell_{m,n,i}\Big)^2 \;\Leftrightarrow\; \forall\, \Delta\hat{W}^\ell_{m,n,:},\ \Big|\sum_i \Delta\hat{W}^\ell_{m,n,i}\Big| \leq r_k = 0.5. \quad (33)$$

Similarly, we can extend all conclusions to SQuant-C. ∎

For SQuant, we only consider flipping by one quantization step and select the elements whose perturbation sign is the same as that of the accumulated kernel perturbation, because flipping with more quantization steps (e.g., flipping an element two steps away from its rounded value) or flipping elements with the opposite perturbation sign causes a more significant perturbation and violates Eq. (8). We explain this in the next section.

### b.2 Proof of Top-k Perturbation Algorithm

###### Proof.

We will prove that SQuant-E&K leads to the top-$k$ algorithm. We have the composed SQuant-E and SQuant-K optimization objective for the $n$-th kernel,

$$\mathop{\arg\min}_{\Delta W^\ell_{m,n,:}}\ \sum_i \big(\Delta W^\ell_{m,n,i}\big)^2 + \Big(\sum_i \Delta W^\ell_{m,n,i}\Big)^2, \quad (34)$$

which is the first two items of Eq. (8). Without loss of generality, we assume the accumulated perturbation $e = \sum_i \Delta W^\ell_{m,n,i} > 0$; then SQuant needs to flip $k = \mathrm{round}(e)$ elements with positive perturbation to transform $e$ to $e - k$, and $(e - k)^2$ is constant regardless of which elements are flipped. Therefore, the flipped elements are only determined by the first item of Eq. (34), $\sum_i (\Delta W^\ell_{m,n,i})^2$. We denote $f$ as the index set of the flipped elements and $O_i$ as the original perturbation of the $i$-th element in the $n$-th kernel. Therefore, the flipped perturbation of element $j \in f$ is $1 - |O_j|$. Substituting $f$ and $O$ in Eq. (34), we have the optimization objective after flipping elements,

$$\begin{aligned}
&\mathop{\arg\min}_{f}\ \sum_{t\notin f}|O_t|^2 + \sum_{j\in f}\big(1-|O_j|\big)^2 + (e-k)^2, && t\notin f,\ j\in f,\ O_j>0 \quad (35)\\
=\;&\mathop{\arg\min}_{f}\ \sum_{i}|O_i|^2 - \sum_{j\in f}|O_j|^2 + \sum_{j\in f}\big(1-|O_j|\big)^2, && j\in f,\ O_j>0 \quad (36)\\
=\;&\mathop{\arg\min}_{f}\ \sum_{j\in f}\Big[\big(1-|O_j|\big)^2-|O_j|^2\Big], && j\in f,\ O_j>0 \quad (37)\\
=\;&\mathop{\arg\min}_{f}\ \sum_{j\in f}\big(1-2|O_j|\big), && j\in f,\ O_j>0 \quad (38)\\
=\;&\mathop{\arg\max}_{f}\ \sum_{j\in f}|O_j|, && j\in f,\ O_j>0. \quad (39)
\end{aligned}$$

Therefore, Eq. (39) is essentially the top-$k$ perturbation algorithm. We can easily extend the top-$k$ algorithm to SQuant-C and design the perturbation update algorithm in Appendix B.3.
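Eq. (39) translates directly into a top-$k$ selection over same-sign perturbations. The sketch below (function name assumed; not the released SQuant API) quantizes one kernel accordingly:

```python
import numpy as np

def squant_flip_kernel(w):
    """Data-free flip for one kernel, a sketch of Eqs. (34)-(39):
    round to nearest, then flip the k elements with the largest same-sign
    perturbations, where k = round(|e|) and e is the accumulated error."""
    q = np.round(w)
    o = q - w                       # original perturbations O_i
    e = o.sum()                     # accumulated kernel perturbation
    k = int(round(abs(e)))
    if k > 0:
        # Eq. (39): picking the largest |O_j| with sign(O_j) == sign(e)
        # minimizes the element-wise MSE among all k-element flips.
        idx = np.argsort(-o * np.sign(e))[:k]
        q[idx] -= np.sign(e)
    return q

w = np.array([0.4, 1.4, -0.6, 2.3, 0.45])
q = squant_flip_kernel(w)
print(q, abs((q - w).sum()))
```

The result stays within one integer step of plain rounding for every element, while the kernel-wise accumulated perturbation is driven below $0.5$.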

### b.3 Perturbation Update Algorithm

SQuant-K initializes all rounded elements as flip candidates. After SQuant-K, we update the flip candidates for SQuant-C as shown in algorithm LABEL:alg:updatep, based on the insight of the top-$k$ perturbation algorithm (Appendix B.2).

##### Over SQuant

First, we define the situation where the flipped kernel perturbation crosses zero, i.e., the accumulated perturbation changes sign, as "Over SQuant" (line 6). For example, if we have a kernel with an accumulated perturbation of $2.8$, we need to SQuant it to $-0.2$ to satisfy the $0.5$ bound by flipping 3 elements. Obviously, when SQuant-C needs this kernel to calibrate, the last flipped element should be the first and only candidate (lines 7, 8) to flip back to the original rounded number, because it has the largest element perturbation among the flipped elements and the smallest element perturbation after it flips back. The opposite sign case is handled symmetrically.

##### Under SQuant

For "Under SQuant" (line 9), we make the first un-flipped element the flip candidate (lines 10, 11) for SQuant-C; when SQuant-C flips this element of the kernel, it leads the kernel to "Over SQuant" with an absolute kernel perturbation in $[0.5, 1.5]$.

algocf[h]

Finally, each kernel has only one candidate flip element for SQuant-C to satisfy Eq. (8). In practice, it is easy to fuse the perturbation update algorithm with the flip algorithm without extra overhead.
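A simplified sketch of this candidate bookkeeping is shown below (helper name assumed; the real algorithm LABEL:alg:updatep also tracks the sign of the kernel perturbation, which this sketch omits). "Over SQuant" kernels expose their last flipped element, i.e., the flipped element with the smallest original $|O|$ and hence the cheapest to flip back; "Under SQuant" kernels expose the un-flipped element with the largest $|O|$:

```python
import numpy as np

def squant_c_candidate(o, flipped):
    """Return one flip candidate per kernel for SQuant-C.

    o: original rounding perturbations of the kernel.
    flipped: boolean mask of elements already flipped by SQuant-K."""
    if flipped.any():
        # Over SQuant: the last flipped element (smallest original |O|,
        # largest flipped perturbation 1 - |O|) may flip back.
        cand = np.flatnonzero(flipped)
        return cand[np.argmin(np.abs(o[cand]))], "flip back"
    # Under SQuant: the first un-flipped element (largest |O|) is the
    # next candidate to flip forward.
    cand = np.flatnonzero(~flipped)
    return cand[np.argmax(np.abs(o[cand]))], "flip"

o = np.array([-0.45, -0.40, -0.10, 0.30])
print(squant_c_candidate(o, np.array([True, True, False, False])))
print(squant_c_candidate(o, np.zeros(4, dtype=bool)))
```

Either way, each kernel contributes exactly one candidate, which is what lets SQuant-C run as a single top-$k$ pass over kernels.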

### b.4 Complexity Analysis

The original optimization problem described by Eq. (4) is NP-hard, with a search space exponential in the number of weight elements. Based on the SQuant approximation, the new optimization objective is to minimize CASE, which decomposes into sub-problems over each output channel for SQuant-C, over each kernel for SQuant-K, and over each element for SQuant-E. SQuant is further optimized into a top-$k$ algorithm with a significant complexity reduction for each sub-problem. Our experiments show that after SQuant-K pre-optimization, SQuant-C only requires a tiny top-$k$ number to satisfy all cases. For a kernel with 9 elements, SQuant-K only needs $k \leq 4$ because the kernel CASE is always at most $4.5$. Finally, the complexity of both SQuant-C and SQuant-K reduces to linear in the number of elements they process.
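The $k \leq 4$ bound for 9-element kernels follows directly from $|\Delta W| \leq 0.5$ per element; a quick numerical check (ours, with uniformly random perturbations for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Each rounding perturbation lies in [-0.5, 0.5], so a 9-element kernel has
# CASE <= 4.5 and SQuant-K flips at most round(4.5) = 4 elements.
max_k = max(
    int(round(abs(rng.uniform(-0.5, 0.5, size=9).sum())))
    for _ in range(10000)
)
print(max_k <= 4)
```

Because the required number of flips per kernel is a small constant, the cost of SQuant-K is dominated by a single pass over the kernel's elements, which is the source of the linear complexity stated above.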