FAT: Learning Low-Bitwidth Parametric Representation via Frequency-Aware Transformation

02/15/2021 · by Chaofan Tao, et al.

Learning convolutional neural networks (CNNs) with low bitwidth is challenging because performance may drop significantly after quantization. Prior arts often discretize the network weights by carefully tuning hyper-parameters of quantization (e.g. non-uniform stepsize and layer-wise bitwidths), which are complicated and sub-optimal because the full-precision and low-precision models have a large discrepancy. This work presents a novel quantization pipeline, Frequency-Aware Transformation (FAT), which has several appealing benefits. (1) Rather than designing complicated quantizers like existing works, FAT learns to transform network weights in the frequency domain before quantization, making them more amenable to training in low bitwidth. (2) With FAT, CNNs can be easily trained in low precision using simple standard quantizers without tedious hyper-parameter tuning. Theoretical analysis shows that FAT improves both uniform and non-uniform quantizers. (3) FAT can be easily plugged into many CNN architectures. When training ResNet-18 and MobileNet-V2 in 4 bits, FAT plus a simple rounding operation already achieves 70.5% and 69.2% top-1 accuracy, outperforming recent state-of-the-art methods while reducing computations by 54.9× and 45.7× against the full-precision models. We hope FAT provides a novel perspective for model quantization. Code is available at <https://github.com/ChaofanTao/FAT_Quantization>.

1 Introduction

Convolutional neural networks (CNNs) have exhibited impressive capabilities in various real-world applications. However, such amazing performance usually comes at the expense of a humongous amount of storage and computation. For example, ResNet-101 [12] entails 44.6M parameters and billions of multiply-accumulate (MAC) operations per image on ImageNet [30]. These requirements hinder mobile applications (e.g., on cell phones and robots) where devices have restrictive storage, power and computing resources. In order to compress CNNs, pruning [14, 19] and distillation [34, 33] attempt to learn a compact model with fewer weights than the original model. In contrast, quantization methods shrink the bitwidth of data by replacing float values with finite-bitwidth integers, thus reducing memory footprint and simplifying computational operations.

Previous approaches [43, 17, 11, 38, 8] generally model the task of quantization as an error minimization problem, viz. $\min \lVert \mathbf{W} - Q(\mathbf{W}) \rVert$, where $\mathbf{W}$ is the weight and $Q(\cdot)$ the quantizer. As shown in Figure 1, to minimize the quantization error, complicated quantizers are trained involving mixed precision across layers or channels, adaptive quantization levels or a learnable training policy, e.g. a reinforcement learning-based policy. This has three potential problems: 1) The solution spaces of full-precision and quantized models are quite different (continuous vs. discrete), especially for low-bitwidth quantized models. The quantized model has very limited capacity to represent its weights, so simply pushing full-precision values to their quantized representations is sub-optimal; Ref. [26] shows that moderately flipping the signs of weights during binarization leads to better performance than vanilla binarization. 2) Weights are correlated with each other in each CNN filter, while previous methods quantize each weight independently without exploring such relationships. 3) The quantized weight $Q(\mathbf{W})$ has a null gradient $\partial Q(\mathbf{W})/\partial \mathbf{W}$ almost everywhere with respect to its float counterpart. The straight-through estimator (STE) [4] and its soft variant DSQ [10] are heuristics to approximate this gradient, yet how to estimate accurate gradients is still an open challenge.

This work poses a question: instead of learning complicated quantizers to fit the full-precision parameters, can we generate quantization-friendly representations to fit a basic quantizer? We propose to decompose quantization into a representation transform and a standard quantizer such as a uniform quantizer. Before being sent to the quantizer, the weight parameters are transformed as $\tilde{\mathbf{W}} = T(\mathbf{W})$. The transform bridges the capacity gap between the full-precision and low-bitwidth models by suppressing redundant information and retaining useful information. By carefully defining the transform, we can jointly quantize parameters by exploring the relationships among neurons and obtain informative gradients.

In practice, the transform $T(\cdot)$ maps the original weights to the frequency domain, emphasizes important frequency components, and then maps the weights back to the spatial domain. Powered by the Discrete Fourier Transform (DFT), every element in the frequency domain is associated with all elements of the weights in the spatial domain. Therefore, the transform analyses the weights holistically in the frequency domain, rather than treating each parameter separately. By learning a mask over the frequency map of the weights, the transform selectively retains informative frequencies while preventing trivial frequencies from flowing into a restrictive low-bitwidth model. Afterwards, a standard quantizer (e.g., uniform or logarithmic) quantizes the parameters to a prescribed bitwidth.

In backpropagation, the discretization gradient $\partial Q(\tilde{\mathbf{W}})/\partial \mathbf{W}$ becomes controllable via the explicitly defined transform. For simplicity, we employ the same quantizer for both weights and activations, while not transforming the activations (which do not relate to the capacity of the neural network). After training, the weight transform is removed, and deployment becomes as simple as using a standard quantizer. Hence, no special hardware [15] that supports advanced quantizers (e.g., mixed precision) is required.

Figure 1: Comparison between previous quantization methods (top) and FAT (bottom). The proposed method decouples quantization into a transform and a standard quantizer. The transform suppresses trivial components while keeping informative ones, and learns the quantization gradient by exploiting relationships among neurons. FAT does not involve mixed precision, adaptive quantization levels or a learnable training policy. During inference, only a simple, standard quantizer is required.

We further provide theoretical analyses about the properties of our model from a frequency perspective. The proposed Frequency-Aware Transformation (FAT) enables not only quantization error reduction, but also informative discretization gradient by jointly considering multiple frequencies. The main contributions of this work are fourfold:

  1. To the best of our knowledge, this is the first work that models the task of quantization via a representation transform followed by a standard quantizer. The proposed transform is an easy drop-in. We combine the transform with a uniform/logarithmic quantizer in this paper.

  2. Powered by Fourier Transform, we introduce a novel spectral transform to generate quantization-friendly representations. The discretization gradient is enriched by exploring relationships between neurons, rather than quantizing neurons separately.

  3. We theoretically analyse properties of the proposed transform. It deepens our understanding of quantization from a frequency-domain viewpoint.

  4. We outperform state-of-the-art methods on the CIFAR-10 and ImageNet datasets with higher computation reduction, pushing both weights and activations down to INT3 while remaining competitive with full-precision models. The model is also deployed on an ARM-based mobile board.

2 Related Work

2.1 Quantization Methods

Post-Training Quantization Post-training quantization [3, 1, 24, 23] requires no training, only a subset of the dataset for calibrating the quantization parameters, including the clipping threshold and bias correction. Commonly, the quantization parameters of both weights and activations are decided before inference. Current post-training quantization methods cannot achieve satisfactory performance as the allowed bitwidth shrinks, since the post-training quantization errors accumulate layer by layer [32]. Instead, quantization-aware training enables the model to adapt itself to a low-bitwidth setting.

Quantization-Aware Training Quantization-aware training [7, 43, 10, 26] generally focuses on minimizing the gaps between the quantized parameters and the corresponding full-precision ones. In [27, 44, 22, 9], the scaling factors for quantized parameters make the approximation of full-precision parameters more accurate. In [37], the weights and activations are quantized separately in a two-step strategy. Mixed-precision is widely employed to achieve smaller quantization errors, such as LQ-Net [43], DJPQ [38] and HMQ [11]. In HAQ [36], the training policy is learned by reinforcement learning. Given a bitwidth, adaptive non-uniform quantization intervals [5] can also reduce the quantization errors.

In addition, the non-differentiable quantization function leads to a zero-gradient problem during training. Although STE [4] can be employed, its approximation error is large when the bitwidth is low. DSQ [10] uses a series of hyperbolic tangent functions to gradually approach the staircase function. Nevertheless, the aforementioned methods quantize each weight independently. Our proposed FAT tackles quantization with an explicit transform, enabling joint quantization from a holistic view during feed-forward and informative gradients during backpropagation.

2.2 Learning in the Frequency Domain

Standard convolution in the spatial domain is mathematically equivalent to the cheaper Hadamard product in the frequency domain. Therefore, the Fast Fourier Transform (FFT) has been widely used to speed up convolution [21, 29, 18] and to design energy-efficient CNN hardware [25, 6]. CNNpack [39] regards convolutional filters as images and decomposes convolution in the frequency domain for speedup.

Moreover, the FFT allows one to extract salient information of feature maps from the frequency domain, which provides additional cues besides visual features. Ref. [42] replaces vanilla downsampling with the frequency map of the input data, which acts as a new form of data pre-processing. In [28], spectral pooling is proposed, which performs pooling in the frequency domain and preserves more information than regular pooling in the spatial domain. In [41], the authors propose an invertible network for the image rescaling task by embedding lost high-frequency information in the downscaling direction. While most of these methods focus on efficient convolution or processing of data/feature maps, our proposed FAT is the first to leverage frequency properties to learn quantization-friendly representations in the weight space, as a bridge to low-bitwidth models.

3 Frequency-Aware Transformation (FAT)

3.1 Preliminary

We introduce two standard quantizers, the uniform quantizer and the logarithmic quantizer. Traditionally, if the bitwidth $b$ is fixed for all layers without learning, a quantizer is determined by its quantization levels and clip threshold. A full-precision value $x$ is first clipped by a float threshold $\alpha$ and then projected onto a set of discrete quantization levels. Using unsigned quantization as an example, uniform quantization is formulated as:

$Q_u(x) = \frac{\alpha}{2^b - 1}\,\mathrm{round}\!\left(\frac{2^b - 1}{\alpha}\,\mathrm{clip}(x, 0, \alpha)\right)$ (1)

where clipping mitigates the negative effect of extreme values. Uniform quantization has $2^b$ quantization levels $\{0, \frac{\alpha}{2^b-1}, \frac{2\alpha}{2^b-1}, \dots, \alpha\}$. The interval between quantization levels is fixed to $\frac{\alpha}{2^b-1}$, so the same conventional adders and multipliers can be employed during deployment. Logarithmic quantization has quantization levels that are zero or powers-of-two fractions of the threshold, $\{0, \alpha 2^{-(2^b-2)}, \dots, \alpha 2^{-1}, \alpha\}$. A full-precision value is mapped to its nearest quantization level to obtain its quantized value. Although the logarithmic quantizer involves different multipliers for different quantization levels, they can be realized with cheap shift operations. Signed quantization is a straightforward extension obtained by decreasing $b$ by 1 bit and symmetrizing the quantization levels.
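As a concrete reference, the two quantizers can be sketched in a few lines of PyTorch. This is a minimal illustration of unsigned $b$-bit uniform and logarithmic quantization with clip threshold $\alpha$ (function names are ours and gradients are not handled here; the released code may differ):

```python
import torch

def uniform_quantize(x, alpha, b):
    """Unsigned b-bit uniform quantizer: clip to [0, alpha] and round to the
    nearest of the 2^b evenly spaced levels {0, alpha/(2^b-1), ..., alpha}."""
    levels = 2 ** b - 1
    x = torch.clamp(x, 0.0, alpha)
    return torch.round(x / alpha * levels) / levels * alpha

def log_quantize(x, alpha, b):
    """Unsigned b-bit logarithmic quantizer: levels are zero and powers-of-two
    fractions of alpha, i.e. {alpha, alpha/2, ..., alpha * 2^-(2^b - 2), 0}."""
    x = torch.clamp(x, 0.0, alpha)
    exps = -torch.arange(2 ** b - 1, dtype=x.dtype)                     # 0, -1, ..., -(2^b - 2)
    levels = torch.cat([alpha * 2.0 ** exps, torch.zeros(1, dtype=x.dtype)])  # append the zero level
    idx = torch.argmin((x.unsqueeze(-1) - levels).abs(), dim=-1)
    return levels[idx]                                                  # nearest quantization level
```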

3.2 FAT Framework

Compared with existing approaches that focus on how to design quantizers to fit the full-precision weights, our proposed FAT attempts to generate quantization-friendly representations via a spectral transform. Since the capacity of a low-bitwidth model is more restrictive, a good representation should make full use of each bit. The transform is expected to keep the salient information while disregarding unimportant cues. To unify the operation for both convolutional and fully-connected layers, we reshape a CNN kernel tensor $\mathbf{W} \in \mathbb{R}^{c_{out}\times c_{in}\times k\times k}$ into a 2-D matrix $\mathbf{W} \in \mathbb{R}^{c_{out}\times n}$, where $n = c_{in}k^2$. Each row denotes a filter. Since different filters are computed separately in convolution, we apply a 1-D Discrete Fourier Transform (DFT) on each filter to obtain the frequency map $\mathbf{F}$:

$\mathbf{F}_{i,u} = \sum_{j=0}^{n-1} \mathbf{W}_{i,j}\, e^{-\mathrm{i}\, 2\pi u j / n}$ (2)

where $i$ is the filter index and $j$ the neuron index within filter $i$. This operation encodes each convolutional filter with frequency basis functions. Since the DFT generates complex values instead of real ones, we compute the spectral norm (magnitude) of the weights at each frequency, which reflects the energy on that frequency. As shown in Figure 2, compared with weights in the spatial domain, the energy distribution is much sparser in the frequency domain: regardless of the filter, the energy in both low and high frequencies is strong, while the energy in the middle frequencies is weak. This observation generally holds across layers and across the network architectures used in the Experiments section. It inspires us to learn a soft mask that automatically learns importance from the frequency map, suppressing redundant information from flowing into the low-bitwidth model. Then, a simple quantizer can be applied regardless of layer and architecture.
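This flatten-and-transform step can be sketched with torch.fft (our own minimal illustration, assuming a standard 4-D convolutional weight):

```python
import torch

def filter_spectrum(weight_4d):
    """Flatten a (c_out, c_in, k, k) kernel into (c_out, n) filters, n = c_in*k*k,
    and apply a 1-D DFT along each filter (Eq. 2), returning the complex
    frequency map and its per-frequency magnitude (spectral energy)."""
    c_out = weight_4d.shape[0]
    w2d = weight_4d.reshape(c_out, -1)      # each row is one flattened filter
    freq = torch.fft.fft(w2d, dim=-1)       # complex frequency map F
    return freq, freq.abs()

freq, mag = filter_spectrum(torch.randn(64, 32, 3, 3))
print(freq.shape, mag.shape)                # both (64, 288)
```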

Figure 2: Visualization of two flattened weights in the spatial domain (top) and the frequency domain (bottom). The warmer the color, the higher the value. The energy is distributed irregularly in the spatial domain, while sparsely distributed over the frequencies. We then use a soft mask to learn the importance of the frequency map of the weights in an element-wise manner.
Figure 3: Illustration of the proposed quantization process. “W” and “A” stand for weight and activation. $Q(\cdot)$ is a standard quantizer. We use $\odot$ and $*$ to denote the Hadamard product and the convolution operation, respectively.

The complete workflow is visualized in Figure 3. We use a trainable mask $\mathbf{M}$ to distinguish the quantization-friendly components of the weights from the quantization-useless ones; the useless components are softly deactivated with coefficients ranging from 0 to 1. This step is formalized by the following equations:

$\mathbf{M} = \sigma\big(g(|\mathbf{F}|)\big)$ (3)
$\tilde{\mathbf{F}} = \mathbf{M} \odot \mathbf{F}$ (4)

where $g(\cdot)$ linearly maps the power spectral map $|\mathbf{F}|$, the Sigmoid function $\sigma(\cdot)$ ensures all coefficients learned in the mask lie in $(0, 1)$, and $\odot$ denotes element-wise multiplication. The learning of the mask can be viewed as a self-attention mechanism in the frequency domain: the mask retains useful frequency components and suppresses trivial ones. Finally, we map the weights back to the spatial domain:

$\tilde{\mathbf{W}} = \mathcal{F}^{-1}(\tilde{\mathbf{F}})$ (5)

where $\mathcal{F}^{-1}$ is the inverse Discrete Fourier Transform (iDFT). The transformed weight $\tilde{\mathbf{W}}$ is then quantized by a standard quantizer, and the quantized convolutional weights are reshaped back to 4-D for the convolution operation.
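Eqs. 3–5 can be condensed into a small PyTorch module. In this sketch we read the linear map $g(\cdot)$ as a single fully-connected layer over the magnitude spectrum; this is our simplified interpretation, and the authors' released code may implement $g$ differently:

```python
import torch
import torch.nn as nn

class FATTransform(nn.Module):
    """Frequency-aware transform: per-filter DFT -> learned soft mask -> inverse DFT."""
    def __init__(self, n):
        super().__init__()
        self.g = nn.Linear(n, n)                     # linear map over the magnitude spectrum

    def forward(self, w2d):                          # w2d: (c_out, n) flattened filters
        freq = torch.fft.fft(w2d, dim=-1)            # Eq. 2: frequency map F
        mask = torch.sigmoid(self.g(freq.abs()))     # Eq. 3: soft mask M, entries in (0, 1)
        masked = mask * freq                         # Eq. 4: element-wise re-weighting
        return torch.fft.ifft(masked, dim=-1).real   # Eq. 5: back to the spatial domain
```

During training this module is applied to the flattened weight right before the standard quantizer; at inference the transformed weight is fixed, so the module (and the DFT) disappears from the deployed model.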

Figure 4: Illustration of how the gradient of the flattened convolutional weights is computed through the transform.
  Input: weight $\mathbf{W}$ and activation $\mathbf{A}$, bitwidth $b$;
  Output: quantized output $\mathbf{O}$
  Parameters: soft mask $\mathbf{M}$, clip threshold $\alpha$
  Feed Forward:
  $\mathbf{F} = \mathcal{F}(\mathbf{W})$: map the full-precision weight from the spatial domain to the frequency domain;
  $\tilde{\mathbf{W}} = \mathcal{F}^{-1}(\mathbf{M} \odot \mathbf{F})$, where $\mathbf{M}$ is generated by Eq. 3 to adjust the passing proportion of the different frequency bases;
  $\hat{\mathbf{W}} = Q(\tilde{\mathbf{W}})$, where $Q$ is a standard quantizer applied on the transformed weight $\tilde{\mathbf{W}}$;
  $\hat{\mathbf{A}} = Q(\mathbf{A})$;
  $\mathbf{O} = \hat{\mathbf{W}} * \hat{\mathbf{A}}$;
  Backward Propagation:
  compute $\partial\mathcal{L}/\partial\hat{\mathbf{W}}$ from $\partial\mathcal{L}/\partial\mathbf{O}$;
  compute $\partial\mathcal{L}/\partial\mathbf{M}$ via $\partial\hat{\mathbf{W}}/\partial\tilde{\mathbf{W}}$ and $\partial\tilde{\mathbf{W}}/\partial\mathbf{M}$, where the soft mask is updated jointly from the clues of different frequencies;
  compute $\partial\mathcal{L}/\partial\mathbf{W}$ via $\partial\hat{\mathbf{W}}/\partial\tilde{\mathbf{W}}$ and $\partial\tilde{\mathbf{W}}/\partial\mathbf{W}$, where the discretization gradient jointly uses the clues of different frequencies;
  compute $\partial\mathcal{L}/\partial\mathbf{A}$ from $\partial\mathcal{L}/\partial\mathbf{O}$ and $\partial\hat{\mathbf{A}}/\partial\mathbf{A}$;
  compute $\partial\mathcal{L}/\partial\alpha$ from the clipping of $\tilde{\mathbf{W}}$ and $\mathbf{A}$; // During inference, the transform is removed; the quantized model only uses a uniform/logarithmic quantizer.
Algorithm 1: The forward and backward passes of FAT applied to one convolutional layer.
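Below is a hedged sketch of how the forward and backward passes of Algorithm 1 can be wired up in PyTorch: a signed, PACT-style straight-through quantizer supplies $\partial Q(\tilde{\mathbf{W}})/\partial \tilde{\mathbf{W}}$, while autograd propagates the remaining, non-diagonal gradient through the DFT/mask/iDFT transform (the `FATTransform` module from the previous sketch). Class names and the layer configuration are ours, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STEQuantize(torch.autograd.Function):
    """Signed symmetric uniform quantizer with a straight-through gradient
    inside the clipping range [-alpha, alpha] and a PACT-style gradient for alpha."""
    @staticmethod
    def forward(ctx, x, alpha, b):
        ctx.save_for_backward(x, alpha)
        levels = 2 ** (b - 1) - 1
        xc = torch.clamp(x, -alpha.item(), alpha.item())
        return torch.round(xc / alpha * levels) / levels * alpha

    @staticmethod
    def backward(ctx, grad_out):
        x, alpha = ctx.saved_tensors
        inside = (x.abs() <= alpha).to(grad_out.dtype)   # STE: pass-through inside the clip range
        grad_alpha = (grad_out * (x.abs() > alpha).to(grad_out.dtype) * x.sign()).sum().view(1)
        return grad_out * inside, grad_alpha, None

class FATConv2d(nn.Module):
    """Conv layer trained with FAT: transform -> quantize -> convolve (weights only here;
    the paper quantizes activations with the same standard quantizer)."""
    def __init__(self, c_in, c_out, k, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.05)
        self.alpha = nn.Parameter(torch.tensor([1.0]))   # learnable clip threshold
        self.bits = bits
        self.transform = FATTransform(c_in * k * k)      # module from the previous sketch

    def forward(self, x):
        w_t = self.transform(self.weight.flatten(1)).view_as(self.weight)
        w_q = STEQuantize.apply(w_t, self.alpha, self.bits)
        return F.conv2d(x, w_q, padding=1)               # padding chosen for k = 3
```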

3.3 Analyses of the Proposed Framework

We provide theoretical insights showing that FAT enables smaller quantization errors and a more informative backpropagation gradient via a spectral transform that incorporates structural properties across frequencies. We also explore the relationship of our approach with representative prior schemes. The detailed proofs of the theorems and the derivations of all gradients involved in FAT are given in the Appendix.

(a) STE
(b) DSQ
(c) FAT (Ours)
Figure 5: Top: Comparison of the discretization gradient matrix in STE, DSQ [10] and ours, for any given filter (taking a small number of neurons $n$ as an example). Bottom: The corresponding quantization approximation functions. The derivatives of these functions are the corresponding discretization gradients; for each neuron, the derivative equals the column sum of the gradient matrix above. Instead of quantizing each neuron independently as in STE or DSQ (diagonal gradient matrix), our method considers all neurons in quantization and utilizes information from all frequencies.

3.3.1 Quantization Error Reduction

Quantization error directly reflects the quality of quantization. Following [1], we employ the expected mean-squared error (MSE) between the full-precision weights and their quantized version as the metric for quantization error. In this section, we theoretically show that the transformed weight $\tilde{\mathbf{W}}$ is guaranteed to have a quantization error no larger than that of the original weight $\mathbf{W}$. Without loss of generality, the weight is assumed to follow a zero-mean distribution, as stated in Theorem 1.

Theorem 1.

Assume the weight $\mathbf{W}$ follows a zero-mean distribution. Then the following two inequalities hold for uniform and logarithmic quantization, respectively:

$\mathbb{E}\big[\|\tilde{\mathbf{W}} - Q_u(\tilde{\mathbf{W}})\|^2\big] \le \mathbb{E}\big[\|\mathbf{W} - Q_u(\mathbf{W})\|^2\big]$ (6a)
$\mathbb{E}\big[\|\tilde{\mathbf{W}} - Q_l(\tilde{\mathbf{W}})\|^2\big] \le \mathbb{E}\big[\|\mathbf{W} - Q_l(\mathbf{W})\|^2\big]$ (6b)

We show that the expected MSE between any given full-precision weight and its quantized version can be approximated as a function of the clipping threshold $\alpha$ and the amplitude $a$. For uniform quantization and logarithmic quantization, the quantization error can be written analytically as:

(7)
(8)
(9)

where $b$ is the bitwidth, and $Q_u$ and $Q_l$ represent uniform and logarithmic quantization, respectively.

From Eqs. 7 and 8, the quantization error consists of two terms, namely quantization noise and clipping noise; the clipping noise term is approximately the same for the uniform and logarithmic quantizers. In the proposed FAT, the mask helps tighten the weights towards zero: applying the mask to $\mathbf{W}$ yields a transformed weight whose amplitude satisfies $\tilde{a} \le a$. Since the data range of the weights shrinks after the transform, the quantization resolution increases for all transformed weights. On the other hand, the informative components of the full-precision weights are kept by passing the important frequencies through the mask. The detailed proof is available in the Appendix. We also show that simply rescaling the standard deviation to tighten the weights does not work, since important components are not guaranteed to be passed to the quantized model this way.

3.3.2 Informative Discretization gradient

Standard quantization is not differentiable and therefore does not readily allow backpropagation. In order to learn a good quantizer, we should carefully design the discretization gradient $\partial Q(\tilde{\mathbf{W}})/\partial \mathbf{W}$. The gradient should not only be computable but also retain information during quantization. Before analysing our discretization gradient and comparing it with former approaches, we show how to compute the gradient of the transform. As shown in Figure 4, given the $i$-th vectorized filter $\mathbf{W}_i \in \mathbb{R}^n$ and its transformed version $\tilde{\mathbf{W}}_i$, the gradient $\partial \tilde{\mathbf{W}}_i / \partial \mathbf{W}_i$ is an $n \times n$ matrix. For the $j$-th neuron $\mathbf{W}_{i,j}$ in the $i$-th filter, the corresponding column of the gradient matrix reflects the effects of all the transformed neurons on the original neuron $\mathbf{W}_{i,j}$. Therefore, the whole gradient tensor is summed along the column axis to obtain the same shape as the original weight, which is then used for weight updating during backpropagation.

We visualize the gradient matrix and the corresponding approximation function in Figure 5. STE [4] is a rough approximation of discretization: it assigns a gradient of 1 to all values within the clip threshold and 0 to outliers, so the antiderivative of this gradient is the clip function itself. DSQ [10] approximates the gradient by a set of hyperbolic tangent functions parameterized by a normalization factor and a handcrafted coefficient; the discretization gradient is the derivative of these functions. Note that the functions used in DSQ differ from ours in two respects: 1) DSQ only modifies the gradient during the backward pass, but does not affect the quantization result during the forward pass; 2) DSQ treats each neuron separately.

With the bridge of our proposed transform, the discretization gradient becomes $\frac{\partial Q(\tilde{\mathbf{W}})}{\partial \mathbf{W}} = \frac{\partial Q(\tilde{\mathbf{W}})}{\partial \tilde{\mathbf{W}}}\,\frac{\partial \tilde{\mathbf{W}}}{\partial \mathbf{W}}$. The first term is approximated with the STE, and the transform lets us adjust the second term during training. Given the convolutional filter with index $i$, the gradient of a transformed neuron with respect to an original neuron during backpropagation gathers clues from all frequency bases,

$\frac{\partial \tilde{\mathbf{W}}_{i,k}}{\partial \mathbf{W}_{i,j}} = \frac{1}{n}\sum_{u=0}^{n-1} \mathbf{M}_{i,u}\, e^{\mathrm{i}\,2\pi u (k-j)/n}$ (10)

where $k, j \in \{0, \dots, n-1\}$ and $\mathbf{M}_{i,u}$ is the mask coefficient of filter $i$ at frequency $u$. The soft mask in our transform learns the importance of the weights on different frequency bases. Writing Eq. 10 in matrix form, $\partial \tilde{\mathbf{W}}_i / \partial \mathbf{W}_i = \mathcal{F}^{-1}\,\mathrm{diag}(\mathbf{M}_i)\,\mathcal{F}$, we can see that our gradient matrix is in general a non-diagonal matrix, which shows that our model considers cross-neuron dependencies during training.

If the mask is an all-ones matrix, all frequencies pass at 100% and the transform becomes an identity map, $\tilde{\mathbf{W}} = \mathbf{W}$. In this case, the gradient matrix in Eq. 10 degenerates to an identity matrix and the discretization gradient degenerates to the STE.
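This degeneracy is easy to verify numerically. The sketch below (our own, with an explicit matrix DFT) builds the fixed-mask gradient matrix $\mathcal{F}^{-1}\,\mathrm{diag}(\mathbf{M}_i)\,\mathcal{F}$ for a random mask and for an all-ones mask:

```python
import math
import torch

n = 6
idx = torch.arange(n, dtype=torch.float32)
D = torch.exp(-2j * math.pi * idx[:, None] * idx[None, :] / n)   # DFT matrix, D[u, j] = e^{-i 2pi u j / n}
D_inv = D.conj().T / n                                           # inverse DFT matrix

def grad_matrix(mask):
    """d(w_tilde)/d(w) for a fixed mask m in w_tilde = D^{-1}(m * (D w))."""
    return (D_inv @ torch.diag(mask.to(D.dtype)) @ D).real

print(grad_matrix(torch.rand(n)))   # dense: every original neuron influences every transformed one
print(grad_matrix(torch.ones(n)))   # ~identity: an all-ones mask reduces the gradient to the STE case
```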

Architecture Methods W/A Acc@1 Acc@5
ResNet-18 Full-precision 32/32 69.6 89.0
RQ 8/8 70.0 89.4
LSQ 8/8 71.1 90.1
UNIQ 4/8 67.0 -
DJPQ 4/8 (m) 69.3 -
PACT 5/5 69.8 89.3
LQ-Net 4/4 69.3 88.8
DSQ 4/4 69.5 -
APoT 4/4 69.9 89.3
FAT(Ours) 5/5 70.8 89.7
FAT(Ours) 4/4 70.5 89.5
FAT(Ours) 3/3 69.0 88.6
ResNet-34 Full-precision 32/32 73.7 91.3
LSQ 8/8 74.1 91.1
BCGD 4/4 73.4 91.4
QIL 4/4 73.7 -
DSQ 4/4 72.8 -
APoT 4/4 73.0 91.0
FAT(Ours) 5/5 74.6 91.8
FAT(Ours) 4/4 74.1 91.8
FAT(Ours) 3/3 73.2 91.2
MobileNet-V2 Full-precision 32/32 71.7 90.4
HAQ 4/32 (m) 71.4 90.2
HMQ 4/32 (m) 70.9 -
DQ 4/8 (m) 68.8 -
DJPQ 4/8 (m) 69.0 -
PACT 4/4 61.4 -
DSQ 4/4 64.8 -
FAT(Ours) 5/5 69.6 89.2
FAT(Ours) 4/4 69.2 88.9
FAT(Ours) 3/3 62.8 84.9
Table 1: Comparison on ImageNet dataset with different bitwidths for weight (W) and activation (A). We use “*” and “m” to mark our implementation results and mixed-precision, respectively. Acc@1 and Acc@5 denote top-1 and top-5 accuracy in percentage.

3.4 Complexity Analysis

The proposed FAT is summarized in Algorithm 1. Consider a convolutional layer with a 4-D weight tensor $\mathbf{W} \in \mathbb{R}^{c_{out}\times c_{in}\times k\times k}$ and an input of size $c_{in}\times H\times W$, where $H$ and $W$ are the height and width of the input. The number of multiply-accumulate (MAC) operations in the common convolution is $c_{out}\, c_{in}\, k^2 H W$. Denoting $n = c_{in} k^2$, the DFT and iDFT over the $c_{out}$ filters, together with the magnitude computation, the element-wise product and the fully-connected mapping that generates the mask, add a cost that depends only on $c_{out}$ and $n$ and is independent of the spatial size $H\times W$ of the input and of the data batch size. This extra cost is orders of magnitude smaller than that of the convolution itself and is incurred only during training. Hence, the transform introduces negligible training cost.

Architecture Methods W/A Accuracy
VGG-Small Full-precision 32/32 93.1
RQ 8/8 93.3
DJPQ 4/8 (m) 91.5
RQ 4/4 92.0
WAGE 2/8 93.2
FAT(Ours) 4/4 94.4
FAT(Ours) 3/3 94.3
ResNet-20 Full-precision 32/32 91.6
DSQ 1/32 90.2
PACT 4/4 90.5
APoT 4/4 92.3
FAT(Ours) 4/4 93.2
FAT(Ours) 3/3 92.8
ResNet-56 Full-precision 32/32 93.2
PACT 2/32 92.9
APoT 4/4 94.0
FAT(Ours) 4/4 94.6
FAT(Ours) 3/3 94.3
Table 2: Comparison on CIFAR-10 dataset with different bitwidths for weight (W) and activation (A).

4 Experiments

We evaluate the effectiveness of the proposed FAT on two commonly used datasets, CIFAR-10 [16] and ImageNet-ILSVRC2012 [30]. CIFAR-10 is an image classification dataset with 10 classes. ImageNet is a large-scale dataset with 1.3M training images and 50k validation images. We adopt the standard training/test split for both datasets.

VGG-small [43], ResNet-20 and ResNet-56 [12] are used on CIFAR-10 dataset. ResNet-18, ResNet-34 and MobileNetV2 [31] are used on ImageNet dataset.

The proposed FAT is built on the PyTorch framework. We compare FAT with state-of-the-art approaches, including WAGE [40], LQ-Net [43], PACT [7], RQ [20], UNIQ [3], DQ [35], BCGD [2], DSQ [10], QIL [13], HAQ [36], APoT [17], HMQ [11], DJPQ [38] and LSQ [8].

We evaluate models by the trade-off among accuracy, model size and bit-operations (BOPs). BOPs is a general metric that accounts for both the bitwidths and the number of multiply-accumulate (MAC) operations [38]: $\mathrm{BOPs} = b_w \cdot b_a \cdot \#\mathrm{MACs}$, where $b_w$ and $b_a$ are the bitwidths of weight and activation, respectively. A smaller BOP count means lighter computation. We report the proposed transform with a uniform quantizer; the performance with the logarithmic quantizer, a brief categorization of state-of-the-art methods, and training details are given in the Appendix.
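For clarity, the metric can be written as a one-line helper; the MAC count below is backed out of Table 3 (ResNet-18 at 32/32 has 1863.7 GBOPs, i.e. roughly 1.82 GMACs):

```python
def bops(macs, b_w, b_a):
    """Bit-operations of a model/layer: MAC count times weight and activation bitwidths."""
    return macs * b_w * b_a

macs_resnet18 = 1863.7e9 / (32 * 32)                              # back out MACs from the 32/32 BOPs in Table 3
print(bops(macs_resnet18, 32, 32) / bops(macs_resnet18, 4, 4))    # 64.0: the idealized 4/4 ratio
# Table 3 reports 54.9x for FAT at 4/4, i.e. part of the network is counted at higher bitwidths under this metric.
```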

4.1 Experimental Results

As shown in Tables 1 and 2, FAT surpasses previous state-of-the-art methods in accuracy without using a complicated quantizer or training tricks, such as the learnable quantization stepsize in LSQ, the reinforcement learning-based quantization policy in HAQ, and the arbitrary-bit precision in HMQ and DJPQ. These methods attempt to learn powerful non-uniform quantizers to fit the distribution of the original full-precision data, thereby decreasing the quantization error. Compared with the approaches above, the proposed FAT achieves state-of-the-art performance without bells and whistles, trading off accuracy and bitwidth. Instead of using a high bitwidth like 8 bits, we show that our method enables commonly used networks such as ResNet, VGG-Small and MobileNet to achieve acceptable performance in the 3-bit or 4-bit setting.

Tables 1 and 2 show the power of using a representation transform before quantization. By viewing the full-precision weights as images and mapping them to another space where unimportant frequency bases are deactivated, we can use a simple uniform/logarithmic quantizer and still achieve competitive performance. This indicates the importance of bridging the full-precision weight to a quantization-friendly representation before quantization, especially in low-bitwidth settings such as 3-bit integer quantization. Our results even outperform the full-precision models, since quantization has a regularization effect during training.

4.2 Ablation Study

4.2.1 Hardware Performance

Since the weights after quantization-aware training are fixed, the transform is removed during inference; FAT is then as light as applying a uniform/logarithmic quantizer to the activations. FAT does not need to store extra quantizer parameters and employs a unified quantization scheme for all layers. Hence, the proposed FAT reduces computation compared with most previous methods. As shown in Table 3, we compare the model size and bit-operations among different quantization methods. When quantizing both weights and activations to 4 bits, our method achieves 7.7×, 7.9× and 6.7× model-size compression and 54.9×, 58.5× and 45.7× bit-operation reduction against full-precision ResNet-18, ResNet-34 and MobileNetV2, respectively.

In addition, employing a standard quantizer for all layers is hardware-friendly. For instance, if using adaptive quantization levels or mixed-precision, different quantizers need to be adopted per layer and/or channel, which greatly increases the difficulty of hardware deployment on, e.g., ARM CPU, FPGAs, etc.

4.2.2 Suppressed Frequencies in Quantization

We examine which frequencies the quantized model prefers to keep or discard. In Figure 6, we visualize the frequency maps of the weights and the corresponding learned masks in 4 layers. The visualization demonstrates that the middle frequencies not only have weak spectral density in the frequency map, but are also further suppressed when going from the full-precision model to the quantized model. In traditional image denoising, noise usually has weak spectral density compared with the visual features; by removing the frequencies with weak density, the noise in the image is reduced. Here we learn the importance of frequencies in the weight space, so the transform has a similar effect to denoising, i.e., it curtails redundant weights in a capacity-limited low-bitwidth model. It also indicates that both the low and high frequencies of neurons are important for quantization.

Methods bitwidth Size (MB) C.R. BOPs (G) C.R.
ResNet-18
F.P. 32/32 46.8 1x 1863.7 1x
APoT 4/4 6.3 7.4x 36.5 51.0x
DSQ 4/4 6.3 7.4x 36.5 51.0x
FAT(Ours) 4/4 6.1 7.7x 33.9 54.9x
FAT(Ours) 3/3 4.7 10.0x 21.5 86.7x
ResNet-34
F.P. 32/32 87.2 1x 3759.4 1x
DSQ 4/4 11.2 7.8x 69.7 53x
FAT(Ours) 4/4 11.1 7.9x 64.3 58.5x
FAT(Ours) 3/3 8.6 10.1x 38.6 97.4x
MobileNetV2
F.P. 32/32 14.1 1x 337.9 1x
DQ 4/8 (m) 2.3 6.1x 19.6 17.2x
DSQ 4/4 2.3 6.1x 13.2 25.6x
APoT 4/4 2.3 6.1x 13.2 25.6x
FAT(Ours) 4/4 2.1 6.7x 7.4 45.7x
FAT(Ours) 3/3 2.1 6.7x 5.4 62.6x
Table 3: Hardware performance in terms of model size and bit-operation on ImageNet dataset, where “C.R.” denotes corresponding compression rate.
LQ-Net APoT LogQ (Ours) UniQ (Ours)
929ms 857ms 800ms 398ms
Table 4: Time cost of different quantizers in a 4-bit ResNet-18 on an ARM mobile board, Jetson AGX Xavier. “LogQ” and “UniQ” denote logarithmic and uniform quantizers used in our method.

4.2.3 Quantization Shift via Transform

In Figure 7, we show how many full-precision weights have shifted quantization results after the transform, i.e., the proportion of weights for which $Q(T(\mathbf{W})) \ne Q(\mathbf{W})$, where $\mathbf{W}$ is the pretrained full-precision weight and $T(\cdot)$ is the learned transform. From Figure 7, many neurons shift their quantization results after the transform. Weight shifting boosts the information gain during quantization [26]. This verifies that we should not simply assign quantized weights near their full-precision counterparts, due to the difference between the solution spaces of full-precision and low-bitwidth models.
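The proportion in Figure 7 can be measured with a few lines (a sketch with our own function names; `transform` and `quantize` stand for the learned FAT transform and the standard quantizer):

```python
import torch

def shifted_proportion(weight, transform, quantize):
    """Fraction of weights whose quantized value changes after the FAT transform,
    i.e. the mean over entries of [ Q(T(W)) != Q(W) ]."""
    with torch.no_grad():
        q_plain = quantize(weight)
        q_fat = quantize(transform(weight))
    return (q_plain != q_fat).float().mean().item()

# e.g. shifted_proportion(w2d, fat_transform, lambda w: uniform_quantize(w, alpha=1.0, b=4)),
# using the (hypothetical) modules from the earlier sketches.
```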

Figure 6: Visualization of the frequency maps of convolutional weights and corresponding learned masks in 4 layers of ResNet-34. Warm color means strong spectral density in frequency map and importance learned by mask, respectively.
Figure 7: Proportions of shifted weights after FAT in various layers in ResNet-34.

4.2.4 Speed Comparison on Board

We deploy our method on a mobile device and test the real inference speed of different quantizers. The device is a Jetson AGX Xavier, an ARM v8.2 64-bit CPU-based architecture. The inference time is reported on the ImageNet dataset with a single thread. From Table 4, we observe that the inference time of LQ-Net is relatively large, since different adders and multipliers are employed for its adaptive quantization levels. APoT and the logarithmic quantizer improve the speed because their quantization levels can be realized with cheap shift operations. Uniform quantization enjoys the highest inference efficiency with the smallest time cost, since it employs a unified quantizer for all layers.

5 Conclusions

This work proposes a novel Frequency-Aware Transformation (FAT) model for low-bitwidth quantization of convolutional neural networks. For the first time, we explicitly define quantization as a trained representation transform in the solution space. The soft mask in the spectral transform learns the importance of weights in each frequency bin. Backed by theoretical analyses, we show the proposed transform enables quantization error reduction and frequency-informative discretization gradient for efficient backpropagation. In addition, the transform introduces negligible extra training cost and zero test cost. Extensive experiments have demonstrated that FAT surpasses existing state-of-the-art methods with higher compression ratios and easy deployment. This work sheds light on learning quantization-friendly representations, instead of designing complicated quantizers to accommodate low-bitwidth models.

References

  • [1] R. Banner, Y. Nahshan, and D. Soudry (2019) Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pp. 7950–7958. Cited by: §1, §1, §2.1, §3.3.1.
  • [2] C. Baskin, N. Liss, Y. Chai, E. Zheltonozhskii, E. Schwartz, R. Giryes, A. Mendelson, and A. M. Bronstein (2018) Nice: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162. Cited by: §4, §4.
  • [3] C. Baskin, E. Schwartz, E. Zheltonozhskii, N. Liss, R. Giryes, A. M. Bronstein, and A. Mendelson (2018) Uniq: uniform noise injection for non-uniform quantization of neural networks. arXiv preprint arXiv:1804.10969. Cited by: §2.1, §4, §4.
  • [4] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §1, §2.1, §3.3.2.
  • [5] L. Caccia, E. Belilovsky, M. Caccia, and J. Pineau (2020) Online learned continual compression with adaptive quantization modules. In International Conference on Machine Learning, pp. 1240–1250. Cited by: §2.1.
  • [6] K. Chitsaz, M. Hajabdollahi, N. Karimi, S. Samavi, and S. Shirani (2020) Acceleration of convolutional neural network using fft-based split convolutions. arXiv preprint arXiv:2003.12621. Cited by: §2.2.
  • [7] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §2.1, §4, §4.
  • [8] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2020) Learned step size quantization. International Conference on Learning Representations. Cited by: §1, §4, §4.
  • [9] J. Faraone, N. Fraser, M. Blott, and P. H. Leong (2018) Syq: learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4300–4309. Cited by: §2.1.
  • [10] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4852–4861. Cited by: §1, §2.1, §2.1, Figure 5, §3.3.2, §4, §4.
  • [11] H. V. Habi, R. H. Jennings, and A. Netzer (2020) HMQ: hardware friendly mixed precision quantization block for cnns. arXiv preprint arXiv:2007.09952. Cited by: §1, §2.1, §4, §4.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.
  • [13] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359. Cited by: §4, §4.
  • [14] M. Kang and B. Han (2020) Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning, pp. 5122–5131. Cited by: §1.
  • [15] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1.
  • [16] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.
  • [17] Y. Li, X. Dong, and W. Wang (2020) Additive powers-of-two quantization: a non-uniform discretization for neural networks. International Conference on Learning Representations. Cited by: §1, §4, §4.
  • [18] J. Lin and Y. Yao (2019) A fast algorithm for convolutional neural networks using tile-based fast fourier transforms. Neural Processing Letters 50 (2), pp. 1951–1967. Cited by: §2.2.
  • [19] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao (2020) HRank: filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538. Cited by: §1.
  • [20] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling (2018) Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875. Cited by: §4, §4.
  • [21] M. Mathieu, M. Henaff, and Y. LeCun (2013) Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851. Cited by: §2.2.
  • [22] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr (2017) WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134. Cited by: §2.1.
  • [23] M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort (2020) Up or down? adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568. Cited by: §2.1.
  • [24] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: §2.1.
  • [25] N. Nguyen-Thanh, H. Le-Duc, D. Ta, and V. Nguyen (2016) Energy efficient techniques using fft for deep convolutional neural networks. In 2016 International Conference on Advanced Technologies for Communications (ATC), pp. 231–236. Cited by: §2.2.
  • [26] H. Qin, R. Gong, X. Liu, M. Shen, Z. Wei, F. Yu, and J. Song (2020) Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2250–2259. Cited by: §1, §2.1, §4.2.3.
  • [27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §2.1.
  • [28] O. Rippel, J. Snoek, and R. P. Adams (2015) Spectral representations for convolutional neural networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. 2449–2457. External Links: Link Cited by: §2.2.
  • [29] O. Rippel, J. Snoek, and R. P. Adams (2015) Spectral representations for convolutional neural networks. In Advances in neural information processing systems, pp. 2449–2457. Cited by: §2.2.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1, §4.
  • [31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §4.
  • [32] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou (2019) And the bit goes down: revisiting the quantization of neural networks. arXiv preprint arXiv:1907.05686. Cited by: §2.1.
  • [33] S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355. Cited by: §1.
  • [34] S. Tan, R. Caruana, G. Hooker, and Y. Lou (2018) Distill-and-compare: auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 303–310. Cited by: §1.
  • [35] F. Tung and G. Mori (2018) Deep neural network compression by in-parallel pruning-quantization. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4, §4.
  • [36] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) Haq: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8612–8620. Cited by: §2.1, §4, §4.
  • [37] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng (2018) Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 4376–4384. Cited by: §2.1.
  • [38] Y. Wang, Y. Lu, and T. Blankevoort (2020) Differentiable joint pruning and quantization for hardware efficiency. In European Conference on Computer Vision, pp. 259–277. Cited by: §1, §2.1, §4, §4, §4.
  • [39] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu (2016) Cnnpack: packing convolutional neural networks in the frequency domain. In Advances in neural information processing systems, pp. 253–261. Cited by: §2.2.
  • [40] S. Wu, G. Li, F. Chen, and L. Shi (2018) Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680. Cited by: §4, §4.
  • [41] M. Xiao, S. Zheng, C. Liu, Y. Wang, D. He, G. Ke, J. Bian, Z. Lin, and T. Liu (2020) Invertible image rescaling. arXiv preprint arXiv:2005.05650. Cited by: §2.2.
  • [42] K. Xu, M. Qin, F. Sun, Y. Wang, Y. Chen, and F. Ren (2020) Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1740–1749. Cited by: §2.2.
  • [43] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 365–382. Cited by: §1, §2.1, §4, §4, §4.
  • [44] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §2.1.

1 Quantization Error of FAT

Figure 1: Flowchart for the proof of Theorem 1.
Theorem 1.

Assume the given weight $\mathbf{W}$ before the transform follows a zero-mean distribution. Then the following two inequalities hold for uniform and logarithmic quantization, respectively:

$\mathbb{E}\big[\|\tilde{\mathbf{W}} - Q_u(\tilde{\mathbf{W}})\|^2\big] \le \mathbb{E}\big[\|\mathbf{W} - Q_u(\mathbf{W})\|^2\big]$ (1a)
$\mathbb{E}\big[\|\tilde{\mathbf{W}} - Q_l(\tilde{\mathbf{W}})\|^2\big] \le \mathbb{E}\big[\|\mathbf{W} - Q_l(\mathbf{W})\|^2\big]$ (1b)
Proof.

We first show that the proposed FAT tightens the weights towards zero by attenuating the weight amplitude (Eq. 2a to Eq. 4). We then prove that the quantization error can be approximated as the sum of two parts, quantization noise and clipping noise, and written as a function of the clipping threshold and the weight amplitude. Since the clipping threshold is learnable during backpropagation, we assume it reaches its optimal value during training, so the quantization error becomes positively correlated with the weight amplitude only. Since the amplitude is attenuated, the quantization error decreases under the proposed FAT. Figure 1 shows the flow of the proof.

We use $\mathbf{w}$ and $\mathbf{f}$ to denote a (flattened) weight vector and its frequency map, respectively. The 1-D Discrete Fourier transform can be expressed by matrix multiplication with the DFT matrix $\mathbf{D}$ as:

$\mathbf{f} = \mathbf{D}\mathbf{w}$ (2a)
$\mathbf{w} = \mathbf{D}^{-1}\mathbf{f}$ (2b)

where $\mathbf{D}_{u,j} = e^{-\mathrm{i}\,2\pi u j/n}$ with $\mathrm{i} = \sqrt{-1}$. Therefore, the transformed weight $\tilde{\mathbf{w}}$ can be formalized using the soft mask $\mathbf{m}$ as:

$\tilde{\mathbf{w}} = \mathbf{D}^{-1}\big(\mathbf{m} \odot (\mathbf{D}\mathbf{w})\big)$ (3)

Since a Hadamard product can be exchanged with multiplication by the corresponding diagonal matrix, we have

$\tilde{\mathbf{w}} = \mathbf{D}^{-1}\,\mathrm{diag}(\mathbf{m})\,\mathbf{D}\,\mathbf{w}$ (4)

Since the mask is generated with the Sigmoid function, all of its elements lie in $(0, 1)$ during training. Hence, the proposed FAT helps tighten the original weights towards zero. We use $a$ and $\tilde{a}$ to denote the amplitudes of $\mathbf{w}$ and $\tilde{\mathbf{w}}$, respectively, and we have $\tilde{a} \le a$.

Quantization Error Formulation. According to [1], the quantization error can be divided into two terms, namely quantization noise and clipping noise. Without loss of generality, we use $X$ to denote the full-precision random variable instead of the weight notation; $X$ is assumed to be zero-centered. The whole quantization error is formalized as below:

(5)

where $\Delta$ is the approximate interval between quantization levels under the uniform quantizer, and $\alpha$ and $b$ are the clipping value and the bitwidth, respectively. The quantization noise (the second term) accounts for the error inside the clipping threshold, and the clipping noise (the first and third terms) accounts for the error outside it.

In Eq. 5, [1] ignores the actual data amplitude and integrates the clipping noise out to $+\infty$ and $-\infty$, which is inconsistent with real weights, whose extreme values are far from $+\infty$ or $-\infty$. In the following, we derive a more accurate expression of the quantization error by taking the data amplitude into account and, more generally, extend the quantization error function from the uniform quantizer to the logarithmic quantizer.

Uniform Quantization Noise. In Eq. 5, the second term is the quantization noise. By approximating the density function with a piece-wise linear construction, the quantization noise can be approximated as:

(6)

By substituting the expression of the quantization interval $\Delta$ into Eq. 6, the quantization noise can be further approximated as:

(7)

Logarithmic Quantization Noise. The quantization levels of logarithmic quantization are powers-of-two values or zero:

(8)

where $\alpha$ is the clipping value and $b$ is the bitwidth.

Since the quantization levels and the weight distribution are both symmetric about zero, we can double the quantization noise computed on the positive axis. To approximate the error, we divide the positive clipping range into two sub-intervals. Therefore, the quantization noise can be formalized as:

(9)

We assume that all values are rounded to the midpoint of their interval. Eqs. 10 and 11 calculate the quantization noise in the two sub-intervals, respectively:

(10)
(11)

By substituting Eq. 10 and 11 into Eq. 9, the quantization noise of logarithmic quantization is as follows:

(12)

Clipping Noise. The clipping noise accounts for the error outside the clipping threshold and does not involve the setting of quantization levels. Therefore, the clipping noise is identical for the uniform and logarithmic quantizers. From Eq. 5, for distributions symmetric about zero, such as the assumed weight distribution, the first and third terms are equal. Hence, the clipping noise can be written as:

(13)

To approximate the clipping noise more accurately, we take the real dynamic range into consideration. We denote the data amplitude as $a$ and integrate up to $a$ instead of infinity. With this limitation on the range of $X$, Eq. 13 is reformulated as:

(14)

For the assumed weight distribution, the cumulative distribution function can be formalized as:

(15)

By substituting Eq. 15 into Eq. 14, the clipping noise term can be approximated as a function of $\alpha$ and $a$ as follows:

(16)

Quantization Error related to $\alpha$ and $a$. Combining Eqs. 7 and 16, we obtain the uniform quantization error as a function of $\alpha$ and $a$:

(17)

Putting Eqs. 12 and 16 together, the logarithmic quantization error can be formalized as:

(18)

Both quantization errors are functions of $\alpha$ and $a$.

Finding the Optimal $\alpha$. The clipping value $\alpha$ is a trainable parameter during backpropagation, so ideally we can find an optimal $\alpha^*$ that minimizes the quantization error. In order to make a fair comparison between $\mathbf{W}$ and $\tilde{\mathbf{W}}$, we treat the quantization error as a function of the amplitude $a$ only, by finding the optimal $\alpha^*$ for every selected $a$.

For uniform quantization, the optimal $\alpha^*$ can be found by solving the following equation:

(19)

Similarly, the optimal $\alpha^*$ for logarithmic quantization is the solution of the equation below:

(20)

Although it is hard to find analytical solutions, we can always search for numerical ones.

Quantization Error related to $a$ only. By substituting Eq. 19 into Eq. 17 and Eq. 20 into Eq. 18, we obtain expressions of the quantization errors related to $a$ only; Eqs. 21 and 22 give the uniform and logarithmic cases, respectively:

(21)
(22)

Figure 2 visualizes the two resulting curves, which show that the quantization error keeps rising as the amplitude $a$ increases in both cases. Because $\tilde{a} \le a$, we can conclude that the transformed weight has no larger quantization error than the original weight, for both uniform and logarithmic quantization.

Figure 2: With the optimal clipping value $\alpha^*$ chosen for each amplitude $a$, the curves show that the quantization error increases with the amplitude for both uniform and logarithmic quantization.
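The trend in Figure 2 can be reproduced qualitatively with a small Monte-Carlo sketch (our own setup, not the paper's: zero-mean Gaussian samples rescaled to a given amplitude, a grid search standing in for the optimal clipping value, and a signed 4-bit uniform quantizer):

```python
import torch

def mse_uniform(x, alpha, b=4):
    """MSE of signed symmetric b-bit uniform quantization with clipping value alpha."""
    levels = 2 ** (b - 1) - 1
    q = torch.round(torch.clamp(x, -alpha, alpha) / alpha * levels) / levels * alpha
    return ((x - q) ** 2).mean().item()

torch.manual_seed(0)
base = torch.randn(100_000)
for amp in [0.5, 1.0, 2.0, 4.0]:
    x = base / base.abs().max() * amp                   # rescale the samples to amplitude `amp`
    best = min(mse_uniform(x, a) for a in torch.linspace(0.05 * amp, amp, 50).tolist())
    print(f"amplitude {amp}: quantization error with best alpha = {best:.2e}")
```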

2 Informative Discretization Gradient

In the following we derive the gradients of the transformed weights with respect to the mask and to the original weights during backward propagation. To show the chain rule clearly and to avoid redundant symbols, we introduce new notations for the intermediate variables:

(23a)
(23b)

Then $\tilde{\mathbf{W}}$ can be represented in terms of these intermediate variables.

2.1 Gradient of $\tilde{\mathbf{W}}$ with respect to $\mathbf{W}$

Theorem 2.

For each filter, the gradient of $\tilde{\mathbf{W}}$ with respect to $\mathbf{W}$ during backward propagation is an $n \times n$ matrix whose entries are given by

(24)

where $k$ and $j$ index the transformed and original neurons, respectively.

Proof.

In the main paper, Figure 4 illustrates how to obtain the gradient matrix from the 3-D gradient tensor; here we focus on how to compute the gradient entries.

According to the chain rule, the gradient can be calculated as follows:

(25)

It is worth noting that, since the masked frequency map is obtained from the mask and the frequency map by an element-wise product, each of its elements depends only on the corresponding elements of the mask and the frequency map, and not on the others.

The two terms on the right-hand side of Eq. 25 are easy to compute; their results are: