1 Introduction
With the popularity of deep neural networks across various use cases, there is now an increasing demand for methods that make deep networks run efficiently on resource-constrained edge devices. These methods include model pruning, neural architecture search (NAS) and hand-crafted efficient networks built from novel architectural blocks (e.g. depthwise separable or group convolutions, squeeze-excite blocks, etc.). Finally, we can also perform model quantization, where the weights and activations are quantized to lower bitwidths, allowing efficient fixed-point inference and reduced memory bandwidth usage.
Due to the surge in more efficient architectures found with NAS, newer and more general activation functions (like Swish [22], H-swish [11], Leaky-ReLU) are replacing the traditional ReLU. Unlike ReLU, these activation functions also take on values below zero. Current state-of-the-art quantization schemes like PACT [5] and LSQ [7] assume unsigned quantization ranges for activation quantization, where all activation values below zero are discarded by quantizing them to zero. This works well for traditional ReLU-based architectures like ResNet [10], but leads to a significant loss of information when applied to modern architectures like EfficientNet [26] and MixNet [27], which employ Swish activations. For example, LSQ achieves W4A4 quantization of pre-activation ResNet50 with no loss in accuracy, but incurs a significant accuracy drop when quantizing EfficientNet-B0 to W4A4^{1} (^{1}W$x$A$y$ quantization indicates quantizing the weights of all layers to $x$ bits and the output activations of all layers to $y$ bits). Naively using a signed quantization range to accommodate these negative values also results in a drop in performance.
To alleviate these drops in performance, which are commonly observed with very low-bit (2-, 3-, 4-bit) quantization, we propose using a general asymmetric quantization scheme with a learnable offset parameter as well as a learnable scale parameter. We show that the proposed quantization scheme learns to accommodate the negative activation values differently for different layers and recovers the accuracy loss incurred by LSQ, e.g. up to a 1.9% accuracy improvement over LSQ with W4A4 quantization and up to a 5.6% improvement with W2A2 quantization on EfficientNet-B0 (see Table 2). To the best of our knowledge, ours is the first work to quantize modern architectures like EfficientNet and MixNet to extremely low bitwidths.
Another problem faced by any gradient-based learnable quantization scheme is its sensitivity to initialization: a poor initialization can lead to a high variance in final performance across multiple training runs. This problem is especially pronounced with min-max initialization (used in [1]). We show that using an initialization scheme based on mean-squared-error (MSE) minimization [24, 25] for the offset and scale parameters leads to significantly more stable final performance than min-max initialization. We also compare this initialization scheme with the one proposed in [7].
In summary, our proposed method, called LSQ+, extends LSQ [7] by adding a simple yet effective learnable offset parameter for activation quantization, which recovers the lost accuracy on architectures employing Swish-like activations. As a further contribution, we show the importance of proper initialization for stable training, especially in the low-bit regime.
2 Related Work
A good overview of the basics of quantization is given in [16], where the differences between asymmetric and symmetric quantization are explained. In general, we can classify quantization methods into post-training methods that work without fine-tuning and quantization-aware training methods that need fine-tuning.
Post-training quantization methods [2, 31, 6] optimize neural networks for quantization without full training, using only a small amount of data; [20, 4] do so without using any data at all. Although these methods work well for typical 8-bit quantization, they have not been able to achieve good accuracy for very low-bit (2-, 3-, 4-bit) quantization.
Quantization-aware training generally outperforms these methods on low-bit tasks, given enough time to optimize. Simulated quantization-aware training methods and improvements to them are discussed in [9, 12, 17]. Essentially, operations are added to the neural network computational graph that simulate how quantization would be done on an actual device. Several recent papers improve over these methods by learning the quantization parameters, e.g. QIL [14], TQT [13] and LSQ [7]. This is the approach we build upon in our paper, but the asymmetric quantization scheme and initialization we suggest could equally be used with any of these other methods.
In a parallel line of research, some works [15, 18, 21] have applied knowledge distillation to quantization, resulting in improved performance. Recent work [28] has also been done on automatically learning the bitwidth alongside the quantization ranges. Note that our proposed method is orthogonal to these works, and can thus be used jointly with them. Lastly, several papers have introduced quantization grids other than the uniform one we use: [19] and [29] quantize the network over a logarithmic or fully free-format quantization space, respectively. In this paper, we do not consider these, as their hardware implementations are inefficient, requiring a costly lookup table or approximations at runtime.
3 Method
In LSQ [7], a symmetric quantization scheme with a trainable scale parameter is proposed for both weights and activations. This scheme is defined as follows:
$$x_q = \Big\lfloor \mathrm{clamp}\Big(\frac{x}{s},\, n,\, p\Big) \Big\rceil, \qquad \hat{x} = x_q \cdot s \tag{1}$$
where $\lfloor \cdot \rceil$ indicates the round-to-nearest function and $\mathrm{clamp}(z, n, p)$ clamps all values of $z$ between $n$ and $p$. $x_q$ and $\hat{x}$ denote the coded bits and the quantized values, respectively. LSQ can make use of a signed or an unsigned quantization range. However, both are suboptimal for activation functions like Swish or Leaky-ReLU, which have skewed negative and positive ranges^{2} (^{2}For example, the negative portion of the Swish activation lies only between $-0.278$ and $0$, whereas the positive portion is unbounded.). Using an unsigned quantization range, i.e. $n = 0$ and $p = 2^b - 1$, clamps all negative activations to zero, leading to a significant loss of information. On the contrary, using a signed quantization range, i.e. $n = -2^{b-1}$ and $p = 2^{b-1} - 1$, will quantize all negative activations to integers in the range $[-2^{b-1}, 0)$ and all positive activations to $[0, 2^{b-1} - 1]$, hence giving equal importance to the negative and positive portions of the activation function. However, this loses valuable precision for skewed distributions where the positive dynamic range is significantly larger than the negative one. In Sec. 4.1, we will show that both quantization schemes lead to a significant loss in accuracy when quantizing architectures with Swish activations.
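To make the clipping behavior of (1) concrete, here is a minimal NumPy sketch of the simulated quantize-dequantize step (the function and variable names are ours, not from any quantization library; during training, the round operation is bypassed with the straight-through estimator so that gradients flow to $s$):

```python
import numpy as np

def lsq_quantize(x, s, n, p):
    """Simulated LSQ quantization (Eq. 1): scale, clamp to [n, p],
    round to the nearest integer, then rescale."""
    x_q = np.round(np.clip(x / s, n, p))   # coded integer values
    x_hat = x_q * s                        # dequantized values
    return x_q, x_hat

# Unsigned 4-bit range: n = 0, p = 2**4 - 1 = 15.
x = np.array([-0.5, 0.0, 0.37, 1.49, 3.2])
x_q, x_hat = lsq_quantize(x, s=0.1, n=0, p=15)
# Negative inputs are clamped to 0; values above p*s saturate at 1.5.
```

Note how the negative input is discarded entirely under the unsigned range, which is exactly the information loss discussed above for Swish-like activations.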
The proposed method, LSQ+, solves the above-mentioned problem with a more general learnable asymmetric quantization scheme for the activations, described in Sec. 3.1. Sec. 3.2 describes the initialization scheme used in LSQ+.
3.1 Learnable asymmetric quantization
As a solution to the above-mentioned problem, we propose a general asymmetric activation quantization scheme where not only the scale parameter but also an offset parameter is learned during training, to handle skewed activation distributions:
$$x_q = \Big\lfloor \mathrm{clamp}\Big(\frac{x - \beta}{s},\, n,\, p\Big) \Big\rceil, \qquad \hat{x} = x_q \cdot s + \beta \tag{2}$$
Here, the offset parameter $\beta$ and the scale $s$ are both learnable. The gradient update of the scale $s$ is calculated using:
$$\frac{\partial \hat{x}}{\partial s} = \begin{cases} -\frac{x-\beta}{s} + \big\lfloor \frac{x-\beta}{s} \big\rceil, & \text{if } n \le \frac{x-\beta}{s} \le p \\ n, & \text{if } \frac{x-\beta}{s} < n \\ p, & \text{if } \frac{x-\beta}{s} > p \end{cases} \tag{3}$$
And the gradient update of the offset $\beta$ is calculated using:
$$\frac{\partial \hat{x}}{\partial \beta} = \begin{cases} 0, & \text{if } n \le \frac{x-\beta}{s} \le p \\ 1, & \text{otherwise} \end{cases} \tag{4}$$
In (3) and (4), the straight-through estimator (STE) [3] is used to approximate the gradient through the round function, i.e. $\partial \lfloor z \rceil / \partial z \approx 1$. For weight quantization, we use symmetric signed quantization (1), since the layer weights are empirically observed to be distributed symmetrically around zero. Because of this, asymmetric quantization of activations has no additional cost during inference compared to symmetric quantization, since the additional offset term can be precomputed and incorporated into the bias at compile time:
$$\hat{w} * \hat{x} + b = \hat{w} * (s\, x_q + \beta) + b = s\,(\hat{w} * x_q) + (\hat{w} * \beta + b) \tag{5}$$
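The bias-folding identity in (5) can be checked numerically. The sketch below uses a fully connected layer for simplicity (a convolution behaves analogously, with the offset term absorbed per output channel); all names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))          # (dequantized) layer weights
b = rng.standard_normal(4)               # original bias
x_q = rng.integers(0, 16, size=8)        # 4-bit unsigned activation codes
s, beta = 0.1, -0.2                      # learned scale and offset

# Direct computation with dequantized activations x_hat = s * x_q + beta
x_hat = s * x_q + beta
y_direct = W @ x_hat + b

# Offset folded into the bias at compile time: b' = b + beta * sum_j W[:, j]
b_folded = b + beta * W.sum(axis=1)
y_folded = s * (W @ x_q) + b_folded

assert np.allclose(y_direct, y_folded)   # identical outputs, zero runtime cost
```

Since the folded bias depends only on the weights and the learned offset, it can be computed once at compilation time, which is why the asymmetric scheme is free at inference.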
Configuration  $s$  $\beta$  $n$  $p$

Config 1 : Unsigned + Symmetric (LSQ)  trainable  N/A  $0$  $2^b - 1$
Config 2 : Signed + Symmetric  trainable  N/A  $-2^{b-1}$  $2^{b-1} - 1$
Config 3 : Signed + Asymmetric  trainable  trainable  $-2^{b-1}$  $2^{b-1} - 1$
Config 4 : Unsigned + Asymmetric  trainable  trainable  $0$  $2^b - 1$
Table 1 shows four possible parametrizations for the proposed quantization scheme in (2). Configurations 1 and 2 do not use an offset parameter, hence following the learnable symmetric quantization scheme proposed in LSQ [7]. Since Configuration 1 uses an unsigned range with this symmetric quantization scheme, it corresponds exactly to the parametrization proposed in LSQ for activation quantization. Configurations 3 and 4 learn both the scale and offset parameter for activation quantization, the only difference being signed and unsigned quantization ranges. We will analyze these different parametrizations in the experiments section.
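The integer ranges behind the four configurations of Table 1 reduce to one simple rule; a small helper like the following (our own naming, not part of any library) makes the ranges explicit:

```python
def quant_range(bits, signed):
    """Integer quantization range (n, p) for b-bit quantization.

    Unsigned (Configs 1 and 4): [0, 2**b - 1].
    Signed   (Configs 2 and 3): [-2**(b-1), 2**(b-1) - 1].
    """
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# 4-bit examples:
assert quant_range(4, signed=False) == (0, 15)   # Configs 1 and 4
assert quant_range(4, signed=True) == (-8, 7)    # Configs 2 and 3
```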
3.2 Initialization of quantization parameters
As we enter the extremely low bitwidth regime with gradient-based learnable quantization methods, the final performance after training becomes highly sensitive to the initialization of the quantization hyperparameters. This sensitivity is amplified in the presence of depthwise separable convolutions, which are known to be challenging to quantize [30]. In this work, we propose an initialization scheme for the scale and offset parameters that achieves significantly more stable, and sometimes better, performance than other initializations proposed in the literature [12, 7], while keeping the quantization configuration unchanged.

3.2.1 Scale initialization for weight quantization
As mentioned before, we use signed symmetric quantization for the weights (similar to Configuration 2); hence, no offset is used for weight quantization. LSQ [7] proposes initializing the scale parameter with the square-root-normalized average absolute value of the layer weights, i.e. $s_{\text{init}} = 2\langle|w|\rangle/\sqrt{p}$. This leads to a very large initialization for 2-, 3- or 4-bit quantization, e.g. $s_{\text{init}} = 2\langle|w|\rangle/\sqrt{7}$ in the 4-bit case. In our experiments, this initialization was observed to be far from the converged values of the scale parameters. One instance of this phenomenon is shown in Figure 1.
We fix this problem by using the statistics of the weight distribution rather than the actual weight values for the initialization. Similar to [20], we use a Gaussian approximation for the weight distribution in each layer. Following this, we initialize the scale parameter for each layer as:

$$s_{\text{init}} = \frac{\max\big(|\mu - 3\sigma|,\; |\mu + 3\sigma|\big)}{2^{b-1}}$$

where $\mu$ and $\sigma$ are the mean (same as $\langle w \rangle$) and standard deviation of the weights in that layer.
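A minimal sketch of this statistics-based initialization, assuming the $\max(|\mu \pm 3\sigma|)$ form stated above (the helper name is ours):

```python
import numpy as np

def weight_scale_init(w, bits):
    """Gaussian-statistics-based scale init for signed symmetric
    weight quantization: s = max(|mu - 3*sigma|, |mu + 3*sigma|) / 2**(b-1).

    This covers roughly +/- 3 standard deviations of the weights with
    the signed b-bit grid, instead of depending on individual outliers."""
    mu, sigma = w.mean(), w.std()
    return max(abs(mu - 3 * sigma), abs(mu + 3 * sigma)) / 2 ** (bits - 1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=10_000)   # roughly Gaussian layer weights
s = weight_scale_init(w, bits=4)
# For near-zero-mean weights this is close to 3 * sigma / 2**(b-1).
```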
3.2.2 Scale/offset initialization for activation quantization
Let $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of the activation function. For example, $x_{\min} = 0$ for ReLU and $x_{\min} = -0.278$ in the case of Swish activations^{3} (^{3}For unbounded activation functions (e.g. the positive portion of Swish), $x_{\max}$ or $x_{\min}$ can be estimated from a few forward passes.). Intuitively, the quantization range is fully utilized when $x_{\min}$ is quantized to the lower bound of the quantization range and $x_{\max}$ to the upper bound. Following this intuition, an initialization for $s$ and $\beta$ would satisfy:
$$s_{\text{init}} \cdot n + \beta_{\text{init}} = x_{\min}, \qquad s_{\text{init}} \cdot p + \beta_{\text{init}} = x_{\max} \tag{6}$$
Solving these constraints yields:
$$s_{\text{init}} = \frac{x_{\max} - x_{\min}}{p - n}, \qquad \beta_{\text{init}} = x_{\min} - s_{\text{init}} \cdot n \tag{7}$$
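As a quick sanity check of (7), the following sketch solves the two constraints of (6) for a Swish-like activation range (the upper bound 6.0 is an illustrative placeholder for an estimate from a few forward passes):

```python
def minmax_init(x_min, x_max, n, p):
    """Solve Eq. (6): map x_min to integer level n and x_max to level p."""
    s = (x_max - x_min) / (p - n)
    beta = x_min - s * n
    return s, beta

# 4-bit unsigned range (n=0, p=15), Swish-like activation range [-0.278, 6.0]:
s, beta = minmax_init(-0.278, 6.0, n=0, p=15)
assert abs(s * 0 + beta - (-0.278)) < 1e-9   # level n dequantizes to x_min
assert abs(s * 15 + beta - 6.0) < 1e-9       # level p dequantizes to x_max
```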
However, the above initialization is highly prone to outliers in the activation distribution, especially since the activation ranges are dynamic. To overcome this, we propose initializing the scale and offset parameters per layer by solving the MSE minimization problem, similar to [24, 25]:

$$\min_{s,\,\beta}\; \big\lVert \hat{x}(s, \beta) - x \big\rVert_F^2 \tag{8}$$

where $\hat{x}$ is given by (2). There is no closed-form solution to (8); hence, we embed equations (3) and (4) into PyTorch's autograd functionality to optimize for $s$ and $\beta$ over a few batches of data.

Method  W2A2  W3A3  W4A4

Config 1 : LSQ (Unsigned + Symmetric)  43.5%  67.5%  71.9% 
Config 2 : Signed + Symmetric  23.7%  54.8%  68.8% 
Config 3 : Signed + Asymmetric  49.1%  69.9%  73.5% 
Config 4 : Unsigned + Asymmetric  48.7%  69.3%  73.8% 
Method  W2A2  W3A3  W4A4 

Config 1 : LSQ (Unsigned + Symmetric)  39.9%  64.3%  70.4% 
Config 2 : Signed + Symmetric  23.4%  62.1%  67.2% 
Config 3 : Signed + Asymmetric  42.5%  66.7%  71.6% 
Config 4 : Unsigned + Asymmetric  42.8%  66.1%  71.7% 
Method  W2A2  W3A3  W4A4 

PACT [5]  64.4%  68.1%  69.2% 
DSQ [8]  65.2%  68.7%  69.6% 
QIL [14]  65.7%  69.2%  70.1% 
Config 1 : LSQ (Unsigned + Symmetric)  66.7%  69.4%  70.7% 
Config 2 : Signed + Symmetric  64.7%  66.1%  69.2% 
Config 3 : Signed + Asymmetric  66.7%  69.4%  70.7% 
Config 4 : Unsigned + Asymmetric  66.8%  69.3%  70.8% 
4 Experiments
We evaluate the effectiveness of our method by quantizing architectures with Swish activations to W2A2, W3A3 and W4A4. To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bitwidths. As a sanity check, we show that LSQ+ also maintains the performance of LSQ [7] on traditional architectures with ReLU activation function. Finally, we show the effect of using different initializations on the performance of the proposed quantization method. All experiments are performed on the ImageNet [23] dataset.
In all configurations and all experiments, the weight parameters are initialized with the pretrained floating point weights of the deep network. Although we will compare the effectiveness of different initializations for the scale/offset parameters in Sec 4.3, we use our proposed initialization from Sec 3.2 for experiments in sections 4.1 and 4.2.
4.1 Results on Swish activation
Tables 2 and 3 show the impact of quantization with all configurations of the proposed method on EfficientNet-B0 [26] and MixNet-S [27], respectively. MixNet-S uses ReLU activation in the first 3 layers and Swish activation in the rest of the layers. By using the learnable offset parameter, we observe up to a 1.9% and 1.3% performance improvement for W4A4 quantization on EfficientNet-B0 and MixNet-S, respectively (compare Configurations 3 and 4 with Configuration 1 (LSQ)). The improvement from our proposed learnable asymmetric quantization scheme is most prominent in the case of W2A2 quantization.
The performance of Configuration 3 (signed range + learnable offset) and Configuration 4 (unsigned range + learnable offset) is nearly identical for all bitwidths. Since we learn the offset parameter, the activation range is appropriately mapped onto the quantization range irrespective of whether that range is signed or unsigned.
Another interesting observation is that Configuration 2 performs consistently worse than all other configurations. Due to the lack of an offset parameter, only the $2^{b-1}$ non-negative quantization levels are utilized by the positive part of the activation range, even though the positive portion of the Swish activation is much larger than the negative portion, as mentioned in Section 3. Hence, compared to Configurations 3 and 4, which allocate all $2^b$ quantization levels to the entire activation range, Configuration 2 utilizes its quantization range poorly, leading to worse performance.
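The under-utilization argument can be reproduced numerically. The sketch below draws a skewed, Swish-like activation sample and counts how many of the sixteen 4-bit levels each configuration actually uses; the scale choices here are illustrative min-max fits, not trained values:

```python
import numpy as np

def codes(x, s, beta, n, p):
    """Integer codes assigned by the quantizer of Eq. (2)."""
    return np.clip(np.round((x - beta) / s), n, p)

# Swish-like activations: small negative tail, much larger positive range.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(-0.278, 0, 200), rng.uniform(0, 6.0, 2000)])

# Config 2 (signed symmetric, 4-bit): scale chosen to cover the max value.
n2, p2 = -8, 7
s2 = x.max() / p2
used2 = np.unique(codes(x, s2, 0.0, n2, p2)).size

# Config 4 (unsigned asymmetric, 4-bit): min-max scale and offset.
n4, p4 = 0, 15
s4 = (x.max() - x.min()) / (p4 - n4)
beta4 = x.min()
used4 = np.unique(codes(x, s4, beta4, n4, p4)).size

# Config 2 leaves its negative codes essentially unused (the tiny negative
# tail collapses onto code 0), while Config 4 spans the whole grid.
```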
4.2 Results on ReLU activation
The ResNet results shown in the LSQ paper [7] use the pre-activation version of the ResNet architecture [10], which has a slightly higher top-1 ImageNet accuracy than the standard ResNet(s). Hence, for a fair comparison with other state-of-the-art methods, we run our own implementation of LSQ (Configuration 1) and all other configurations on the standard ResNets. Table 4 shows the quantization performance of all configurations of the proposed method on ResNet18. Our implementation of LSQ (Configuration 1) achieves 70.7% accuracy with W4A4 quantization, slightly above the full-precision accuracy; this sanity check shows that our LSQ results are on par with the original LSQ paper [7]. Also, Configurations 1, 3 and 4 outperform the existing state-of-the-art methods PACT [5], DSQ [8] and QIL [14]. It is worth noting that, unlike for EfficientNet and MixNet, there is almost no performance gap between Configurations 1, 3 and 4 when quantizing ResNet18. We attribute this to the fact that the ReLU activation function has no negative component.
4.3 Effect of quantization parameter initialization
In this section, we compare three schemes for initializing the quantization scale and offset parameters. Since we use symmetric quantization for the weights, no offset is used for weight quantization; likewise, Configurations 1 and 2 do not use an offset for activation quantization. The three initialization methods compared are as follows:

Min-max initialization. We use the minimum and maximum values of each layer's weights and activations (obtained over the first batch of input images) to initialize the quantization scale and offset parameters. This initialization scheme is formalized in (7).

LSQ initialization. The scale for both weight and activation quantization is initialized as $s_{\text{init}} = 2\langle|x|\rangle/\sqrt{p}$, where $x$ denotes the layer weights or activations and $p$ is the upper bound of the quantization range.

LSQ+ initialization. We initialize the weight and activation quantization parameters as proposed in Sec. 3.2.
For these experiments, we quantize EfficientNet-B0 using Configuration 4 and perform multiple training runs with each of these initialization methods. Table 5 shows the variation (standard deviation $\sigma$) in the final performance across 5 training runs with each initialization. We observe high instability in the final performance with W2A2 quantization, especially with min-max initialization. This is because, at 2 bits, the tails of the weight or activation distribution can easily skew the scale parameter initialization. The LSQ initialization, which initializes the weight quantization scale with the square-root-normalized mean absolute value, also shows a higher variation in training performance; as shown in Figure 1, it leads to a large $s_{\text{init}}$ that is far from the converged value.
Quantization Parameter Initialization  Mean Acc  

W4A4  W2A2  
Min-max  %  % 
LSQ  %  % 
LSQ+  %  % 
5 Discussion
5.1 Learned offset values
It is interesting to observe the layer-wise offset values $\beta$ learned by the network. Figure 2 shows one such example with Configuration 4 for W4A4 quantization of EfficientNet-B0. Note that an offset is not used for quantizing the squeeze-excite layers, because the sigmoid activation function has no negative component. Also, no activation is applied at the end of a bottleneck block in EfficientNet; hence we use symmetric signed quantization for those activation layers. These layers are not shown in the plot.
We can observe that most of the $\beta$ values are negative, meaning that the activations are shifted "up" before being scaled and clamped to the quantization range. This shows that the quantization layers learn to accommodate the negative activation values. None of the learned $\beta$ values is lower than the minimum value of the Swish activation function (red dotted line). This is because, from (4), $\partial\hat{x}/\partial\beta = 0$ whenever $n \le (x-\beta)/s \le p$; once $\beta \le \min(x)$, no activation falls below the lower clipping bound, so there is no longer any gradient pushing $\beta$ lower.
5.2 Learned vs Fixed offset
On further inspection of Figure 2, the learned offset for most layers stays away from the Swish minimum value. This is because representing the entire activation range on the quantization grid (cf. (6)) makes the grid coarser for a fixed number of bits, causing a higher quantization error. The purpose of learning the $s$ and $\beta$ values is to learn this trade-off between the resolution of the quantization grid and the proportion of the activation range it covers. Hence, the learned $\beta$ values are not exactly equal to the minimum value of the activation function. One might still wonder how the method performs when $\beta$ for each layer is fixed to $x_{\min}$. Table 6 shows the difference in performance between the fixed- and learned-offset methods.
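This trade-off can be illustrated with a small numerical experiment: fixing $\beta = x_{\min}$ with the full min-max scale is compared against a crude grid search over $(s, \beta)$, standing in for the gradient-based MSE fit of (8). The data distribution and all names here are illustrative:

```python
import numpy as np

def dequant(x, s, beta, n, p):
    """Quantize-dequantize per Eq. (2)."""
    x_q = np.clip(np.round((x - beta) / s), n, p)
    return s * x_q + beta

def mse(x, s, beta, n, p):
    return np.mean((dequant(x, s, beta, n, p) - x) ** 2)

rng = np.random.default_rng(0)
# Skewed Swish-like activations: tiny negative part, long positive tail.
x = np.concatenate([rng.uniform(-0.278, 0, 500), rng.exponential(1.0, 5000)])
n, p = 0, 15  # 4-bit unsigned range

# Fixed beta = x_min, scale covering the full min-max range:
beta_fix = x.min()
s_fix = (x.max() - x.min()) / p
err_fix = mse(x, s_fix, beta_fix, n, p)

# Grid search over (s, beta) as a stand-in for the gradient-based fit:
err_best, s_best, beta_best = min(
    (mse(x, s, b, n, p), s, b)
    for s in np.linspace(0.01, s_fix, 60)
    for b in np.linspace(x.min(), 0.0, 20)
)
# The MSE-optimal (s, beta) trades away part of the range for a finer grid.
```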
Method  W4A4 

Fixed $\beta = 0$ (LSQ)  % 
Fixed $\beta = x_{\min}$  % 
Learned $\beta$  % 
6 Conclusion
In this work, targeting the low-bit quantization regime, we address two problems: (1) quantization of deep neural networks with signed activation functions and (2) stability of the final performance with respect to the initialization of the quantization parameters. For the first, we propose a general asymmetric quantization scheme with trainable scale and offset parameters that can learn to accommodate negative activations without spending an extra sign bit. In (5), we show that using such asymmetric quantization for activations incurs zero runtime overhead. Our work is the first to quantize modern efficient architectures like EfficientNet and MixNet to extremely low bitwidths. We show that LSQ+ significantly improves the performance of 2-, 3- and 4-bit quantization on these architectures. Our experiments with the traditional ReLU-based ResNet18 architecture show that LSQ+ can be used instead of LSQ everywhere without hurting performance. Finally, we show that using an MSE-minimization-based initialization scheme for the quantization parameters leads to more stable performance, which is of high importance for low-bit quantization-aware training.
References
 [1] (2015) TensorFlow: largescale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §1.
 [2] (2019) Post training 4-bit quantization of convolutional networks for rapid deployment. In Advances in Neural Information Processing Systems, pp. 7948–7956. Cited by: §2.

 [3] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.1.
 [4] (2020) ZeroQ: a novel zero shot quantization framework. CoRR abs/2001.00281. External Links: Link, 2001.00281 Cited by: §2.
 [5] (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. External Links: 1805.06085 Cited by: LSQ+: Improving low-bit quantization through learnable offsets and better initialization, §1, Table 4, §4.2.

 [6] (2019) Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3009–3018. Cited by: §2.
 [7] (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153. Cited by: LSQ+: Improving low-bit quantization through learnable offsets and better initialization, §1, §1, §1, §2, §3.1, §3.2.1, §3.2, §3, §4.2, §4.
 [8] (2019) Differentiable soft quantization: bridging fullprecision and lowbit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4852–4861. Cited by: Table 4, §4.2.
 [9] (2015) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1737–1746. Cited by: §2.
 [10] (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §1, §4.2.
 [11] (2019) Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: §1.

 [12] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.2.
 [13] (2019) Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066 2 (3), pp. 7. Cited by: §2.
 [14] (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359. Cited by: §2, Table 4, §4.2.
 [15] (2019) QKD: quantizationaware knowledge distillation. arXiv preprint arXiv:1911.12491. Cited by: §2.
 [16] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. External Links: 1806.08342 Cited by: §2.
 [17] (2019) Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR), Cited by: §2.
 [18] (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852. Cited by: §2.
 [19] (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025. External Links: 1603.01025 Cited by: §2.
 [20] (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: §2, §3.2.1.
 [21] (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.
 [22] (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §1.
 [23] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.
 [24] (2017) Fixedpoint optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp. 1203–1207. Cited by: §1, §3.2.2.
 [25] (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488. Cited by: §1, §3.2.2.
 [26] (2019) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §1, §4.1.
 [27] (2019) Mixconv: mixed depthwise convolutional kernels. CoRR, abs/1907.09595. Cited by: §1, §4.1.
 [28] (2020) Mixed precision dnns: all you need is a good parametrization. In ICLR 2020, Cited by: §2.
 [29] (2017) Soft weightsharing for neural network compression. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2.
 [30] (2019) Haq: hardwareaware automated quantization with mixed precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8612–8620. Cited by: §3.2.
 [31] (2019) Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552. Cited by: §2.