With the popularity of deep neural networks across various use cases, there is now an increasing demand for methods that make deep networks run efficiently on resource-constrained edge devices. These methods include model pruning, neural architecture search (NAS) and hand-crafted efficient networks built from novel architectural blocks (e.g. depth-wise separable or group convolutions, squeeze-excite blocks, etc.). Finally, we can also perform model quantization, where the weights and activations are quantized to lower bit-widths, allowing efficient fixed-point inference and reduced memory bandwidth usage.
Due to the surge in more efficient architectures found with NAS, newer and more general activation functions (such as Swish, H-swish and Leaky-ReLU) are replacing the traditional ReLU. Unlike ReLU, these activation functions also take on values below zero. Current state-of-the-art quantization schemes like PACT and LSQ assume unsigned quantization ranges for activation quantization, where all activation values below zero are discarded by quantizing them to zero. This works well for traditional ReLU-based architectures like ResNet, but leads to a significant loss of information when applied to modern architectures like EfficientNet and MixNet, which employ Swish activations. For example, LSQ achieves W4A4 quantization of preactivation-ResNet50 with no loss in accuracy, but incurs a substantial accuracy drop when quantizing EfficientNet-B0 to W4A4. (WxAy quantization indicates quantizing the weights of all layers to x bits and the output activations to y bits.) Naively using a signed quantization range to accommodate these negative values also results in a drop in performance.
To alleviate these drops in performance, which are commonly observed with very low-bit (2-, 3-, 4-bit) quantization, we propose a general asymmetric quantization scheme with a learnable offset parameter as well as a learnable scale parameter. We show that the proposed quantization scheme learns to accommodate the negative activation values differently for different layers and recovers the accuracy lost by LSQ, e.g. up to a 1.9% accuracy improvement over LSQ with W4A4 quantization and up to a 5.6% improvement with W2A2 quantization on EfficientNet-B0 (Table 2). To the best of our knowledge, ours is the first work to quantize modern architectures like EfficientNet and MixNet to such extremely low bit-widths.
Another problem faced by any gradient-based learnable quantization scheme is its sensitivity to initialization: a poor initialization can lead to high variance in final performance across multiple training runs. This problem is especially pronounced with min-max initialization. We show that an initialization scheme based on mean-squared-error (MSE) minimization [24, 25] for the offset and scale parameters leads to significantly more stable final performance than min-max initialization. We also compare this initialization scheme with the one proposed in LSQ.
In summary, our proposed method, called LSQ+, extends LSQ by adding a simple yet effective learnable offset parameter for activation quantization, recovering the accuracy lost on architectures that employ Swish-like activations. Our second contribution is to show the importance of proper initialization for stable training, especially in the low-bit regime.
2 Related Work
A good overview of the basics of quantization, including the differences between asymmetric and symmetric quantization, is given in the literature. In general, quantization methods can be classified into post-training methods, which work without fine-tuning, and quantization-aware training methods, which require fine-tuning.
Post-training quantization methods [2, 31, 6] optimize neural networks for quantization without full training, using only a small amount of data; [20, 4] go further and require no data at all. Although these methods work well for typical 8-bit quantization, they have not been able to achieve good accuracy for very low-bit (2-, 3-, 4-bit) quantization.
Quantization-aware training generally outperforms these methods on low-bit tasks, given enough time to optimize. Simulated quantization-aware training methods and improvements thereof are discussed in [9, 12, 17]. Essentially, operations are added to the neural network's computational graph that simulate how quantization would be performed on an actual device. Several recent papers improve on these methods by learning the quantization parameters, e.g. QIL, TQT and LSQ. This is the approach we build upon in this paper, but the asymmetric quantization scheme and initialization we propose could equally be applied to the other methods.
In a parallel line of research, some works [15, 18, 21] have applied knowledge distillation to quantization, with improved results. Some recent work has also been done on automatically learning the bit-widths alongside the quantization ranges. Note that our proposed method is orthogonal to these works and can thus be used jointly with them. Lastly, several papers have introduced quantization grids other than the uniform one we use, e.g. logarithmic or fully free-format quantization spaces. In this paper, we do not consider these, as their hardware implementations are inefficient, requiring costly lookup tables or approximations at runtime.
In LSQ, a symmetric quantization scheme with a trainable scale parameter is proposed for both weights and activations. Given a scale s and quantization bounds n and p for a bit-width b, this scheme is defined as:

    x_q = clamp(round(x/s), n, p),    x̂ = s · x_q        (1)

where round(·) indicates the round-to-nearest function and clamp(·, n, p) clamps all values between n and p. x_q and x̂ denote the coded bits and the quantized values, respectively. LSQ can make use of a signed or an unsigned quantization range. However, both are suboptimal for activation functions like Swish or Leaky-ReLU, which have skewed negative and positive ranges. (For example, the negative portion of the Swish activation lies only between roughly −0.28 and 0, whereas the positive portion is unbounded.) Using an unsigned quantization range, i.e. n = 0 and p = 2^b − 1, clamps all negative activations to zero, leading to a significant loss of information. On the contrary, using a signed quantization range, i.e. n = −2^(b−1) and p = 2^(b−1) − 1, will quantize all negative activations to integers in [n, 0) and all positive activations to [0, p], hence giving equal importance to the negative and positive portions of the activation function. However, this loses valuable precision for skewed distributions where the positive dynamic range is significantly larger than the negative one. In Sec. 4.1, we will show that both quantization schemes lead to a significant loss in accuracy when quantizing architectures with Swish activations.
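The trade-off above can be illustrated with a minimal pure-Python sketch of the symmetric quantize-dequantize operation in (1); the helper name `lsq_quantize`, the toy activation values, and the example scales are our own and not from the paper:

```python
def lsq_quantize(x, s, n, p):
    """Symmetric LSQ-style quantize-dequantize (a sketch of Eq. (1)):
    x_q = clamp(round(x / s), n, p);  x_hat = s * x_q."""
    x_q = min(max(round(x / s), n), p)
    return s * x_q

# Swish-like activations: a small negative tail, a larger positive range.
acts = [-0.25, -0.1, 0.0, 0.5, 1.7, 3.0]

# Unsigned 4-bit range [0, 15]: every negative activation collapses to 0.
unsigned = [lsq_quantize(a, 3.0 / 15, 0, 15) for a in acts]

# Signed 4-bit range [-8, 7]: negatives survive, but the positive range
# now gets only 7 levels, so positive values are represented more coarsely.
signed = [lsq_quantize(a, 3.0 / 7, -8, 7) for a in acts]
```

Running this shows exactly the dilemma described above: the unsigned grid discards the negative tail entirely, while the signed grid keeps it at the cost of a roughly twice coarser step size on the positive side.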
The proposed method, LSQ+, solves the above-mentioned problem with a more general learnable asymmetric quantization scheme for the activations, described in Sec. 3.1. Sec. 3.2 describes the initialization scheme used in LSQ+.
3.1 Learnable asymmetric quantization
As a solution to this problem, we propose a general asymmetric activation quantization scheme in which not only the scale parameter but also an offset parameter is learned during training to handle skewed activation distributions:

    x_q = clamp(round((x − β)/s), n, p),    x̂ = s · x_q + β        (2)

Here, the offset β and the scale s are both learnable. The gradient update of s is calculated using:

    ∂x̂/∂s = round((x − β)/s) − (x − β)/s    if n < (x − β)/s < p
           = n                               if (x − β)/s ≤ n
           = p                               if (x − β)/s ≥ p        (3)

and the gradient update of β is calculated using:

    ∂x̂/∂β = 0    if n < (x − β)/s < p
           = 1    otherwise        (4)

In (3) and (4), the straight-through estimator (STE) is used to approximate the gradients of the round and clamp functions.
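The forward pass and the STE gradients above can be sketched in a few lines of pure Python; the function names and the example inputs are ours, chosen only for illustration of Eqs. (2)-(4):

```python
def asym_quantize(x, s, beta, n, p):
    """Asymmetric quantize-dequantize, a sketch of Eq. (2):
    x_q = clamp(round((x - beta) / s), n, p);  x_hat = s * x_q + beta."""
    x_q = min(max(round((x - beta) / s), n), p)
    return s * x_q + beta

def ste_grads(x, s, beta, n, p):
    """Straight-through-estimator gradients of x_hat w.r.t. (s, beta),
    following Eqs. (3)-(4): inside the range the rounding is treated as
    identity (so d x_hat / d beta = 0); clamped values behave like
    x_hat = s*n + beta or x_hat = s*p + beta."""
    z = (x - beta) / s
    if z <= n:
        return n, 1.0
    if z >= p:
        return p, 1.0
    return round(z) - z, 0.0

# A value clamped at the top of an unsigned 4-bit range [0, 15]:
g_s, g_beta = ste_grads(x=10.0, s=0.25, beta=-0.25, n=0, p=15)    # z = 41
# An in-range value: the scale gradient is the signed rounding error.
g_s2, g_beta2 = ste_grads(x=0.6, s=0.25, beta=-0.25, n=0, p=15)   # z = 3.4
```

Note how the offset only receives gradient from clamped values; this property is what drives the learned-offset behavior discussed later in Sec. 5.1.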
For weight quantization, we use the symmetric signed quantization of (1), since layer weights are empirically observed to be distributed symmetrically around zero. Because of this, asymmetric quantization of activations has no additional cost during inference compared to symmetric quantization, since the additional offset term can be precomputed and incorporated into the bias at compilation time. For a layer with weights W and bias b,

    W·x̂ + b = W(s · x_q + β) + b = s · (W·x_q) + (W·β + b)        (5)

where β is broadcast across the activation tensor and the constant term W·β + b is precomputed.
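The bias-folding identity in (5) can be checked numerically with a toy linear layer; the matrices, scale and offset below are arbitrary illustration values, not from the paper:

```python
def matvec(W, x):
    """Plain matrix-vector product for small nested-list matrices."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Toy linear layer y = W @ x_hat + b, with asymmetrically quantized input
# x_hat = s * x_q + beta (per-tensor scale s and offset beta).
W = [[1.0, -2.0], [0.5, 4.0]]
b = [0.1, -0.3]
s, beta = 0.25, -0.278
x_q = [3.0, 7.0]  # integer codes produced by the quantizer

# Direct computation: dequantize, then apply the layer.
x_hat = [s * q + beta for q in x_q]
y_direct = [yi + bi for yi, bi in zip(matvec(W, x_hat), b)]

# Folded computation: the offset contribution W @ (beta * 1) is a constant,
# so it can be precomputed into the bias at compilation time (Eq. (5)).
b_folded = [beta * sum(row) + bi for row, bi in zip(W, b)]
y_folded = [s * yi + bi for yi, bi in zip(matvec(W, x_q), b_folded)]
```

Both paths produce the same output, which is why the learnable offset adds zero runtime overhead on fixed-point hardware.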
Table 1: Possible parametrizations of the proposed quantization scheme (2) for a bit-width b.

|Configuration|Scale s|Offset β|n|p|
|Config 1: Unsigned + Symmetric (LSQ)|trainable|N/A|0|2^b − 1|
|Config 2: Signed + Symmetric|trainable|N/A|−2^(b−1)|2^(b−1) − 1|
|Config 3: Signed + Asymmetric|trainable|trainable|−2^(b−1)|2^(b−1) − 1|
|Config 4: Unsigned + Asymmetric|trainable|trainable|0|2^b − 1|
Table 1 shows four possible parametrizations for the proposed quantization scheme in (2). Configurations 1 and 2 do not use an offset parameter and hence follow the learnable symmetric quantization scheme proposed in LSQ. Since Configuration 1 combines this symmetric scheme with an unsigned range, it corresponds exactly to the parametrization proposed in LSQ for activation quantization. Configurations 3 and 4 learn both the scale and the offset parameter for activation quantization, differing only in their use of signed vs. unsigned quantization ranges. We analyze these different parametrizations in the experiments section.
3.2 Initialization of quantization parameters
As we enter the extremely low bit-width regime with gradient-based learnable quantization methods, the final performance after training becomes highly sensitive to the initialization of the quantization hyperparameters. This sensitivity is amplified in the presence of depthwise separable convolutions, which are known to be challenging to quantize. In this work, we propose an initialization scheme for the scale and offset parameters that achieves significantly more stable, and sometimes better, performance than other initializations proposed in the literature [12, 7], while keeping the quantization configuration unchanged.
3.2.1 Scale initialization for weight quantization
As mentioned before, we use signed symmetric quantization for the weights (similar to Configuration 2) in our method; hence, no offset is used for weight quantization. LSQ proposes initializing the scale parameter with the square-root-normalized average absolute value of the layer weights, i.e. s_init = 2⟨|w|⟩/√p. For 2-, 3- or 4-bit quantization this results in a very large initial value. In our experiments, this initialization was observed to be far from the converged values of the scale parameters; one instance of this phenomenon is shown in Figure 1.
We fix this problem by using the statistics of the weight distribution, rather than the individual weight values, for the initialization. Similar to prior work, we use a Gaussian approximation for the weight distribution in each layer. Following this, we initialize the scale parameter for each layer by:

    s_init = max(|μ − 3σ|, |μ + 3σ|) / 2^(b−1)

where μ and σ are the mean (close to zero in practice) and standard deviation of the weights in that layer.
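A minimal sketch of this statistics-based initialization, assuming the max(|μ ± 3σ|)/2^(b−1) form given above (the helper name and the toy weight sample are ours):

```python
import math

def weight_scale_init(weights, bits):
    """Statistics-based scale initialization for weight quantization
    (a sketch of Sec. 3.2.1): treat the layer weights as roughly Gaussian
    and map their +/- 3-sigma spread onto the signed b-bit range, i.e.
    s_init = max(|mu - 3*sigma|, |mu + 3*sigma|) / 2^(bits - 1)."""
    mu = sum(weights) / len(weights)
    sigma = math.sqrt(sum((w - mu) ** 2 for w in weights) / len(weights))
    return max(abs(mu - 3 * sigma), abs(mu + 3 * sigma)) / 2 ** (bits - 1)

# Example: a symmetric, zero-mean weight sample and 4-bit quantization.
s = weight_scale_init([-0.2, -0.1, 0.0, 0.1, 0.2], bits=4)
```

Because the ±3σ interval covers almost all of a Gaussian's mass, this initialization is insensitive to individual extreme weights, unlike value-based schemes.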
3.2.2 Scale/offset initialization for activation quantization
Let x_min and x_max denote the min and max values of the activation function, e.g. x_min = 0 for ReLU and x_min ≈ −0.28 for Swish. (For unbounded activation functions, such as the positive portion of Swish, x_min or x_max can be estimated from a few forward passes.) Intuitively, full utilization of the quantization range is obtained when x_min is quantized to the lower bound of the quantization range and x_max to the upper bound. Following this intuition, an initialization for s and β should satisfy:

    s · n + β = x_min,    s · p + β = x_max        (6)

Solving these constraints yields:

    s_init = (x_max − x_min) / (p − n),    β_init = x_min − s_init · n        (7)
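The solution of the two endpoint constraints can be sketched and sanity-checked directly; the function name and the Swish-like example range are ours:

```python
def minmax_init(x_min, x_max, n, p):
    """Min-max initialization (a sketch of Eqs. (6)-(7)): choose s and beta
    so that x_min maps to the lower quantization bound n and x_max to the
    upper bound p:  s*n + beta = x_min  and  s*p + beta = x_max."""
    s = (x_max - x_min) / (p - n)
    beta = x_min - s * n
    return s, beta

# Swish-like activation range mapped onto an unsigned 4-bit grid [0, 15].
s, beta = minmax_init(-0.278, 3.0, n=0, p=15)
```

Substituting the result back confirms that the two extremes land exactly on the grid endpoints, i.e. the full quantization range is utilized.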
However, the above initialization is highly prone to outliers in the activation distribution, especially since the activation ranges are dynamic. To overcome this, we propose initializing the scale and offset parameters per layer by solving the MSE minimization problem, similar to [24, 25]:

    s_init, β_init = argmin_{s,β} ‖x̂(s, β) − x‖_F²        (8)

where x̂ is the quantize-dequantize operation of (2). We plug (2) into PyTorch's autograd functionality and optimize (8) for s and β over a few batches of data.
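The idea behind the MSE objective can be illustrated without autograd; the sketch below replaces the paper's gradient-based optimization with a coarse one-dimensional grid search (an assumption made purely for self-containedness), shrinking the scale from the min-max solution and keeping whichever pair minimizes the reconstruction error:

```python
def quant_dequant(x, s, beta, n, p):
    """Asymmetric quantize-dequantize of a scalar, as in Eq. (2)."""
    x_q = min(max(round((x - beta) / s), n), p)
    return s * x_q + beta

def mse_init(xs, n, p, num_candidates=50):
    """Hedged sketch of the MSE-based initialization: instead of PyTorch's
    autograd (as used in the paper), search scale candidates between 0 and
    the min-max scale, with beta tied to x_min, and keep the pair that
    minimizes the reconstruction error ||x_hat - x||^2."""
    x_min, x_max = min(xs), max(xs)
    s_mm = (x_max - x_min) / (p - n)  # min-max scale as reference
    best = None
    for i in range(1, num_candidates + 1):
        s = s_mm * i / num_candidates
        beta = x_min - s * n
        err = sum((quant_dequant(x, s, beta, n, p) - x) ** 2 for x in xs)
        if best is None or err < best[0]:
            best = (err, s, beta)
    return best[1], best[2]

xs = [-0.2, -0.1, 0.0, 0.3, 0.8, 1.0, 6.0]  # 6.0 acts as an outlier
s_init, beta_init = mse_init(xs, n=0, p=15)
```

Unlike min-max, the MSE criterion is free to clip the outlier when doing so buys finer resolution for the bulk of the distribution, which is precisely why it is more robust.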
Table 2: Top-1 accuracy on ImageNet for quantized EfficientNet-B0.

|Configuration|W2A2|W3A3|W4A4|
|Config 1: LSQ (Unsigned + Symmetric)|43.5%|67.5%|71.9%|
|Config 2: Signed + Symmetric|23.7%|54.8%|68.8%|
|Config 3: Signed + Asymmetric|49.1%|69.9%|73.5%|
|Config 4: Unsigned + Asymmetric|48.7%|69.3%|73.8%|
Table 3: Top-1 accuracy on ImageNet for quantized MixNet-S.

|Configuration|W2A2|W3A3|W4A4|
|Config 1: LSQ (Unsigned + Symmetric)|39.9%|64.3%|70.4%|
|Config 2: Signed + Symmetric|23.4%|62.1%|67.2%|
|Config 3: Signed + Asymmetric|42.5%|66.7%|71.6%|
|Config 4: Unsigned + Asymmetric|42.8%|66.1%|71.7%|
Table 4: Top-1 accuracy on ImageNet for quantized ResNet18.

|Configuration|W2A2|W3A3|W4A4|
|Config 1: LSQ (Unsigned + Symmetric)|66.7%|69.4%|70.7%|
|Config 2: Signed + Symmetric|64.7%|66.1%|69.2%|
|Config 3: Signed + Asymmetric|66.7%|69.4%|70.7%|
|Config 4: Unsigned + Asymmetric|66.8%|69.3%|70.8%|
We evaluate the effectiveness of our method by quantizing architectures with Swish activations to W2A2, W3A3 and W4A4. To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bit-widths. As a sanity check, we show that LSQ+ also maintains the performance of LSQ  on traditional architectures with ReLU activation function. Finally, we show the effect of using different initializations on the performance of the proposed quantization method. All experiments are performed on the ImageNet  dataset.
In all configurations and all experiments, the weight parameters are initialized with the pretrained floating-point weights of the deep network. Although we compare the effectiveness of different initializations for the scale/offset parameters in Sec. 4.3, we use our proposed initialization from Sec. 3.2 for the experiments in Secs. 4.1 and 4.2.
4.1 Results on Swish activation
Tables 2 and 3 show the performance impact of quantization with all the configurations of the proposed method on EfficientNet-B0 and MixNet-S, respectively. MixNet-S uses ReLU activations in the initial 3 layers and Swish activations in the rest of the layers. By using the learnable offset parameter, we observe a 1.6-1.9% and 1.2-1.3% performance improvement for W4A4 quantization on EfficientNet-B0 and MixNet-S, respectively (see Configurations 3 and 4 compared to Configuration 1, i.e. LSQ). The improvement from our proposed learnable asymmetric quantization scheme is most prominent for W2A2 quantization, e.g. 5.6% on EfficientNet-B0.
The performance of Configuration 3 (signed range + learnable offset) and Configuration 4 (unsigned range + learnable offset) is nearly identical across all bit-widths. Since the offset parameter is learned, the activation range is mapped appropriately onto the quantization range irrespective of whether that range is signed or unsigned.
Another interesting observation is that Configuration 2 performs consistently worse than all other configurations. Due to the lack of an offset parameter, only the non-negative half of the quantization levels is utilized by the positive part of the activation range, while the positive portion of the Swish activation is much larger than the negative portion, as mentioned in Section 3. Hence, compared to Configurations 3 and 4, which allocate all quantization levels to the actual activation range, Configuration 2 has poor utilization of its quantization range, leading to worse performance.
4.2 Results on ReLU activation
The results on ResNet reported in the LSQ paper use the pre-activation version of the ResNet architecture, which has a slightly higher top-1 ImageNet accuracy than the standard ResNet(s). Hence, for a fair comparison with other state-of-the-art methods, we run our own implementation of LSQ (Configuration 1) and all other configurations on the standard ResNets. Table 4 shows the quantization performance of all the configurations of the proposed method on ResNet18. Our implementation of LSQ (Configuration 1) achieves 70.7% accuracy with W4A4 quantization, matching or slightly exceeding the full-precision accuracy; this sanity check shows that our LSQ results are on par with the original LSQ paper. Moreover, Configurations 1, 3 and 4 outperform existing state-of-the-art methods, namely PACT, DSQ and QIL. It is worth noting that, unlike for EfficientNet and MixNet, there is almost no performance gap between Configurations 1, 3 and 4 when quantizing ResNet18. We attribute this to the fact that the ReLU activation function has no negative component.
4.3 Effect of quantization parameter initialization
In this section, we compare three schemes for initializing the quantization scale and offset parameters. Since we use symmetric quantization for the weights, no offset is used for weight quantization; likewise, Configurations 1 and 2 do not use an offset for activation quantization. The three compared initialization methods are as follows:

Min-max initialization. We use the minimum and maximum values of each layer's weights and activations (obtained over the first batch of input images) to initialize the quantization scale and offset parameters. This initialization scheme is formalized in (7).

LSQ initialization. The scale for both weight quantization and activation quantization is initialized as s_init = 2⟨|v|⟩/√p, where v denotes the layer weights or activations and p is the upper bound of the quantization range.

LSQ+ initialization. We initialize the weight quantization and activation quantization parameters as proposed in Sec. 3.2.
For these experiments, we quantize EfficientNet-B0 using Configuration 4 and perform multiple training runs with each of the initialization methods. Table 5 shows the variation in final performance across 5 training runs with each initialization. We observe high instability in the final performance with W2A2 quantization, especially with min-max initialization: with 2-bit quantization, the tail of the weight or activation distribution can easily skew the initial scale parameter. The LSQ initialization method, which initializes the weight quantization scale with the square-root-normalized mean absolute value, also shows higher variation across runs, because it yields an initial scale far from the converged value, as was shown in Figure 1.
Table 5: Mean accuracy and its variation over 5 training runs for each quantization parameter initialization.
5.1 Learned offset values
It is interesting to inspect the layer-wise offset values learned by the network. Figure 2 shows one such example, using Configuration 4 for W4A4 quantization of EfficientNet-B0. Note that no offset is used for quantizing the squeeze-excite layers, because the sigmoid activation function has no negative outputs. Also, no activation function is applied at the end of a bottleneck block in EfficientNet, so we use signed symmetric quantization for those activation layers; these layers are not shown in the plot.
We can observe that most of the learned offset values are negative, meaning that the activations are shifted "up" before being scaled and clamped to the quantization range. This shows that the quantization layers learn to accommodate the negative activation values. None of the learned β values lies below the minimum value x_min of the Swish activation function (red dotted line). This follows from (4): ∂x̂/∂β = 0 for all activations that fall strictly inside the quantization range, and with an unsigned range (n = 0), once β reaches x_min no activation is clamped at the lower bound anymore. Hence, the gradient for β vanishes as soon as β approaches x_min, and β is not pushed any lower.
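This saturation effect can be checked with a small pure-Python sketch of the offset gradient in (4); the scale, offset and activation samples are illustrative values of ours:

```python
def offset_grad(x, s, beta, n, p):
    """STE gradient of x_hat w.r.t. the offset beta (a sketch of Eq. (4)):
    0 for values strictly inside the quantization range, 1 for clamped ones."""
    z = (x - beta) / s
    return 0.0 if n < z < p else 1.0

# With an unsigned range (n = 0) and beta near the Swish minimum (~ -0.28),
# no activation falls below the lower clamping bound, so all in-range
# values contribute zero gradient and beta is not pushed any lower.
acts = [-0.27, -0.1, 0.3, 1.2, 2.5]
grads = [offset_grad(a, s=0.25, beta=-0.28, n=0, p=15) for a in acts]

# Values clamped at the top of the range still receive a gradient of 1.
g_top = offset_grad(10.0, s=0.25, beta=-0.28, n=0, p=15)
```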
5.2 Learned vs Fixed offset
A further observation from Figure 2 is that the learned offset in most layers stays away from the Swish minimum value. The reason is that representing the entire activation range with the quantization grid (cf. the constraints in (6)) makes the grid coarser, since the number of bits is fixed, causing a higher quantization error. Learning the s and β values thus trades off the resolution of the quantization grid against the proportion of the activation range covered by it; hence, the learned β values are not exactly equal to the minimum of the activation function. One might still wonder what performance is achieved when β is instead fixed to x_min for each layer. Table 6 shows the difference in performance between the fixed and learned offset methods.
In this work, targeting the low-bit quantization regime, we address two problems: (1) quantization of deep neural networks with signed activation functions, and (2) stability of the final performance of quantization-aware training with respect to initialization. To this end, we propose a general asymmetric quantization scheme with trainable scale and offset parameters that learns to accommodate negative activations without spending an extra sign bit. As shown in (5), using such asymmetric quantization for activations incurs zero runtime overhead. Our work is the first to quantize modern efficient architectures like EfficientNet and MixNet to extremely low bit-widths, and we show that LSQ+ significantly improves the performance of 2-, 3- and 4-bit quantization on these architectures. Our experiments with the traditional ReLU-based ResNet18 architecture show that LSQ+ can be used in place of LSQ everywhere without hurting performance. Finally, we show that an MSE-minimization-based initialization scheme for the activation quantization parameters leads to more stable performance, which is of high importance for low-bit quantization-aware training.
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2019) Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pp. 7948–7956.
- (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- (2020) ZeroQ: a novel zero shot quantization framework. CoRR abs/2001.00281.
- (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
- (2019) Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3009–3018.
- (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153.
- (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4852–4861.
- (2015) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1737–1746.
- (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
- (2019) Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324.
- (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066.
- (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
- (2019) QKD: quantization-aware knowledge distillation. arXiv preprint arXiv:1911.12491.
- (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
- (2019) Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR).
- (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
- (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
- (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334.
- (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
- (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941.
- (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- (2017) Fixed-point optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1203–1207.
- (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
- (2019) MixConv: mixed depthwise convolutional kernels. CoRR abs/1907.09595.
- (2020) Mixed precision DNNs: all you need is a good parametrization. In ICLR 2020.
- (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR).
- (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
- (2019) Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552.