LSQ+: Improving low-bit quantization through learnable offsets and better initialization

04/20/2020 ∙ by Yash Bhalgat, et al. ∙ Qualcomm ∙ Seoul National University

Unlike ReLU, newer activation functions (like Swish, H-swish, Mish) that are frequently employed in popular efficient architectures can also result in negative activation values, with skewed positive and negative ranges. Typical learnable quantization schemes [PACT, LSQ] assume unsigned quantization for activations and quantize all negative activations to zero, which leads to a significant loss in performance. Naively using signed quantization to accommodate these negative values requires an extra sign bit, which is expensive for low-bit (2-, 3-, 4-bit) quantization. To solve this problem, we propose LSQ+, a natural extension of LSQ, in which we introduce a general asymmetric quantization scheme with trainable scale and offset parameters that can learn to accommodate the negative activations. Gradient-based learnable quantization schemes also commonly suffer from high instability or variance in the final training performance, hence requiring a great deal of hyperparameter tuning to reach a satisfactory performance. LSQ+ alleviates this problem by using an MSE-based initialization scheme for the quantization parameters. We show that this initialization leads to significantly lower variance in final performance across multiple training runs. Overall, LSQ+ shows state-of-the-art results for EfficientNet and MixNet and also significantly outperforms LSQ for low-bit quantization of neural nets with Swish activations (e.g., 1.8% gain with W4A4 quantization and up to 5.6% gain with W2A2 quantization of EfficientNet-B0 on the ImageNet dataset). To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bit-widths.

1 Introduction

With the popularity of deep neural networks across various use-cases, there is now an increasing demand for methods that make deep networks run efficiently on resource-constrained edge devices. These methods include model pruning, neural architecture search (NAS) and hand-crafted efficient networks built from novel architectural blocks (e.g. depth-wise separable or group convolutions, squeeze-excite blocks, etc.). Finally, we can also perform model quantization, where the weights and activations are quantized to lower bit-widths, allowing efficient fixed-point inference and reduced memory bandwidth usage.

Due to the surge in more efficient architectures found with NAS, newer and more general activation functions (like Swish [22], H-swish [11], Leaky-ReLU) are replacing the traditional ReLU. Unlike ReLU, these activation functions also take on values below zero. Current state-of-the-art quantization schemes like PACT [5] and LSQ [7] assume unsigned quantization ranges for activation quantization, where all activation values below zero are discarded by quantizing them to zero. This works well for traditional ReLU-based architectures like ResNet [10], but leads to a significant loss of information when applied to modern architectures like EfficientNet [26] and MixNet [27], which employ Swish activations. For example, LSQ achieves W4A4 quantization of preactivation-ResNet50 with no loss in accuracy, but incurs a significant accuracy loss when quantizing EfficientNet-B0 to W4A4 (here, WnAn quantization indicates quantizing the weights and output activations of all layers to n bits). Naively using a signed quantization range to accommodate these negative values also results in a drop in performance.

To alleviate these drops in performance, which are commonly observed with very low-bit (2-, 3-, 4-bit) quantization, we propose using a general asymmetric quantization scheme with a learnable offset parameter as well as a learnable scale parameter. We show that the proposed quantization scheme learns to accommodate the negative activation values differently for different layers and recovers the accuracy loss incurred by LSQ, e.g. a 1.8% accuracy improvement over LSQ with W4A4 quantization and up to 5.6% improvement with W2A2 quantization on EfficientNet-B0. To the best of our knowledge, ours is the first work to quantize modern architectures like EfficientNet and MixNet to extremely low bit-widths.

Another problem faced by any gradient-based learnable quantization scheme is its sensitivity to initialization: a poor initialization can lead to a high variance in final performance across multiple training runs. This problem is especially pronounced with min-max initialization (used in [1]). We show that using an initialization scheme based on mean-squared-error (MSE) minimization [24, 25] for the offset and scale parameters leads to significantly higher stability in final performance than min-max initialization. We also compare this initialization scheme with the one proposed in [7].

In summary, our proposed method, called LSQ+, extends LSQ [7] by adding a simple yet effective learnable offset parameter for activation quantization to recover the accuracy lost on architectures employing Swish-like activations. Our second contribution is to show the importance of proper initialization for stable training, especially in the low-bit regime.

2 Related Work

A good overview of the basics of quantization is given in [16], where the differences between asymmetric and symmetric quantization are explained. In general, we can classify quantization methods into post-training methods that work without fine-tuning and quantization-aware training methods that need fine-tuning.

Post-training quantization methods [2, 31, 6] optimize neural networks for quantization without full training, using only a small amount of data; [20, 4] go further and require no data at all. Although these methods work well for typical 8-bit quantization, they are not able to achieve good accuracy for very low-bit (2-, 3-, 4-bit) quantization.

Quantization-aware training generally outperforms these methods on low-bit tasks, given enough time to optimize. Simulated quantization-aware training methods and improvements for these are discussed in [9, 12, 17]. Essentially, operations are added to the neural network computational graph that simulate how quantization would be done on an actual device. Several recent papers improve over these methods by learning the quantization parameters, e.g. QIL [14], TQT [13] and LSQ [7]. This is the approach we build upon in our paper, although the asymmetric quantization scheme and initialization we suggest could also be used with other methods.

In a parallel line of research, some works [15, 18, 21] have applied knowledge distillation to quantization, resulting in improved performance. Also, some recent work [28] learns the bit-width alongside the quantization ranges. Note that our proposed method is orthogonal to these works and can thus be used jointly with them. Lastly, several papers have introduced quantization grids other than the uniform one we use. In [19] and [29], a logarithmic or fully free-format quantization space is used to quantize the network. We do not consider such grids in this paper, as their hardware implementations are inefficient, requiring costly lookup tables or approximations at runtime.

3 Method

In LSQ [7], a symmetric quantization scheme with a trainable scale parameter s is proposed for both weights and activations. This scheme is defined as follows:

    x̄ = ⌊clamp(x/s, n, p)⌉,    x̂ = x̄ × s    (1)

where ⌊·⌉ indicates the round function and clamp(·, n, p) clamps all values between n and p. x̄ and x̂ denote the coded bits and the quantized values, respectively. LSQ can make use of a signed or an unsigned quantization range. However, both are suboptimal for activation functions like Swish or Leaky-ReLU, which have skewed negative and positive ranges (for example, the negative portion of the Swish activation lies only between approximately −0.28 and 0, whereas the positive portion is unbounded). Using an unsigned quantization range, i.e. n = 0, p = 2^b − 1, clamps all negative activations to zero, leading to a significant loss of information. On the contrary, using a signed quantization range, i.e. n = −2^(b−1), p = 2^(b−1) − 1, quantizes all negative activations to integers in [−2^(b−1), 0) and all positive activations to [0, 2^(b−1) − 1], hence giving equal importance to the negative and positive portions of the activation function. However, this loses valuable precision for skewed distributions where the positive dynamic range is significantly larger than the negative one. In Sec. 4.1, we will show that both quantization schemes lead to a significant loss in accuracy when quantizing architectures with Swish activations.

The proposed method, LSQ+, solves the above-mentioned problem with a more general learnable asymmetric quantization scheme for the activations, described in Sec. 3.1. Sec. 3.2 describes the initialization scheme used in LSQ+.

3.1 Learnable asymmetric quantization

As a solution to the above-mentioned problem, we propose a general asymmetric activation quantization scheme where not only the scale parameter but also an offset parameter is learned during training to handle skewed activation distributions:

    x̄ = ⌊clamp((x − β)/s, n, p)⌉,    x̂ = x̄ × s + β    (2)

Here, the offset parameter β and the scale s are both learnable. The gradient of x̂ with respect to the scale s is calculated using:

    ∂x̂/∂s = ⌊(x − β)/s⌉ − (x − β)/s   if n < (x − β)/s < p
    ∂x̂/∂s = n                         if (x − β)/s ≤ n
    ∂x̂/∂s = p                         if (x − β)/s ≥ p    (3)

And the gradient of x̂ with respect to the offset β is calculated using:

    ∂x̂/∂β = 0   if n < (x − β)/s < p
    ∂x̂/∂β = 1   otherwise    (4)

In both (3) and (4), the straight-through estimator (STE) [3] is used to approximate the gradient of the round function, i.e. ∂⌊v⌉/∂v ≈ 1, when computing ∂x̂/∂s and ∂x̂/∂β.
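For concreteness, below is a minimal PyTorch sketch of the asymmetric fake-quantizer in (2). It is our own illustration rather than the authors' code (class and variable names are ours, and LSQ's gradient scaling of the learnable parameters is omitted for brevity); applying autograd to this forward pass recovers the gradients (3) and (4) through the straight-through estimator.

```python
import torch
import torch.nn as nn


def round_ste(v):
    # Round in the forward pass; pass the gradient straight through in the backward pass.
    return (v.round() - v).detach() + v


class AsymmetricFakeQuant(nn.Module):
    """Sketch of the asymmetric fake-quantizer of Eq. (2) for activations."""

    def __init__(self, n_bits=4, unsigned=True):
        super().__init__()
        # Quantization range: Config 4 (unsigned) or Config 3 (signed) of Table 1.
        self.n = 0 if unsigned else -(2 ** (n_bits - 1))
        self.p = 2 ** n_bits - 1 if unsigned else 2 ** (n_bits - 1) - 1
        self.s = nn.Parameter(torch.tensor(1.0))     # scale, re-initialized as in Sec. 3.2.2
        self.beta = nn.Parameter(torch.tensor(0.0))  # offset, re-initialized as in Sec. 3.2.2

    def forward(self, x):
        v = (x - self.beta) / self.s
        x_bar = round_ste(v.clamp(self.n, self.p))   # coded bits, Eq. (2)
        return x_bar * self.s + self.beta            # de-quantized output, Eq. (2)
```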

For weight quantization, we use symmetric signed quantization (1), since the layer weights are empirically observed to be distributed symmetrically around zero. Because of this, asymmetric quantization of activations has no additional cost during inference compared to symmetric quantization, since the additional offset term can be precomputed and incorporated into the layer bias at compilation time:

    wᵀx̂ + b = wᵀ(s·x̄ + β) + b = s·(wᵀx̄) + (β·wᵀ1 + b)    (5)

where w denotes the layer weights and b the layer bias; the term β·wᵀ1 + b is independent of the input and can be folded into the bias.
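As a numerical sanity check of (5), the short snippet below (variable names are ours, values arbitrary) verifies that folding the offset term into the bias gives the same layer output as de-quantizing the activations explicitly:

```python
import torch

torch.manual_seed(0)
s, beta = 0.05, -0.27                            # example activation scale and offset
w = torch.randn(64, 128)                         # weights of a linear layer
b = torch.randn(64)                              # original bias
x_bar = torch.randint(0, 16, (128,)).float()     # 4-bit unsigned activation codes

# Explicit de-quantization followed by the linear layer
out_explicit = w @ (s * x_bar + beta) + b

# Offset folded into the bias at "compile time", as in Eq. (5)
b_folded = b + beta * w.sum(dim=1)
out_folded = s * (w @ x_bar) + b_folded

assert torch.allclose(out_explicit, out_folded, atol=1e-4)
```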
Configuration                              s           β           (n, p)
Config 1 : Unsigned + Symmetric (LSQ)      trainable   N/A         (0, 2^b − 1)
Config 2 : Signed + Symmetric              trainable   N/A         (−2^(b−1), 2^(b−1) − 1)
Config 3 : Signed + Asymmetric             trainable   trainable   (−2^(b−1), 2^(b−1) − 1)
Config 4 : Unsigned + Asymmetric           trainable   trainable   (0, 2^b − 1)
Table 1: Different possible parametrizations for LSQ+’s learnable asymmetric quantization scheme

Table 1 shows four possible parametrizations for the proposed quantization scheme in (2). Configurations 1 and 2 do not use an offset parameter, hence following the learnable symmetric quantization scheme proposed in LSQ [7]. Since Configuration 1 uses an unsigned range with this symmetric quantization scheme, it corresponds exactly to the parametrization proposed in LSQ for activation quantization. Configurations 3 and 4 learn both the scale and offset parameter for activation quantization, the only difference being signed and unsigned quantization ranges. We will analyze these different parametrizations in the experiments section.

3.2 Initialization of quantization parameters

As we enter the extremely low bit-width regime with gradient-based learnable quantization methods, the final performance after training becomes highly sensitive to the initialization of the quantization hyperparameters. This sensitivity problem is amplified in the presence of depthwise separable convolutions, which are known to be challenging to quantize [30]. In this work, we propose an initialization scheme for the scale and offset parameters that achieves significantly more stable, and sometimes better, performance than other initializations proposed in the literature [12, 7], while keeping the quantization configuration unchanged.

3.2.1 Scale initialization for weight quantization

As mentioned before, we use signed symmetric quantization for the weights (similar to Configuration 2) in our method; hence, no offset is used for weight quantization. LSQ [7] proposes initializing the scale parameter with the square-root-normalized average absolute value of the layer weights, i.e. s_init = 2⟨|w|⟩/√p. This leads to a very large initialization for 2-, 3- or 4-bit quantization, e.g. s_init = 2⟨|w|⟩/√7 for the 4-bit case. In our experiments, this initialization was observed to be far from the converged values of the scale parameters. One instance of this phenomenon is shown in Figure 1.

We fix this problem by using the statistics of the weight distribution rather than the actual weight values for the initialization. Similar to [20], we use a Gaussian approximation for the weight distribution in each layer. Following this, we initialize the scale parameter for each layer by:

    s_init = max(|μ − 3σ|, |μ + 3σ|) / 2^(b−1)

where μ and σ are the mean (close to 0 in practice) and standard deviation of the weights in that layer.
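A one-function sketch of this statistics-based weight-scale initialization (the helper name is ours, assuming the formula above):

```python
import torch


def init_weight_scale(weight, n_bits):
    # Gaussian approximation of the layer's weight distribution (Sec. 3.2.1):
    # cover roughly mu +/- 3*sigma with the signed symmetric grid of 2^(b-1) levels per side.
    mu, sigma = weight.mean(), weight.std()
    return torch.max((mu - 3 * sigma).abs(), (mu + 3 * sigma).abs()) / 2 ** (n_bits - 1)
```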

Figure 1: The scale parameter of the weight quantizer in the blocks.1.conv.0 layer of EfficientNet-B0, before and after fine-tuning, with LSQ and LSQ+ initializations. For both experiments, we used Configuration 4 for activation quantization. As shown, the LSQ initialization of the scale is further from the converged value than the LSQ+ initialization. More on the effects of initialization in Sec. 4.3.

3.2.2 Scale/offset initialization for activation quantization

Let x_min and x_max denote the min and the max value of the activation function; for example, x_min = 0 for ReLU and x_min ≈ −0.28 in the case of Swish (for unbounded activation functions, e.g. the positive portion of Swish, x_max can be estimated from a few forward passes). Intuitively, the quantization range is fully utilized when x_min is quantized to the lower bound of the range and x_max to the upper bound. Following this intuition, an initialization for s and β would satisfy:

    x_min = s_init × n + β_init,    x_max = s_init × p + β_init    (6)

Solving these constraints yields:

    s_init = (x_max − x_min)/(p − n),    β_init = x_min − s_init × n    (7)

However, the above initialization is highly prone to outliers in the activation distribution, especially since the activation ranges are dynamic. To overcome this, we propose initializing the scale and offset parameters per layer by solving an MSE minimization problem, similar to [24, 25]:

    s_init, β_init = argmin_{s, β} ‖ x̂(s, β) − x ‖²_F    (8)

where x̂ is given by (2). There is no closed-form solution to (8); hence, we embed equations (3) and (4) into PyTorch’s autograd functionality and optimize for s_init and β_init over a few batches of data.
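A possible sketch of this optimization, reusing the AsymmetricFakeQuant module sketched in Sec. 3.1 (the optimizer choice, learning rate and iteration counts below are our assumptions, not the paper's):

```python
import torch


def mse_init(quantizer, calib_batches, lr=1e-3, iters_per_batch=20):
    # Fit (s, beta) of the sketched fake-quantizer by minimizing the
    # reconstruction error of Eq. (8) over a few calibration batches.
    opt = torch.optim.Adam([quantizer.s, quantizer.beta], lr=lr)
    for x in calib_batches:              # pre-activation inputs collected from forward passes
        for _ in range(iters_per_batch):
            opt.zero_grad()
            loss = ((quantizer(x) - x) ** 2).mean()
            loss.backward()              # gradients follow Eqs. (3) and (4) via the STE
            opt.step()
    return quantizer.s.detach(), quantizer.beta.detach()
```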

Method W2A2 W3A3 W4A4
Config 1 : LSQ (Unsigned + Symmetric) 43.5% 67.5% 71.9%
Config 2 : Signed + Symmetric 23.7% 54.8% 68.8%
Config 3 : Signed + Asymmetric 49.1% 69.9% 73.5%
Config 4 : Unsigned + Asymmetric 48.7% 69.3% 73.8%
Table 2: Comparison of all configurations of quantization with EfficientNet-B0 (FP accuracy: 76.1%)
Method W2A2 W3A3 W4A4
Config 1 : LSQ (Unsigned + Symmetric) 39.9% 64.3% 70.4%
Config 2 : Signed + Symmetric 23.4% 62.1% 67.2%
Config 3 : Signed + Asymmetric 42.5% 66.7% 71.6%
Config 4 : Unsigned + Asymmetric 42.8% 66.1% 71.7%
Table 3: Comparison of all configurations of quantization with MixNet-S (FP accuracy: 75.9%)
Method W2A2 W3A3 W4A4
PACT [5] 64.4% 68.1% 69.2%
DSQ [8] 65.2% 68.7% 69.6%
QIL [14] 65.7% 69.2% 70.1%
Config 1 : LSQ (Unsigned + Symmetric) 66.7% 69.4% 70.7%
Config 2 : Signed + Symmetric 64.7% 66.1% 69.2%
Config 3 : Signed + Asymmetric 66.7% 69.4% 70.7%
Config 4 : Unsigned + Asymmetric 66.8% 69.3% 70.8%
Table 4: Comparison of all configurations of quantization with ResNet18 (FP accuracy: 70.1%)

4 Experiments

We evaluate the effectiveness of our method by quantizing architectures with Swish activations to W2A2, W3A3 and W4A4. To the best of our knowledge, ours is the first work to quantize such architectures to extremely low bit-widths. As a sanity check, we show that LSQ+ also maintains the performance of LSQ [7] on traditional architectures with the ReLU activation function. Finally, we show the effect of using different initializations on the performance of the proposed quantization method. All experiments are performed on the ImageNet [23] dataset.

In all configurations and all experiments, the weight parameters are initialized with the pretrained floating point weights of the deep network. Although we will compare the effectiveness of different initializations for the scale/offset parameters in Sec 4.3, we use our proposed initialization from Sec 3.2 for experiments in sections 4.1 and 4.2.

4.1 Results on Swish activation

Tables 2 and 3 show the impact of quantization with all configurations of the proposed method on EfficientNet-B0 [26] and MixNet-S [27], respectively. MixNet-S uses the ReLU activation in the first 3 layers and the Swish activation in the rest of the layers. By using the learnable offset parameter, we observe a 1.6–1.9% and 1.2–1.3% performance improvement for W4A4 quantization on EfficientNet-B0 and MixNet-S, respectively (compare Configurations 3 and 4 with Configuration 1, i.e. LSQ). This performance improvement from our learnable asymmetric quantization scheme is most prominent in the case of W2A2 quantization.

The performance of Configuration 3 (signed range + learnable offset) and Configuration 4 (unsigned range + learnable offset) is almost identical for all bit-widths. This is because the learned offset maps the activation range appropriately onto the quantization range, irrespective of whether that range is signed or unsigned.

Another interesting observation is that Configuration 2 performs consistently worse than all other configurations. Due to the lack of an offset parameter, only half of the quantization levels are utilized by the positive part of the activation range, while the positive portion of the Swish activation is much larger than the negative portion, as mentioned in Section 3. Hence, compared to Configurations 3 and 4, which allocate all quantization levels to the actually occupied activation range, Configuration 2 makes poor use of its quantization range, leading to worse performance.

4.2 Results on ReLU activation

The results on ResNet shown in the LSQ paper [7] use the pre-activation version of the ResNet architecture [10], which has a higher top-1 ImageNet accuracy than the standard ResNets. Hence, for a fair comparison with other state-of-the-art methods, we run our own implementation of LSQ (Configuration 1) and all other configurations on the standard ResNets. Table 4 shows the quantization performance of all configurations of the proposed method on ResNet18. Our implementation of LSQ (Configuration 1) achieves 70.7% accuracy with W4A4 quantization, which is higher than the full-precision accuracy of 70.1%. This sanity check shows that our LSQ results are on par with the original LSQ paper [7]. Also, Configurations 1, 3 and 4 outperform existing state-of-the-art methods, namely PACT [5], DSQ [8] and QIL [14]. It is worth noting that, unlike for EfficientNet and MixNet, there is almost no performance gap between Configurations 1, 3 and 4 when quantizing ResNet18. We attribute this to the fact that the ReLU activation function has no negative component.

4.3 Effect of quantization parameter initialization

In this section, we compare three schemes for initializing the quantization scale and offset parameters. Since we use symmetric quantization for weights, no offset is used for weight quantization. Also, Configurations 1 and 2 for activation quantization do not use an offset. The three compared initialization methods are as follows:

  1. Min-max initialization. We use the minimum and maximum values of each layer’s weights and activations (obtained over the first batch of input images) to initialize the quantization scale and offset parameters. This initialization scheme is formalized in (7).

  2. LSQ initialization. The scale for both weight quantization and activation quantization is initialized as s_init = 2⟨|v|⟩/√p, where v denotes the layer weights or activations and p is the upper bound of the quantization range.

  3. LSQ+ initialization. We initialize the weight quantization and activation quantization parameters as proposed in Sec. 3.2.

For the experiments, we quantize EfficientNet-B0 using Configuration 4 and perform multiple training runs with each of these initialization methods. Table 5 shows the variation in final performance across 5 training runs with each of these initializations. We observe a high instability in the final performance with W2A2 quantization, especially with min-max initialization. This is because the tail of the weight or activation distribution can easily skew the scale parameter initialization with 2-bit quantization. The LSQ initialization, which initializes the weight quantization scale with the square-root-normalized mean absolute value, also shows a higher variation in training performance. This is because LSQ initialization leads to a large value of the scale parameter that is far from its converged value, as was shown in Figure 1.

Quantization Parameter Initialization    W4A4    W2A2
Min-max    %    %
LSQ    %    %
LSQ+    %    %
Table 5: Variation around mean accuracy across 5 training runs for EfficientNet quantization using Config 4 with different initializations. Note: the other tables show the best accuracy after a grid search on hyperparameters, which is different from the mean accuracy.

5 Discussion

5.1 Learned offset values

It is interesting to observe the layer-wise offset values learned by the network. Figure 2 shows one such example with Configuration 4 for W4A4 quantization of EfficientNet-B0. Note that an offset is not used for quantizing the squeeze-excite layers, because the sigmoid activation function has no negative component. Also, there is no activation applied at the end of a bottleneck block in EfficientNet; hence we use symmetric signed quantization for those activation layers. These layers are not shown in the plot.

Figure 2: Layer-wise β values after convergence for EfficientNet-B0

We can observe that most of the β values are negative, meaning that the activations are shifted “up” before being scaled and clamped to the quantization range. This shows that the quantization layers learn to accommodate the negative activation values. None of the learned β values is lower than the minimum value x_min of the Swish activation function (red dotted line). This is because, from (4), ∂x̂/∂β is zero for all activations that fall inside the quantization range; once β ≤ x_min, no activation falls below the lower bound of the unsigned range of Configuration 4 (n = 0). Hence, the gradient for β from the lower clipping becomes zero as soon as β ≤ x_min.

5.2 Learned vs Fixed offset

On further inspection of Figure 2, the learned offset for most layers lies away from the Swish minimum value. This is because representing the entire activation range with the quantization grid (cf. (6)) leads to a coarser representation, since the number of bits is fixed, and thus to a higher quantization error. The purpose of learning the s and β values is to learn this trade-off between the resolution of the quantization grid and the proportion of the activation range represented by the grid. Hence, the learned β values are not exactly equal to the minimum value of the activation function. One might still wonder about the performance achieved when β is fixed to x_min for each layer; Table 6 shows the difference in performance between the fixed- and learned-offset methods.

Method    W4A4
Fixed β = 0 (LSQ)    %
Fixed β = x_min    %
Learned β    %
Table 6: Performance difference between fixed and learned offsets for EfficientNet quantization at W4A4 using Config 4

6 Conclusion

In this work, targeting the low-bit quantization domain, we address two problems: (1) quantization of deep neural networks whose activation functions also produce negative values, and (2) stability of the final performance with respect to the initialization of the quantization parameters. For the first, we propose a general asymmetric quantization scheme with trainable scale and offset parameters that can learn to accommodate the negative activations without using an extra sign bit; in (5), we show that such asymmetric quantization of activations incurs zero runtime overhead. Our work is the first to quantize modern efficient architectures like EfficientNet and MixNet to extremely low bit-widths, and we show that LSQ+ significantly improves the performance of 2-, 3- and 4-bit quantization on these architectures. Our experiments with the traditional ReLU-based ResNet18 architecture show that LSQ+ can be used instead of LSQ everywhere without hurting performance. Finally, we show that using an MSE-minimization-based initialization scheme for the activation quantization parameters leads to more stable performance, which is of high importance for low-bit quantization-aware training.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • [2] R. Banner, Y. Nahshan, and D. Soudry (2019) Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pp. 7948–7956.
  • [3] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • [4] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) ZeroQ: a novel zero shot quantization framework. arXiv preprint arXiv:2001.00281.
  • [5] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
  • [6] Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev (2019) Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3009–3018.
  • [7] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153.
  • [8] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4852–4861.
  • [9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1737–1746.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
  • [11] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324.
  • [12] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] S. R. Jain, A. Gural, M. Wu, and C. H. Dick (2019) Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066.
  • [14] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
  • [15] J. Kim, Y. Bhalgat, J. Lee, C. Patel, and N. Kwak (2019) QKD: quantization-aware knowledge distillation. arXiv preprint arXiv:1911.12491.
  • [16] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
  • [17] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling (2019) Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR).
  • [18] A. Mishra and D. Marr (2017) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
  • [19] D. Miyashita, E. H. Lee, and B. Murmann (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
  • [20] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334.
  • [21] A. Polino, R. Pascanu, and D. Alistarh (2018) Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
  • [22] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Searching for activation functions. arXiv preprint arXiv:1710.05941.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [24] S. Shin, Y. Boo, and W. Sung (2017) Fixed-point optimization of deep neural networks with adaptive step size retraining. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1203–1207.
  • [25] W. Sung, S. Shin, and K. Hwang (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488.
  • [26] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  • [27] M. Tan and Q. V. Le (2019) MixConv: mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595.
  • [28] S. Uhlich, L. Mauch, F. Cardinaux, K. Yoshiyama, J. A. Garcia, S. Tiedemann, T. Kemp, and A. Nakamura (2020) Mixed precision DNNs: all you need is a good parametrization. In International Conference on Learning Representations (ICLR).
  • [29] K. Ullrich, E. Meeds, and M. Welling (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR).
  • [30] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
  • [31] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang (2019) Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552.