1 Introduction
Deep learning models require a very large amount of resources during training (repeated forward and backward propagation) as well as inference (forward propagation). Further, the latter is often performed on edge devices, such as smartphones or embedded systems, which operate within strict size, temperature and power budgets Shimpi (2011); Humrick (2017); Dolbeau (2018); NVIDIA (2018). As a result, these devices can perform only a limited number of operations per second [1], as illustrated in Tab. 1.

[1] The peak for ARM Cortex is based on fp32 “VMLA.F32 Qd, Qn Dm”, fp16 “VMLA.F16 Qd, Qn Dm” and int8 “VMLAL.S8 Qd, Dn, Dm” instructions, with estimated reciprocal throughput and width for A7, reciprocal throughput and width for A75. Then, peak ops are defined as (frequency/throughput)*width*cores. Also, the power is assumed to be 750mW and 1W per core for ARM Cortex A7 and A75, respectively.

Tab. 1
CPU/GPU  GHz  Watts  fp32  fp16  int8
ARM Cortex A7 (2core)  1.5  1.5  3  12
ARM Cortex A75 (4core)  3.0  4  48  96  96
NVIDIA Turing Tesla T4  1.35  70  8100  16200  130000

In order to decrease the storage and compute requirements of the model during inference, its parameters are often stored as integers with a low number of bits. It is common to use 8-bit integers (1 Byte), rather than 16-bit (2 Bytes) or 32-bit (4 Bytes) floating point numbers. The process of converting model parameters from “continuous” floating point to discrete integer numbers is called quantization.
Let the original optimization problem be
$$\min_{w} \; \mathcal{L}(w) \qquad (1)$$
where $\mathcal{L}$ is the loss measured during training. There are many different quantization schemes based on symmetric vs. asymmetric intervals, uniform vs. nonuniform discrete partitioning, different rounding modes and choices for handling outliers, i.e. the few elements that lie outside of the range of most of the other elements.
Let us consider a uniform quantization of the parameter weights $w$, for example from either the convolution (14) or fully connected (15) layers, and activations. Then, the quantized problem is commonly written as
$$\min_{w} \; \mathcal{L}(q(w)) \qquad (2)$$
where $q$ is the quantization function.
For instance, if we use only two intervals then the process is referred to as binarization and the resulting element can be stored in a single bit Courbariaux et al. (2015, 2016); Hubara et al. (2018); Rastegari et al. (2016). It can be performed using a single threshold point as shown below
(3) 
The ternary networks use three intervals with resulting elements stored in 2 bits Li et al. (2016); Mellempudi et al. (2017); Choi et al. (2018). Then, quantization can be performed using two threshold points and resulting in
(4) 
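The threshold-based schemes above can be illustrated with a minimal Python sketch. The function names, the threshold arguments `t`, `t_lo` and `t_hi`, their default values, and the output levels {-1, +1} and {-1, 0, +1} are assumptions made for illustration (the levels follow a common convention for binary and ternary networks) and are not necessarily the exact form of (3) and (4).

```python
def binarize(w, t=0.0):
    """Two intervals, one threshold t (hypothetical default): 1 bit per element."""
    return 1 if w >= t else -1


def ternarize(w, t_lo=-0.5, t_hi=0.5):
    """Three intervals, two thresholds (hypothetical defaults): 2 bits per element."""
    if w < t_lo:
        return -1
    if w > t_hi:
        return 1
    return 0


weights = [-0.9, -0.2, 0.0, 0.3, 0.8]
binary = [binarize(w) for w in weights]    # every value lands on -1 or +1
ternary = [ternarize(w) for w in weights]  # small magnitudes collapse to 0
```

In both cases the thresholds are fixed before the mapping is applied, which is the property the periodic regularizers discussed later avoid.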
Finally, let an arbitrary number of bits correspond to points and intervals. Let us assume that we would like to quantize floating point number , with and length . A uniform quantization can be performed symmetrically in the interval using multiplier , so that
(5) 
with effective points because is double counted.
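A minimal pure-Python sketch of symmetric uniform quantization follows. It mirrors the multiply, round, multiply-back pattern of the quantization snippet in Section 3; the function names and the choice of 2^(bits-1) - 1 positive levels are assumptions made for illustration, and the input is assumed to contain at least one nonzero value.

```python
def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    c = max(abs(w) for w in weights)   # largest magnitude, assumed nonzero
    levels = 2 ** (bits - 1) - 1       # e.g. 7 for 4 bits, 127 for 8 bits
    scale = levels / c                 # the multiplier
    q = [round(w * scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to the floating point range."""
    return [v / scale for v in q]


weights = [-1.0, -0.26, 0.0, 0.4, 0.75]
q, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(q, scale)
```

Because the interval is centered at the origin, an exact zero in the input stays an exact zero after the round trip, which is what preserves sparsity.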
On the other hand, notice that we can shift the interval to the interval located around by adding a scalar bias term . Therefore, uniform asymmetric quantization can be performed with bias using multiplier , so that
(6) 
where the round operation rounds a floating point number to an integer value Wen et al. (2016); Jacob et al. (2017); Krishnamoorthi (2018).
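The asymmetric case can be sketched in the same way. The version below is a generic affine scheme and only an assumption about the details: it takes the bias to be the interval minimum, uses an unsigned output range [0, 2^bits - 1], and assumes the input is not constant. The function name is hypothetical.

```python
def quantize_asymmetric(weights, bits):
    """Shift by a bias, then scale and round to integers in [0, 2^bits - 1]."""
    lo, hi = min(weights), max(weights)   # assumed lo < hi
    scale = (2 ** bits - 1) / (hi - lo)   # the multiplier
    q = [round((w - lo) * scale) for w in weights]
    return q, scale, lo


# All values are positive, so a symmetric interval would waste its negative half.
weights = [0.5, 1.0, 1.3, 2.0]
q, scale, bias = quantize_asymmetric(weights, bits=2)
restored = [v / scale + bias for v in q]
```

Here every one of the four available codes is used, whereas a zero-valued input would in general no longer map back to an exact zero.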
The advantage of symmetric quantization is that for sparse parameters, with a lot of zero elements, the sparsity is preserved. Note that computation with zeroes can be skipped in hardware Albericio et al. (2016); Venkatesh et al. (2016); Reagen et al. (2016); Chen et al. (2017); Kim et al. (2017); Parashar et al. (2017). The disadvantage is that for highly asymmetric intervals many discrete representations may be wasted.
The nonuniform quantization assigns discrete points to the interval based on the distribution of floating point values in it Bagherinezhad et al. (2017); Wang et al. (2018). Therefore, it does not have a fixed stride from one point to the next. Its advantage is that the encoded values are more representative of the original ones, but at the same time it can be hard to map back and perform operations with them.
The techniques for handling outliers and determining maximum thresholds, e.g. using adaptive schemes, loss of information measured by Kullback–Leibler (KL) divergence, or L2 error minimization in Caffe2, have been investigated in Jia et al. (2014); Migasz (2017); Zhou et al. (2017); Park et al. (2018).
However, independent of all of these choices, notice that a common trend among (3)-(6) is that fixed thresholds are used in the quantization function to clamp floating point values to discrete points. We point out that the matrix and neural network-based compression techniques are outside the scope of this paper Gong et al. (2014); Denton et al. (2014); Jaderberg et al. (2014); Mishra & Marr (2018).
In this paper we will focus on a very different approach for uniform quantization using periodic functions, such as trigonometric sine (or cosine) as well as hat functions. We discuss uniform quantization, but our ideas can be generalized to the nonuniform case using variations of these periodic functions with decaying amplitude and increasing base lengths away from the origin Stenger (1993); Strang & Fix (2008).
2 Periodic Functions as Regularizers
We propose an unorthodox approach for quantizing the weights of a neural network. Instead of using a quantization function $q$, we propose adding a regularization term $R(w)$ to the loss $\mathcal{L}(w)$, so that the resulting optimization problem is written as
$$\min_{w} \; \mathcal{L}(w) + \lambda R(w) \qquad (7)$$
where $\lambda$ is a scalar scaling parameter.
The regularization term is a sum of periodic functions that push the values of the weights (and potentially activations) to a set of discrete points during training. Next we will discuss different choices for these functions.
2.1 Trigonometric (Continuous) Functions
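Let us focus only on the weights and use the trigonometric sine, so that
$$R(w) = a \sum_{i} \sin^{2}\left(\pi f \, \frac{w_i}{c}\right) \qquad (8)$$
where $a$ is the amplitude, $f$ is the frequency and $c = \max_i |w_i|$ is the maximum weight in absolute value as defined in (5).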
Notice that for frequency=1 the function R(w) attains its minimum of 0.0 when the weight values are distributed at 3 discrete locations, while for frequency=7 it attains its minimum of 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 0(a) and 0(b), respectively.
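The counts of 3 and 15 locations can be checked numerically with a small pure-Python sketch. It assumes the regularizer has the sine-squared form used in the PyTorch snippet of Section 3, with weights normalized by their largest magnitude; the function names are hypothetical.

```python
import math


def sine_regularizer(weights, amplitude, frequency):
    """Sum of amplitude * sin^2(pi * frequency * w / c), with c = max |w|."""
    c = max(abs(w) for w in weights)
    return sum(amplitude * math.sin(math.pi * frequency * w / c) ** 2
               for w in weights)


def grid_points(frequency):
    """Normalized locations k / frequency for k = -frequency, ..., frequency."""
    return [k / frequency for k in range(-frequency, frequency + 1)]


# Weights sitting exactly on the grid give a (numerically) zero penalty.
on_grid = sine_regularizer(grid_points(7), amplitude=1.0, frequency=7)
# A weight halfway between two grid points gives the maximum penalty per element.
off_grid = sine_regularizer([-1.0, 0.5 / 7, 1.0], amplitude=1.0, frequency=7)
```

The grid has 2 * frequency + 1 entries, which gives 3 locations for frequency 1 and 15 for frequency 7.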
2.2 Hat (Non-Continuous) Functions
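Let us focus only on the weights and use a hat function, so that
$$R(w) = a \sum_{i} \left| \, 2 \left( f \left( \frac{w_i}{c} - \frac{1}{2} \right) \bmod 1 \right) - 1 \, \right| \qquad (9)$$
where $a$ is the amplitude, $f$ is the frequency and $c = \max_i |w_i|$ is the maximum weight in absolute value as defined in (5).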
Once again, notice that for frequency=1 the function R(w) attains its minimum of 0.0 when the weight values are distributed at 3 discrete locations, while for frequency=7 it attains its minimum of 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 2(a) and 2(b), respectively. Here we show a variant of the hat function corresponding to sine, while a shifted variant corresponding to cosine is also possible.
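The same check can be made for the hat function. The sketch below assumes the hat expression is |2 * ((frequency * (x - 1/2)) mod 1) - 1| with x = w / c, which is how the hat line of the PyTorch snippet in Section 3 is read here (its missing operators are assumed to be subtractions); the function names are hypothetical.

```python
def hat(x, frequency):
    """Hat value for a normalized weight x = w / c (assumed form, see above)."""
    return abs((((x - 0.5) * frequency) % 1.0) * 2.0 - 1.0)


def hat_regularizer(weights, amplitude, frequency):
    """Sum of amplitude * hat(w / c), with c = max |w|."""
    c = max(abs(w) for w in weights)
    return sum(amplitude * hat(w / c, frequency) for w in weights)


grid = [k / 7 for k in range(-7, 8)]     # the 15 locations for frequency 7
on_grid = hat_regularizer(grid, amplitude=1.0, frequency=7)
halfway = hat(0.5 / 7, 7)                # midpoint between two grid points
```

With this form the zeros fall on k / frequency only when the frequency is odd, as is the case for the values 1 and 7 used in the text. Unlike the sine, the slope has constant magnitude everywhere between a zero and its neighboring peak.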
Notice that the use of regularization for the purpose of quantization has been suggested in Hung et al. (2015). However, in this earlier work the authors use distance from fixed points (centroids) as a penalty measure to ensure quantization. This contrasts with our periodic trigonometric sine and hat functions, with amplitude and frequency hyperparameters defined in (8) and (9), respectively.
It is important to highlight a few differences between sine (or cosine) and hat functions. Notice that the sine function has several useful properties: it is periodic, continuous and differentiable. However, it is not convex, unlike many of the existing regularizers. Also, notice that the maximum value of the regularizer R(w) is known ahead of time. It can be computed by assuming that all weights translate into value 1.0 after application of the function. Then, the regularizer can be scaled by a constant , such that R(w) . This can be used to facilitate and in fact define the regularizer scaling in (7), therefore reducing the number of hyperparameters.
Also, the sine function has a gradient that is zero (or close to zero) in the neighborhood of points where it attains its minimum and maximum values. This property might make escaping the maximum or approaching the minimum slow in their respective neighborhoods. On the other hand, the hat function is nonconvex and has a discontinuous derivative, with a constant gradient towards the minimum except at the points where it attains its minimum and maximum values, where the gradient does not exist. These tradeoffs might guide the choice between these functions, in a way similar to the choice between Sigmoid and ReLU activation functions.
Finally, notice that amplitude can be changed adaptively during the training procedure, which allows us to obtain higher test accuracy, as will be shown in the experiments section. The frequency can also be varied during training, but these experiments are outside of the scope of this paper.
2.3 From Bits to Frequency and Vice Versa
In practice we are interested in selecting the number of bits to be used for quantization. For the sine and associated hat function the frequency $f$ corresponding to the number of bits $b$ can be found by using
$$f = 2^{b-1} - 1 \qquad (10)$$
and vice versa
$$b = \log_2(f + 1) + 1 \qquad (11)$$
so that frequency 1 implies 2 bits, while frequency 7 implies 4 bits.
On the other hand, for cosine the frequency $f$ corresponding to the number of bits $b$ can be found by using
$$f = 2^{b-1} \qquad (12)$$
and vice versa
$$b = \log_2(f) + 1 \qquad (13)$$
For instance, frequency 1 implies 1 bit, while frequency 8 implies 4 bits, and so on.
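The two conversions can be written as small helper functions. The formulas used below are an assumption chosen to be consistent with the worked examples in the text (sine: frequency 1 for 2 bits and 7 for 4 bits; cosine: frequency 1 for 1 bit and 8 for 4 bits), and the function names are hypothetical.

```python
import math


def sine_frequency_from_bits(bits):
    """Sine/hat: 2 * f + 1 zero locations cover the 2^bits - 1 effective points."""
    return 2 ** (bits - 1) - 1


def sine_bits_from_frequency(frequency):
    return int(math.log2(frequency + 1)) + 1


def cosine_frequency_from_bits(bits):
    """Cosine: 2 * f zero locations cover 2^bits points."""
    return 2 ** (bits - 1)


def cosine_bits_from_frequency(frequency):
    return int(math.log2(frequency)) + 1


sine_table = {b: sine_frequency_from_bits(b) for b in (2, 4, 8)}
cosine_table = {b: cosine_frequency_from_bits(b) for b in (1, 4, 8)}
```

Under this assumption an 8-bit target corresponds to frequency 127 for the sine and hat functions and 128 for cosine.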
3 Experiments
In this section we will investigate the accuracy of ResNet20 on CIFAR10 and ResNet50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). We will compare the test error achieved by the original and quantized models with loss function defined in (1) and (7), respectively. The regularization term we add to the loss in (7) relies on periodic functions: trigonometric sine in (8) and hat in (9). It can be computed using the following PyTorch Paszke et al. (2017) code snippet

    import math
    import torch
    import torch.nn as nn

    def periodic_regularization(model, amplitude, frequency, use_hat=False):
        # use_hat selects between the two alternatives shown in the text
        total = 0
        for m in model.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                w = m.weight
                ic = 1 / w.abs().max()  # inverse of the largest weight magnitude
                if use_hat:
                    # hat function
                    r = torch.abs(((((w * ic) - 0.5) * frequency) % 1) * 2 - 1)
                else:
                    # trigonometric sine
                    r = torch.pow(torch.sin(math.pi * frequency * (w * ic)), 2)
                total = total + torch.sum(amplitude * r)
        return total

The training is performed using batch size 256 with default 100 epochs for CIFAR10 and 90 epochs for ImageNet dataset. We use a fixed schedule that adjusts the amplitude hyperparameter every
epochs. We start with a small amplitude, such as , and progressively adjust it until it reaches, say after typical epochs of training. Notice that amplitude subsumes the scaling hyperparameter in (7), which is always set to 1.0. Note that, other than using a fixed schedule, we do not require any special treatment for the first or last model layers or training epochs, which is otherwise often required to produce good approximations. We show the results of representative runs.

After training, the model is quantized using symmetric uniform quantization in (5), which can be performed using the following PyTorch Paszke et al. (2017) code snippet

    import torch.nn as nn

    def quantize_model(model, frequency):
        def quantize_weights(m):
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                c = m.weight.abs().max().data      # largest weight magnitude
                m.weight.data.mul_(frequency / c)  # scale to [-frequency, frequency]
                m.weight.data.round_()             # snap to the integer grid
                m.weight.data.mul_(c / frequency)  # scale back to the original range
        model.apply(quantize_weights)

We illustrate the difference between the original model, the original model with regularization (Regu), and the quantized model (Quant) for the CIFAR10 dataset in Fig. 4. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the other models is plotted with color bars. Notice that the model accuracy changes significantly depending on the number of bits used for quantization. For instance, there seems to be a clear boundary between 4 and 8 bits, separating the cases where there are enough bits to represent the information from those where there are not. Notice that while the training succeeds in all cases, the quantization fails to produce accurate results with fewer than 8 bits.
Also, we illustrate the attained model accuracy with different starting amplitudes for the CIFAR10 dataset in Fig. 5. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. Notice that using an adaptive rather than a static amplitude allows us to reach higher test accuracy. Also, in our experiments we have found that it is good practice to choose the initial amplitude and the fixed schedule such that the final amplitude is in the range of 0.001 to 0.01, which would correspond to a reasonable value of the regularization scaling parameter in (7). We observe similar results on the ImageNet dataset, as seen in Fig. 6.
Finally, notice that both sine and hat functions perform equally well on the ImageNet dataset, as shown in Fig. 7. Once again, the accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. In all plots, amplitude denotes the final amplitude. The detailed results are also summarized in Tab. 2 and 3.

Tab. 2 (CIFAR10)
                       Default model   Quantized model (with sine)
                       test (best)     8 bits   8 bits   8 bits   4 bits   2 bits
Amplitude              n/a             0.1      0.01     0.001    0.001    0.001
Test accuracy (fixed)  84.72 (87.70)   29.26    58.18    79.18    n/a      n/a
Test accuracy (dyn)    84.72 (87.70)   53.28    74.14    87.46    16.66    10.20

Tab. 3 (ImageNet)
                 Default model   Quantized model (with sine)   (with hat)
                                 8 bits   8 bits   8 bits      8 bits
Amplitude        n/a             0.1      0.01     0.001       0.001
Top-1 accuracy   75.84           1.29     69.02    75.77       75.57
Top-5 accuracy   92.90           4.62     89.27    92.54       92.58

4 Conclusion and Future Work
We have proposed a novel technique for quantizing neural networks, based on regularization with periodic functions. We have shown that it can be effectively used to quantize ResNets on CIFAR10 and ImageNet datasets. In our experiments we have achieved virtually no loss in accuracy vis-à-vis the standard model by using amplitude scaling on a fixed schedule through training followed by 8-bit integer quantization. While results of similar quality exist for quantization of CNNs, in this note we have achieved them through a completely novel method. In the future, we would like to incorporate the quantization of activations into this approach and experiment with more classes of neural networks.
Acknowledgements
The authors would like to thank Marat Dukhan, Bram Wasti and Satish Nadathur for collecting ARM Cortex A7 and A75 processor instruction information as well as Misha Smelyanskiy for his helpful comments and suggestions.
References

 Albericio et al. (2016) J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM/IEEE 43rd Annual International Symposium on Computer Architecture, 2016.
 Bagherinezhad et al. (2017) H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. Proc. Computer Vision and Pattern Recognition, 2017.
 Chen et al. (2017) Y.-H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52:127–138, 2017.
 Choi et al. (2018) J. Choi, P. Chuang, Z. Wang, S. Venkataramani, V. Srinivasan, and K. Gopalakrishnan. Bridging the accuracy gap for 2bit quantized neural networks (QNN). Proc. Computer Vision and Pattern Recognition, 2018.
 Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, 2015. URL https://arxiv.org/abs/1511.00363.
 Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. ElYaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or 1. CoRR, 2016. URL https://arxiv.org/abs/1602.02830.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A largescale hierarchical image database. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 248–255, 2009. URL http://www.imagenet.org.
 Denton et al. (2014) E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and B. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Proc. Neural Information Processing Systems, 2014.
 Devarakonda et al. (2017) A. Devarakonda, M. Naumov, and M. Garland. Adabatch: Adaptive batch sizes for training deep neural networks. CoRR, 2017. URL https://arxiv.org/abs/1712.02029.
 Dolbeau (2018) R. Dolbeau. Theoretical peak flops per instruction set: A tutorial. The Journal of Supercomputing, 74:1341–1377, 2018.
 Gong et al. (2014) Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, 2014. URL https://arxiv.org/abs/1412.6115.
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. URL https://www.deeplearningbook.org.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, 2015. URL http://arxiv.org/abs/1512.03385.

 Hubara et al. (2018) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18:1–30, 2018.
 Humrick (2017) M. Humrick. Exploring DynamIQ and ARM’s new CPUs: Cortex-A75, Cortex-A55. AnandTech, 2017. URL https://www.anandtech.com/show/11441/dynamiqandarmsnewcpuscortexa75a55.
 Hung et al. (2015) P. Hung, C. Lee, S. Yang, V. S. Somayazulu, Y. Chen, and S. Chien. Bridge deep learning to the physical world: An efficient method to quantize network. IEEE Signal Processing Systems, 2015.
 Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conf. Machine Learning, pp. 448–456, 2015. URL http://proceedings.mlr.press/v37/ioffe15.html.
 Jacob et al. (2017) B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integerarithmeticonly inference. CoRR, 2017. URL https://arxiv.org/abs/1712.05877.
 Jaderberg et al. (2014) M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
 Jia et al. (2014) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, 2014.
 Kim et al. (2017) D. Kim, J. Ahn, and S. Yoo. ZeNA: Zeroaware neural network accelerator. IEEE Design and Test, 35:39–46, 2017.
 Krishnamoorthi (2018) R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR, 2018.
 Krizhevsky et al. (2009) A. Krizhevsky, V. Nair, and G. Hinton. CIFAR10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/˜kriz/cifar.html.
 Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E Hinton. ImageNet classification with deep convolutional neural networks. Advances Neural Information Processing Systems, pp. 1097–1105, 2012.
 LeCun et al. (1989a) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989a.
 LeCun et al. (1989b) Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard. Handwritten digit recognition: Applications of neural net chips and automatic learning. IEEE Communication, pp. 41–46, 1989b.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998. URL http://yann.lecun.com/exdb/mnist.
 Li et al. (2016) F. Li, B. Zhang, and B. Liu. Ternary weight networks. CoRR, 2016. URL https://arxiv.org/abs/1605.04711.
 Mellempudi et al. (2017) N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with finegrained quantization. CoRR, 2017. URL https://arxiv.org/abs/1705.01462.

 Migasz (2017) S. Migasz. 8-bit inference with Tensor RT. GTC 2017. URL http://ondemand.gputechconf.com/gtc/2017/presentation/s73108bitinferencewithtensorrt.pdf.
 Mishra & Marr (2018) A. Mishra and D. Marr. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. CoRR, 2018. URL https://arxiv.org/abs/1711.05852.

 NVIDIA (2018) NVIDIA. Turing architecture whitepaper. 2018. URL https://www.nvidia.com/content/dam/enzz/Solutions/designvisualization/technologies/turingarchitecture/NVIDIATuringArchitectureWhitepaper.pdf.
 Parashar et al. (2017) A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM/IEEE 44th Annual International Symposium on Computer Architecture, 2017.
 Park et al. (2018) E. Park, S. Yoo, and P. Vajda. Valueaware quantization for training and inference of neural networks. CoRR, 2018. URL https://arxiv.org/abs/1804.07802.
 Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. Proc. Neural Information Processing Systems, 2017.
 Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNORNet: ImageNet classification using binary convolutional neural networks. CoRR, 2016. URL https://arxiv.org/abs/1603.05279.
 Reagen et al. (2016) B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. HernandezLobato, G.Y. Wei, and D. Brooks. Minerva: Enabling lowpower, highlyaccurate deep. ACM/IEEE 43rd Annual International Symposium on Computer Architecture, 2016.
 Shimpi (2011) A. L. Shimpi. ARM’s Cortex A7: Bringing cheaper dualcore & more power efficient highend devices. AnandTech, 2011. URL https://www.anandtech.com/show/4991/armscortexa7bringingcheaperdualcoremorepowerefficient//highenddevices.
 Stenger (1993) F. Stenger. Numerical methods based on sinc and analytic functions. Springer Series in Computational Mathematics, 20, 1993.
 Strang & Fix (2008) G. Strang and G. Fix. An Analysis of the Finite Element Method. WellesleyCambridge Press, 2nd Ed., 2008.
 Szegedy et al. (2014) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, 2014. URL https://arxiv.org/abs/1409.4842.
 Venkatesh et al. (2016) G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using lowprecision and sparsity. CoRR, 2016. URL https://arxiv.org/abs/1610.00324.
 Wang et al. (2018) P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng. Twostep quantization for lowbit neural networks. Proc. Computer Vision and Pattern Recognition, 2018.
 Wen et al. (2016) H. Wen, S. Zhou, Z. Liang, Y. Zhang, D. Feng, X. Zhou, and C. Yao. Training bit fully convolutional network for fast semantic segmentation. CoRR, 2016. URL https://arxiv.org/abs/1612.00212.
 Zhou et al. (2017) Y. Zhou, S.M. MoosaviDezfooli, N.M. Cheung, and P. Frossard. Adaptive quantization for deep neural network. CoRR, 2017. URL https://arxiv.org/abs/1712.01048.
5 Appendix: Brief Background
Machine learning models are used in the fields of computer vision (CV) and natural language processing (NLP), among many others. In particular, deep learning models based on neural networks composed of multiple layers have achieved unprecedented gains in accuracy of image classification and object detection tasks Krizhevsky et al. (2012); Szegedy et al. (2014); LeCun et al. (1998); Krizhevsky et al. (2009); Deng et al. (2009).

In this paper we focus on the CV deep learning models that often rely on convolutional neural networks (CNNs), which are mainly composed of multiple convolution, fully connected and batch normalization layers LeCun et al. (1989a, b); Goodfellow et al. (2016); Ioffe & Szegedy (2015). For example, we will investigate ResNet20 on CIFAR10 and ResNet50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). For completeness we review the most common layers next.
The convolution layer can be defined as
(14) 
where input image , the filter , while denotes a convolution [2] and denotes a pooling operation, resulting in output with and for strides . The operation is usually repeated for output channels, resulting in .

[2] In this context it is also common to use a cross-correlation rather than a convolution.

The fully connected layer is defined as
(15) 
where input , weights , bias , unit vector , the nonlinear activation function is applied componentwise on intermediate and output for batch size.

The typical batch normalization layer can be written as
(16) 
where input , scaled diagonal matrix of weights , bias , unit vector for batch size Devarakonda et al. (2017).