1 Introduction
Though deep neural networks (DNNs) have established themselves as powerful predictive models achieving human-level accuracy on many machine learning tasks
(He et al., 2016), their excellent performance has been achieved at the expense of a very high computational and parameter complexity. For instance, AlexNet (Krizhevsky et al., 2012) requires hundreds of millions of multiply-accumulates (MACs) per image and has 60 million parameters, while DeepFace (Taigman et al., 2014) has a similarly high MAC count per image and involves more than 120 million parameters. DNNs' enormous computational and parameter complexity leads to high energy consumption (Chen et al., 2017), makes their training via the stochastic gradient descent (SGD) algorithm very slow, often requiring hours or days (Goyal et al., 2017), and inhibits their deployment on energy- and resource-constrained platforms such as mobile devices and autonomous agents.

A fundamental problem contributing to the high computational and parameter complexity of DNNs is their realization using 32b floating-point (FL) arithmetic on GPUs and CPUs. Reduced-precision representations such as quantized FL (QFL) and fixed-point
(FX) have been employed in various combinations for both training and inference. Many works employ FX during inference but train in FL; e.g., fully binarized neural networks
(Hubara et al., 2016) use 1b FX in the forward inference path, but the network is trained in 32b FL. Similarly, Gupta et al. (2015) employ 16b FX for all tensors except the internal accumulators, which use 32b FL, and 3-level QFL gradients were employed (Wen et al., 2017; Alistarh et al., 2017) to accelerate training in a distributed setting. Note that while QFL reduces storage and communication costs, it does not reduce the computational complexity, as the arithmetic remains in 32b FL.

Thus, none of the previous works address the fundamental problem of realizing true fixed-point DNN training, i.e., an SGD algorithm in which all parameters/variables and all computations are implemented in FX with the minimum precision required to guarantee the network's inference/prediction accuracy and training convergence. The reasons for this gap are numerous: 1) quantization errors propagate to the network output, thereby directly affecting its accuracy (Lin et al., 2016); 2) the precision requirements of different variables in a network are interdependent and involve hard-to-quantify trade-offs (Sakr et al., 2017); 3) proper quantization requires knowledge of the dynamic range, which may not be available (Pascanu et al., 2013); and 4) quantization errors may accumulate during training and can lead to stability issues (Gupta et al., 2015).
Our work makes a major advance in closing this gap by proposing a systematic methodology to obtain close-to-minimum per-layer precision requirements of an FX network that guarantee statistical similarity with full-precision training. In particular, we jointly address the challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability. As in (Sakr et al., 2017), we do assume that a fully trained baseline FL network exists and that one can observe its learning behavior. While, in principle, such an assumption requires extra FL computation prior to FX training, it is to be noted that much of training is done in FL anyway. For instance, FL training is used to establish benchmarking baselines such as AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2014), and ResNet (He et al., 2016), to name a few. Even when that is not the case, in practice, this assumption can be accounted for via a warm-up FL training on a small held-out portion of the dataset (Dwork et al., 2015).
Applying our methodology to three benchmarks reveals several lessons. First and foremost, our work shows that it is possible to FX-quantize all variables, including back-propagated gradients, even though their dynamic range is unknown (Köster et al., 2017). Second, we find that the per-layer weight precision requirements decrease from the input to the output, while those of the activation gradients and weight accumulators increase. Furthermore, the precision requirements for residual networks are found to be uniform across layers. Finally, aggressive precision reduction techniques, such as weight and activation binarization (Hubara et al., 2016) or gradient ternarization (Wen et al., 2017), are not as efficient as our methodology, since they do not address the fundamental problem of realizing true fixed-point DNN training.
We demonstrate FX training on three deep learning benchmarks (CIFAR-10, CIFAR-100, SVHN), achieving high fidelity to our FL baseline: we observe no loss of accuracy greater than 0.56% in all of our experiments. Our precision assignment is further shown to be within 1b per tensor of the minimum. We show that our precision assignment methodology reduces the representational, computational, and communication costs of training by up to 6×, 8×, and 4×, respectively, compared to the FL baseline and related works.

2 Problem Setup, Notation, and Metrics
We consider an L-layer DNN deployed on an M-class classification task using the setup in Figure 1. We denote the precision configuration as the L×5 matrix whose l-th row consists of the precisions (in bits) of the weight, activation, weight gradient, activation gradient, and internal weight accumulator tensors at layer l. This DNN quantization setup is summarized in Appendix A.
2.1 Fixed-Point Constraints & Definitions
We present definitions and constraints related to fixed-point arithmetic based on the design of fixed-point adaptive filters and signal processing systems (Parhi, 2007):


A signed fixed-point scalar $x$ with precision $B_x$ and binary representation $(b_0, b_1, \ldots, b_{B_x-1})$ is equal to $x = x_m\left(-b_0 + \sum_{i=1}^{B_x-1} b_i 2^{-i}\right)$, where $x_m$ is the predetermined dynamic range (PDR) of $x$. The PDR is constrained to be a constant power of 2 to minimize hardware overhead.

An unsigned fixed-point scalar $x$ with precision $B_x$ and binary representation $(b_1, \ldots, b_{B_x})$ is equal to $x = x_m\sum_{i=1}^{B_x} b_i 2^{-i}$.

A fixed-point scalar is called normalized if its PDR satisfies $x_m = 1$.

The precision is determined as $B_x = 1 + \log_2(x_m/\Delta_x)$ for a signed scalar (and $B_x = \log_2(x_m/\Delta_x)$ for an unsigned one), where $\Delta_x$ is the quantization step size, which is the value of the least significant bit (LSB).

An additive model for quantization is assumed: $x_q = x + q_x$, where $x_q$ is the fixed-point number obtained by quantizing the floating-point scalar $x$,
$q_x$ is a random variable uniformly distributed on the interval $\left[-\frac{\Delta_x}{2}, \frac{\Delta_x}{2}\right]$, and the quantization noise variance is
$\sigma_{q_x}^2 = \frac{\Delta_x^2}{12}$. The notion of quantization noise is most useful when there is limited knowledge of the distribution of $x$.
The relative quantization bias is the normalized offset between the expected value of the quantized scalar and the first unbiased quantization level. The notion of quantization bias is useful when there is some knowledge of the distribution of $x$.

The reflected quantization noise variance from a tensor $X$ to a scalar $y = f(X)$, for an arbitrary function $f$, is $\sigma_{y|X}^2 = \frac{\Delta_X^2}{12} E_{y|X}$, where $\Delta_X$ is the quantization step of $X$ and $E_{y|X}$ is the quantization noise gain from $X$ to $y$.
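For illustration, the definitions above can be exercised in a few lines of code. The sketch below (our own illustration, not part of the method) quantizes a tensor to a signed fixed-point grid with a power-of-2 PDR and checks that the empirical noise variance matches the predicted $\Delta_x^2/12$:

```python
import numpy as np

def quantize_signed(x, B, x_m):
    """Quantize x to a signed fixed-point grid with precision B bits
    and predetermined dynamic range (PDR) x_m (a power of 2)."""
    delta = x_m / 2 ** (B - 1)          # LSB: B = 1 + log2(x_m / delta)
    xq = delta * np.round(x / delta)    # round to the nearest grid point
    return np.clip(xq, -x_m, x_m - delta)

rng = np.random.default_rng(0)
x = rng.uniform(-0.9, 0.9, size=100_000)
B, x_m = 8, 1.0
xq = quantize_signed(x, B, x_m)
delta = x_m / 2 ** (B - 1)
# Empirical quantization-noise variance vs. the additive model delta^2 / 12:
print(np.var(xq - x), delta ** 2 / 12)
```

With B = 8 and a PDR of 1, the measured noise variance agrees with the additive model to within sampling error.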
2.2 Complexity Metrics
We use a set of metrics inspired by those introduced by Sakr et al. (2017), which have also been used by Wu et al. (2018a). These metrics are algorithmic in nature, which makes them easily reproducible.


Representational cost for weights and activations: the total number of bits needed to represent the weights, weight gradients, and internal weight accumulators (weight representational cost), and the total number of bits needed to represent the activations and activation gradients (activation representational cost). We use $|T|$ to denote the number of elements in a tensor $T$. Unquantized tensors are assumed to have a 32b FL representation, which is the single-precision format on a GPU.
Computational cost of training: a measure of the number of 1b full adders (FAs) utilized for all multiplications in one backprop iteration, where the cost of a multiplication is taken to be the product of the two multiplicands' precisions and $D_l$ denotes the dimensionality of the dot product needed to compute one output activation at layer $l$. When considering 32b FL multiplications, we ignore the cost of exponent addition, thereby favoring the conventional FL implementation. Boundary effects in convolutions are neglected.
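The two metrics can be made concrete with a toy calculation. In the sketch below, the layer sizes and precisions are hypothetical, and the computational cost counts only the forward-path multiplications (the paper's metric covers all multiplications in a backprop iteration), with a $B_1 \times B_2$ fixed-point multiply costed at $B_1 B_2$ full adders:

```python
# Hypothetical per-layer sizes and precisions (illustrative only).
layers = [
    # (num_weights, num_activations, dot_product_length,
    #  B_W, B_A, B_GW, B_GA, B_acc)
    (1728, 32768, 27, 8, 6, 10, 10, 16),
    (36864, 16384, 576, 7, 6, 11, 11, 18),
]

def representational_cost(layers):
    """Total bits for {weights, weight grads, accumulators} and for
    {activations, activation grads}, respectively."""
    c_w = sum(n_w * (b_w + b_gw + b_acc)
              for n_w, _, _, b_w, _, b_gw, _, b_acc in layers)
    c_a = sum(n_a * (b_a + b_ga)
              for _, n_a, _, _, b_a, _, b_ga, _ in layers)
    return c_w, c_a

def computational_cost(layers):
    """1b full adders for the forward-path multiplications: each output
    activation needs a dot product of length d, and a B_W x B_A multiply
    costs roughly B_W * B_A full adders."""
    return sum(n_a * d * b_w * b_a
               for _, n_a, d, b_w, b_a, _, _, _ in layers)

print(representational_cost(layers), computational_cost(layers))
```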
3 Precision Assignment Methodology and Analysis
We aim to obtain a minimal or close-to-minimal precision configuration of an FX network such that the mismatch probability between its predicted label and that of an associated FL network is bounded, and the convergence behavior of the two networks is similar.
Hence, we require that: (1) all quantization noise sources in the forward path contribute identically to the mismatch budget (Sakr et al., 2017), (2) the gradients be properly clipped in order to limit the dynamic range (Pascanu et al., 2013), (3) the accumulation of quantization noise bias in the weight updates be limited (Gupta et al., 2015), (4) the quantization noise in activation gradients be limited as these are backpropagated to calculate the weight gradients, and (5) the precision of weight accumulators should be set so as to avoid premature stoppage of convergence (Goel and Shanbhag, 1998). The above insights can be formally described via the following five quantization criteria.
Criterion 1.
Equalizing Feedforward Quantization Noise (EFQN) Criterion. The quantization noise variances reflected onto the mismatch probability from all feedforward weights and activations should be equal.
Criterion 2.
Gradient Clipping (GC) Criterion. The clipping rates of the weight and activation gradients should be less than a predetermined maximum value.
Criterion 3.
Relative Quantization Bias (RQB) Criterion. The relative quantization bias of the weight gradients should be less than a predetermined maximum value.
Criterion 4.
Backpropagated Quantization Noise (BQN) Criterion. The quantization noise variance reflected onto the weight gradients from quantizing the activation gradients, i.e., the total sum of the element-wise variances so reflected, should be less than the total sum of the element-wise variances of the weight gradients themselves.
Criterion 5.
Accumulator Stopping (AS) Criterion. The quantization noise of the internal accumulator should be zero; equivalently, the total reflected quantization noise variance due to quantizing the accumulator (the sum of its element-wise variances) should vanish.
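As a concrete reading of the GC criterion, the sketch below (hypothetical helper names; the actual method derives closed-form PDR bounds rather than searching) measures the clipping rate of a gradient tensor and finds the smallest power-of-2 PDR that keeps the rate below a target maximum:

```python
import numpy as np

def clipping_rate(g, pdr):
    """Fraction of entries whose magnitude exceeds the PDR (i.e., get clipped)."""
    return np.mean(np.abs(g) > pdr)

def smallest_pdr(g, p_max):
    """Smallest power-of-2 PDR whose clipping rate stays below p_max."""
    pdr = 2.0 ** np.ceil(np.log2(np.max(np.abs(g))))  # smallest power of 2 with no clipping
    while clipping_rate(g, pdr / 2) < p_max:          # shrink while the GC criterion holds
        pdr /= 2
    return pdr

rng = np.random.default_rng(0)
g = rng.normal(0.0, 0.01, size=1_000_000)  # stand-in for a gradient tensor
pdr = smallest_pdr(g, p_max=1e-5)
```

A smaller PDR means a smaller quantization step at a fixed precision, so the search stops exactly where overflow errors would begin to violate the clipping-rate target.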
Further explanations of and motivations behind the above criteria are presented in Appendix B. The following claim ensures the satisfiability of the above criteria and leads to closed-form expressions for the precision requirements we seek, completing our methodology. The validity of the claim is proved in Appendix C.
Claim 1.
Satisfiability of Quantization Criteria. The five quantization criteria (EFQN, GC, RQB, BQN, AS) are satisfied if:


The precisions $B_W^{(l)}$ and $B_A^{(l)}$ are set as follows:
$B_W^{(l)} = B_{\min} + \mathrm{round}\big(\log_2\sqrt{E_W^{(l)}/E_{\min}}\big)$ and $B_A^{(l)} = B_{\min} + \mathrm{round}\big(\log_2\sqrt{E_A^{(l)}/E_{\min}}\big)$ (1) for $l = 1, \ldots, L$, where $\mathrm{round}()$ denotes the rounding operation, $E_W^{(l)}$ and $E_A^{(l)}$ are the weight and activation quantization noise gains at layer $l$, respectively, $B_{\min}$ is a reference minimum precision, and $E_{\min} = \min_l \big(E_W^{(l)}, E_A^{(l)}\big)$.

The weight and activation gradient PDRs are lower bounded (2) by a fixed multiple, determined by the maximum clipping rate, of the largest recorded estimates of the weight and activation gradient standard deviations, respectively.
The weight and activation gradient quantization step sizes are upper bounded (3) in terms of the smallest recorded estimate of the weight gradient standard deviation and of the largest singular value of the square-Jacobian (the Jacobian matrix with squared entries) of the weight gradients with respect to the activation gradients.
The accumulator PDR and step size are set (4) such that, given the smallest value of the learning rate used during training, the smallest weight update is always representable and the accumulator does not overflow.
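To illustrate how a closed form of this kind is used, the sketch below assumes an EFQN-style rule in which equalizing the per-tensor products $\Delta^2 E$ assigns each tensor an offset of $\mathrm{round}(\log_2\sqrt{E/E_{\min}})$ bits above the reference precision given to the tensor with the smallest noise gain; both the rule's exact constants and the gain values below are illustrative assumptions, not the paper's:

```python
import numpy as np

def efqn_precisions(noise_gains, b_min):
    """Per-tensor precisions that (approximately) equalize the reflected
    quantization noise variances Delta^2 * E / 12, given estimated
    quantization noise gains E. The tensor with the smallest gain is
    assigned the reference minimum precision b_min."""
    e = np.asarray(noise_gains, dtype=float)
    # Halving Delta adds one bit, so equal Delta^2 * E means an offset of
    # 0.5 * log2(E / E_min) bits, rounded to the nearest integer.
    return b_min + np.round(0.5 * np.log2(e / e.min())).astype(int)

# Hypothetical noise gains for the weights of four layers:
precisions = efqn_precisions([1.0, 4.0, 16.0, 64.0], b_min=5)
```

A tensor whose noise gain is 4× larger thus receives one extra bit, which keeps its reflected noise contribution on par with the others.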
Practical considerations: Note that one of the feedforward precisions will equal the reference minimum precision. The formulas to compute the quantization noise gains are given in Appendix C and require only one forward-backward pass on an estimation set. We would like the EFQN criterion to hold upon convergence; hence, (1) is computed using the converged model from the FL baseline. For backward signals, setting the values of the PDR and LSB is sufficient to determine the precision, using the identity relating precision, PDR, and step size explained in Section 2.1. As per Claim 1, estimates of the second-order statistics of the gradient tensors are required. These are obtained via tensor spatial averaging, so that only one estimate per tensor is required, and are updated in a moving-window fashion, as is done for the normalization parameters in BatchNorm (Ioffe and Szegedy, 2015). Furthermore, it might seem that computing the Jacobian in (3) is a difficult task; however, the values of its elements are already computed by the backprop algorithm, requiring no additional computations (see Appendix C). Thus, the Jacobians (at different layers) are also estimated during training. Due to the typically very large size of modern neural networks, we average the Jacobians spatially, i.e., the activations are aggregated across channels and minibatches while the weights are aggregated across filters. This is again inspired by the work on Batch Normalization (Ioffe and Szegedy, 2015) and makes the probed Jacobians much smaller.

4 Numerical Results
We conduct numerical simulations in order to illustrate the validity of the predicted precision configuration and to investigate its minimality and benefits. We employ three deep learning benchmarking datasets: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton, 2009), and SVHN (Netzer et al., 2011). All experiments were done using a Pascal P100 NVIDIA GPU. We train the following networks:


CIFAR-10 ConvNet: a 9-layer convolutional neural network trained on the CIFAR-10 dataset, composed of convolutional layers, max pooling operations, and fully connected layers.
SVHN ConvNet: the same network as the CIFAR-10 ConvNet, but trained on the SVHN dataset.

CIFAR-100 ResNet: the same network as the CIFAR-10 ResNet, save for the last layer, which is modified to match the number of classes (100) in CIFAR-100.
A step-by-step description of the application of our method to the above four networks is provided in Appendix E. We hope that the inclusion of these steps will: (1) clarify any ambiguity the reader may have from the previous section, and (2) facilitate the reproduction of our results.
4.1 Precision Configuration & Convergence
[Figure 2: per-layer precision configurations for the four networks; the legend gives the average number of bits per tensor type. For the ResNets, layer depths 21 and 22 correspond to the strided convolutions in the shortcut connections of residual blocks 4 and 7, respectively. Activation gradients start at layer 2 and are "shifted to the left" in order to be aligned with the other tensors.]

The precision configuration predicted by our proposed method, with chosen target values for the mismatch budget, clipping rate, and quantization bias, is depicted in Figure 2 for each of the four networks considered. We observe that the configuration depends on the network type. Indeed, the precisions of the two ConvNets follow similar trends, as do those of the two ResNets. Furthermore, the following observations are made for the ConvNets:


weight precision decreases as depth increases. This is consistent with the observation that weight perturbations in the earlier layers are the most destructive (Raghu et al., 2017).

the precisions of the activation gradients and internal weight accumulators increase as depth increases, which we interpret as follows: (1) the backpropagation of gradients is the dual of the forward propagation of activations, and (2) accumulators store the most information, as their precision is the highest.

the precisions of the weight gradients and activations are relatively constant across layers.
Interestingly, for the ResNets, the precision is mostly uniform across the layers. Furthermore, the gap between the accumulator precision and the other precisions is not as pronounced as in the case of the ConvNets. This suggests that information is spread equally among all signals, which we speculate is due to the shortcut connections preventing the shattering of information (Balduzzi et al., 2017).
The FX training curves in Figure 3 indicate that the predicted precision configuration leads to convergence and consistently tracks the FL curves with close fidelity. This validates our analysis and justifies our choice of targets.
4.2 Near-Minimality of the Predicted Precision Configuration
To determine that the predicted precision configuration is close to minimal, we compare it with configurations obtained by incrementing or decrementing every entry by 1, i.e., we perturb the assignment by 1b in either direction. (PDRs are unchanged across configurations, except for the accumulator's, as per (4).) Figure 3 also contains the convergence curves for the two new configurations. As shown, the decremented configuration always results in a noticeable gap compared to ours for both the loss function (except for the CIFAR-10 ResNet) and the test error. Furthermore, the incremented configuration offers no observable improvements over ours (except for the test error of the CIFAR-10 ConvNet). These results support our contention that the predicted assignment is close to minimal, in that increasing the precision leads to diminishing returns while reducing it leads to a noticeable degradation in accuracy. Additional experimental results provided in Appendix D further support this contention. Furthermore, by studying the impact of quantizing specific tensors, we determine that accuracy is most sensitive to the precision assigned to the weights and activation gradients.











CIFAR-10 ConvNet  /  SVHN ConvNet
Method  Repr. W  Repr. A  Comp.  Comm.  Test error  Repr. W  Repr. A  Comp.  Comm.  Test error
FL  148  9.3  94.4  49  12.02%  148  9.3  94.4  49  2.43%
FX (ours)  56.5  1.7  11.9  14  12.58%  54.3  1.9  10.5  14  2.58%
BN  100  4.7  2.8  49  18.50%  100  4.7  2.8  49  3.60%
SQ  78.8  1.7  11.9  14  11.32%  76.3  1.9  10.5  14  2.73%
TG  102  9.3  94.4  3.1  12.49%  102  9.3  94.4  3.1  3.65%
CIFAR-10 ResNet  /  CIFAR-100 ResNet
Method  Repr. W  Repr. A  Comp.  Comm.  Test error  Repr. W  Repr. A  Comp.  Comm.  Test error
FL  1784  96  4319  596  7.42%  1789  97  4319  597  28.06%
FX (ours)  726  25  785  216  7.51%  750  25  776  216  27.43%
BN  1208  50  128  596  7.24%  1211  50  128  597  29.35%
SQ  1062  25  785  216  7.42%  1081  25  776  216  28.03%
TG  1227  96  4319  37.3  7.94%  1230  97  4319  37.3  30.62%
4.3 Complexity vs. Accuracy
We would like to quantify the reduction in training costs achieved by our proposed method, along with its expense in terms of accuracy, and compare both against other methods. Importantly, for a fair comparison, the same network architecture and training procedure are used throughout. We report the representational costs of weights and activations, the computational cost, the communication cost, and the test error for each of the four networks considered, for the following training methods:


baseline FL training and FX training using the predicted precision configuration,

binarized training (BN), where the feedforward weights and activations are binarized as in (Hubara et al., 2016) while the gradients and accumulators are kept in full precision,

fixed-point training with stochastic quantization (SQ). As was done by Gupta et al. (2015), we quantize feedforward weights and activations as well as all gradients, but accumulators are kept in floating-point. The precision configuration (excluding accumulators) is inherited from ours (hence, we determine exactly how much stochastic quantization helps),

training with ternarized gradients (TG) as was done in TernGrad (Wen et al., 2017). All computations are done in floating-point, but weight gradients are ternarized according to the instantaneous tensor spatial standard deviations, as suggested by Wen et al. (2017). To compute costs, we assume all weight gradients use two bits, although they are not truly fixed-point and do require the computation of 32b floating-point scalars for every tensor.
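For reference, a TernGrad-style ternarization step can be sketched as follows. This is our simplified rendition: we use the tensor maximum as the scaler, whereas Wen et al. (2017) derive it from the tensor statistics. The property being illustrated is that the ternarized gradient is an unbiased estimate of the original:

```python
import numpy as np

def ternarize(g, rng):
    """Stochastically map each entry to {-s, 0, +s} so that E[out] = g:
    an entry survives (as s * sign(g)) with probability |g| / s."""
    s = np.max(np.abs(g))
    if s == 0.0:
        return np.zeros_like(g)
    keep = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * keep

rng = np.random.default_rng(0)
g = np.array([0.5, -0.25, 0.0, 1.0])
# Averaged over many draws, the ternarized gradient recovers the original.
avg = np.mean([ternarize(g, rng) for _ in range(20_000)], axis=0)
```

Only the scaler needs full precision per tensor, which is why communication drops to roughly 2b per gradient entry.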
The comparison is presented in Table 1. The first observation is a massive complexity reduction compared to FL. For instance, for the CIFAR-10 ConvNet, the reductions are approximately 2.6×, 5.5×, 7.9×, and 3.5× for the weight representational, activation representational, computational, and communication costs, respectively. Similar trends are observed for the other three networks. This complexity reduction comes at the expense of no more than a 0.56% increase in test error. For the CIFAR-100 network, the accuracy when training in fixed-point is even better than that of the baseline.
The representational and communication costs of BN are significantly greater than those of FX because the gradients and accumulators are kept in full precision, which masks the benefits of binarizing the feedforward tensors. However, benefits are noticeable when considering the computational cost, which is the lowest, as binarization eliminates multiplications. Furthermore, binarization causes a severe accuracy drop for the ConvNets but, surprisingly, not for the ResNets. We speculate that this is due to the high-dimensional geometry of ResNets (Anderson and Berg, 2017).
As for SQ, since the precision configuration was inherited from ours, all costs are identical to those of FX, save for the representational cost of the weights, which is larger due to the full-precision accumulators. Furthermore, SQ has a positive effect only on the CIFAR-10 ConvNet, where it clearly acted as a regularizer.
TG provides no complexity reduction in terms of representational and computational costs, which is expected, as it only compresses the weight gradients. Additionally, the resulting accuracy is slightly worse than that of all other considered schemes, including FX. Naturally, it has the lowest communication cost, as the weight gradients are quantized to just 2b.
5 Discussion
5.1 Related Works
Many works have addressed the general problem of reduced-precision/complexity deep learning.
Reducing the complexity of inference (forward path): several research efforts have addressed the problem of realizing a DNN's inference path in FX. For instance, the works in (Lin et al., 2016; Sakr et al., 2017) address the problem of precision assignment. While Lin et al. (2016) proposed a non-uniform precision assignment using the signal-to-quantization-noise ratio (SQNR) metric, Sakr et al. (2017) analytically quantified the trade-off between activation and weight precisions while providing minimal precision requirements for the inference path computations that bound the probability of a mismatch between the predicted labels of the FX network and its FL counterpart. An orthogonal approach, which can be applied on top of quantization, is pruning (Han et al., 2015). While significant inference efficiency can be achieved, this approach incurs a substantial training overhead. A subset of the FX training problem was addressed in binary-weighted neural networks (Courbariaux et al., 2015; Rastegari et al., 2016) and fully binarized neural networks (Hubara et al., 2016), where direct training of neural networks with predetermined precisions in the inference path was explored, with the feedback path computations being done in 32b FL.
Reducing the complexity of training (backward path): finite-precision training was explored by Gupta et al. (2015), who employed stochastic quantization in order to counter quantization bias accumulation in the weight updates. This was done by quantizing all tensors to 16b FX, except for the internal accumulators, which were stored in a 32b floating-point format. An important distinction our work makes is the circumvention of the overhead of implementing stochastic quantization (Hubara et al., 2016). Similarly, DoReFa-Net (Zhou et al., 2016) stores internal weight representations in 32b FL but quantizes the remaining tensors more aggressively. Thus arises the need to rescale and recompute in floating-point format, which our work avoids. Finally, Köster et al. (2017) suggest a new number format, Flexpoint, and were able to train neural networks using slightly more than 16b per tensor element, with 5 shared exponent bits and a per-tensor dynamic range tracking algorithm. Such tracking causes a hardware overhead that is bypassed by our work, since our arithmetic is purely FX. Augmenting Flexpoint with stochastic quantization effectively results in WAGE (Wu et al., 2018b), which enables integer quantization of each tensor.
As seen above, none of the prior works address the problem of predicting the precision requirements of all training signals; the choice of precision is made in an ad hoc manner. In contrast, we propose a systematic methodology to determine close-to-minimal precision requirements for FX-only training of deep neural networks.
5.2 Conclusion
In this paper, we have presented a study of precision requirements in a typical backpropagation-based training procedure of neural networks. Using a set of quantization criteria, we have presented a precision assignment methodology with which FX training is made statistically similar to the FL baseline, known to converge a priori. We realized FX training of four networks on the CIFAR-10, CIFAR-100, and SVHN datasets and quantified the associated complexity reduction gains in terms of training costs. We also showed that our precision assignment is nearly minimal.
The presented work relies on the statistics of all tensors being quantized during training. This necessitates an initial baseline run in floating-point, which can be costly. An open problem is to predict a suitable precision configuration by observing only the data statistics and the network architecture. Future work can leverage the analysis presented in this paper to enhance the effectiveness of other network complexity reduction approaches. For instance, weight pruning can be viewed as a coarse quantization process (quantization to zero) and thus can potentially be done in a targeted manner by leveraging the information provided by the noise gains. Furthermore, parameter sharing and clustering can be viewed as forms of vector quantization, which presents yet another opportunity to leverage our method for complexity reduction.
Acknowledgment
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
References
 Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718.
 Anderson and Berg (2017) Anderson, A. G. and Berg, C. P. (2017). The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199.
 Balduzzi et al. (2017) Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.D., and McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pages 342–350.
 Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
 Chen et al. (2017) Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. (2017). Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138.
 Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.-P. (2015). BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131.
 Dwork et al. (2015) Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. (2015). The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638.
 Goel and Shanbhag (1998) Goel, M. and Shanbhag, N. (1998). Finite-precision analysis of the pipelined strength-reduced adaptive filter. IEEE Transactions on Signal Processing, 46(6):1763–1769.
 Goodfellow et al. (2013) Goodfellow, I. J. et al. (2013). Maxout networks. ICML (3), 28:1319–1327.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
 Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning with limited numerical precision. In Proceedings of The 32nd International Conference on Machine Learning, pages 1737–1746.
 Han et al. (2015) Han, S., Mao, H., and Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
 Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
 Köster et al. (2017) Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K., Constable, W., Elibol, O., Hall, S., Hornof, L., Khosrowshahi, A., et al. (2017). Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pages 1740–1750.
 Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
 Lin et al. (2016) Lin, D., Talathi, S., and Annapureddy, S. (2016). Fixed point quantization of deep convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 2849–2858.
 Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5.
 Parhi (2007) Parhi, K. (2007). VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons.

 Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
 Raghu et al. (2017) Raghu, M. et al. (2017). On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2847–2854.
 Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer.
 Sakr et al. (2017) Sakr, C., Kim, Y., and Shanbhag, N. (2017). Analytical guarantees on numerical precision of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3007–3016.
 Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 Srivastava et al. (2014) Srivastava, N. et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
 Taigman et al. (2014) Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to humanlevel performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708.
 Tyurin (2010) Tyurin, I. S. (2010). Refinement of the upper bounds of the constants in lyapunov’s theorem. Russian Mathematical Surveys, 65(3):586–588.
 Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. (2017). Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518.

 Wu et al. (2018a) Wu, J., Wang, Y., Wu, Z., Wang, Z., Veeraraghavan, A., and Lin, Y. (2018a). Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In International Conference on Machine Learning, pages 5359–5368.
 Wu et al. (2018b) Wu, S., Li, G., Chen, F., and Shi, L. (2018b). Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.
 Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
 Zhou et al. (2016) Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. (2016). DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
Appendix A Summary of Quantization Setup
The quantization setup depicted in Figure 1 is summarized as follows:


Feedforward computation at layer $l$:
$$A_{l+1} = f_l\left(W_l, A_l\right),$$
where $f_l$ is the function implemented at layer $l$, $A_l$ ($A_{l+1}$) is the activation tensor at layer $l$ ($l+1$) quantized to a normalized unsigned fixed-point format with precision $B_{A_l}$ ($B_{A_{l+1}}$), and $W_l$ is the weight tensor at layer $l$ quantized to a normalized signed fixed-point format with precision $B_{W_l}$. We further assume the use of a ReLU-like activation function with a clipping level of 2 and a max-norm constraint on the weights, which are clipped to $[-1, 1]$ at every iteration.
Backpropagation of activation gradients at layer $l$:
$$G^{(A)}_l = g_l\left(W_l, A_l, G^{(A)}_{l+1}\right),$$
where $g_l$ is the function that backpropagates the activation gradients at layer $l$, and $G^{(A)}_l$ ($G^{(A)}_{l+1}$) is the activation gradient tensor at layer $l$ ($l+1$) quantized to a signed fixed-point format with precision $B_{G_A}$.

Backpropagation of the weight gradient tensor at layer $l$:
$$G^{(W)}_l = h_l\left(A_l, G^{(A)}_{l+1}\right),$$
where $h_l$ is the function that backpropagates the weight gradients at layer $l$, and $G^{(W)}_l$ is quantized to a signed fixed-point format with precision $B_{G_W}$.

Internal weight accumulator update at layer $l$:
$$W^{(acc)}_l \leftarrow U\left(W^{(acc)}_l, \gamma, G^{(W)}_l\right) = W^{(acc)}_l - \gamma\, G^{(W)}_l,$$
where $U$ is the update function, $\gamma$ is the learning rate, and $W^{(acc)}_l$ is the internal weight accumulator tensor at layer $l$ quantized to a signed fixed-point format with precision $B_{acc}$. Note that, for the next iteration, $W_l$ is directly obtained from $W^{(acc)}_l$ via quantization to $B_{W_l}$ bits.
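As a concrete illustration, the accumulator step above can be sketched in NumPy with a generic uniform fixed-point quantizer. The `quantize` helper, the bit-widths, and the learning rate below are hypothetical choices for illustration, not the paper's settings:

```python
import numpy as np

def quantize(x, bits, signed=True, clip=1.0):
    """Uniform fixed-point quantizer: saturate to the predefined
    dynamic range, then round to the nearest quantization level."""
    step = clip * 2.0 ** (-(bits - 1) if signed else -bits)
    lo = -clip if signed else 0.0
    x = np.clip(x, lo, clip - step)   # saturate (clipping)
    return np.round(x / step) * step  # round to nearest multiple of step

# Single-layer sketch of the internal weight accumulator update:
# the accumulator keeps B_acc bits, and the feedforward weight copy W
# is re-quantized from it to B_W bits for the next iteration.
rng = np.random.default_rng(0)
Wacc = quantize(rng.standard_normal(4) * 0.1, bits=24)   # accumulator, B_acc = 24
Gw = quantize(rng.standard_normal(4) * 0.01, bits=8)     # weight gradient, B_GW = 8
gamma = 2.0 ** -6                                        # power-of-2 learning rate
Wacc = quantize(Wacc - gamma * Gw, bits=24)              # accumulator update
W = quantize(Wacc, bits=6)                               # feedforward weights, B_W = 6
```

Keeping the accumulator at a higher precision than the feedforward copy is what allows small updates to build up before the low-precision weight crosses a quantization level.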
Appendix B Further Explanations and Motivations behind Quantization Criteria
Criterion 1 (EFQN) is used to ensure that all feedforward quantization noise sources contribute equally to the overall noise budget. Indeed, if one of the reflected quantization noise variances from the feedforward tensors onto the mismatch probability, say that of the weight tensor $W_{l_0}$ at some layer $l_0$, largely dominates all others, it would imply that all tensors but $W_{l_0}$ are overly quantized. It would therefore be necessary to either increase the precision of $W_{l_0}$ or decrease the precisions of all other tensors. The application of Criterion 1 (EFQN) through the closed-form expression (1) in Claim 1 resolves this issue and avoids a trial-and-error approach.
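A small sketch of this allocation rule, assuming hypothetical quantization noise gains: each precision sits a number of bits above a reference `b_min` given by the log of its noise gain relative to the smallest one, so that all reflected noise variances come out equal.

```python
import numpy as np

def efqn_precisions(gains, b_min):
    """Assign per-tensor precisions so that every reflected noise
    variance (quantization step squared times noise gain) is
    approximately equalized across feedforward tensors."""
    e_min = min(gains.values())
    return {name: b_min + int(np.round(np.log2(np.sqrt(e / e_min))))
            for name, e in gains.items()}

# Hypothetical noise gains: the tensor with the smallest gain gets the
# reference precision; a 16x larger gain costs log2(sqrt(16)) = 2 bits.
gains = {"A1": 1.0, "W1": 16.0, "A2": 4.0, "W2": 64.0}
prec = efqn_precisions(gains, b_min=4)
```

With these (made-up) gains, the resulting per-tensor noise terms $2^{-2B} E$ are identical, which is exactly the EFQN condition.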
Because FX numbers have a constant PDR, clipping of gradients is needed since their dynamic range is a priori arbitrary. Ideally, a very small PDR would be preferred in order to obtain quantization steps of small magnitude, and hence less quantization noise. We can draw parallels from signal processing theory, where it is known that for a uniform $B$-bit quantizer, the signal-to-quantization-noise ratio (SQNR) in dB is equal to $6.02B + 4.77 - \mathrm{PAR}$, where $\mathrm{PAR}$ is the peak-to-average (power) ratio in dB, which grows with the PDR. Thus, we would like to reduce the PDR as much as possible in order to increase the SQNR for a given precision. However, this comes at the risk of overflows (i.e., clipping). Criterion 2 (GC) addresses this trade-off between quantization noise and overflow errors.
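This trade-off is easy to reproduce numerically. The sketch below (the 8-bit precision and the swept PDR values are arbitrary choices for illustration) measures the empirical SQNR of a uniform quantizer on Gaussian data: a PDR near $1\sigma$ clips heavily, while an overly large PDR wastes quantization levels.

```python
import numpy as np

def sqnr_db(signal, bits, pdr):
    """Empirical SQNR (dB) of a signed uniform quantizer whose peak is
    set to `pdr`: clipping error dominates for small PDR, granular
    quantization noise for large PDR."""
    step = pdr * 2.0 ** (-(bits - 1))
    q = np.round(np.clip(signal, -pdr, pdr - step) / step) * step
    noise = signal - q
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
g = rng.standard_normal(100_000)   # stand-in for a gradient tensor, sigma = 1
# Sweep the PDR at fixed precision: the intermediate value wins.
results = {pdr: sqnr_db(g, bits=8, pdr=pdr) for pdr in (1.0, 4.0, 16.0)}
```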
Since the backpropagation training procedure is an iterative one, it is important to ensure that no form of bias corrupts the weight update accumulation in a positive-feedback manner. FX quantization, being uniform, is likely to induce such a bias when the quantized quantities, most notably gradients, are not uniformly distributed. Criterion 3 (RQB) addresses this issue by using, as a proxy for this bias accumulation, a function of the quantization step size, and by ensuring that its worst-case value is small in magnitude.
Criterion 4 (BQN) is in fact an extension of Criterion 1 (EFQN) to the backpropagation phase. Indeed, once the precision (and hence quantization noise) of the weight gradients is set as per Criterion 3 (RQB), it is necessary to ensure that the quantization noise source at the activation gradients does not contribute more noise to the updates. This criterion sets the quantization step of the activation gradients.
Criterion 5 (AS) ties together feedforward and gradient precisions through the weight accumulators. It is required to increment/decrement the feedforward weights whenever the accumulated updates cross the weight quantization threshold; this is used to set the PDR of the weight accumulators. Furthermore, since the precision of the weight gradients has already been designed to account for quantization noise (through Criteria 2–4), the criterion requires that the accumulators do not introduce additional noise.
Appendix C Proof of Claim 1
The validity of Claim 1 is derived from the following five lemmas. Note that each lemma addresses the satisfiability of one of the five quantization criteria presented in the main text and corresponds to part of Claim 1.
Lemma 1.
The EFQN criterion holds if the precisions $B_{A_l}$ and $B_{W_l}$ are set as follows:
$$B_{A_l} = B_{\min} + \mathrm{round}\left(\log_2\sqrt{E_{A_l}/E_{\min}}\right), \qquad B_{W_l} = B_{\min} + \mathrm{round}\left(\log_2\sqrt{E_{W_l}/E_{\min}}\right)$$
for $l = 1, \ldots, L$, where $\mathrm{round}(\cdot)$ denotes the rounding operation, $B_{\min}$ is a reference minimum precision, and $E_{\min}$ is given by:
$$E_{\min} = \min_{l = 1, \ldots, L} \min\left(E_{A_l}, E_{W_l}\right). \tag{5}$$
Proof.
By definition of the reflected quantization noise variance, the EFQN criterion is satisfied if:
$$\Delta_{A_1}^2 E_{A_1} = \cdots = \Delta_{A_L}^2 E_{A_L} = \Delta_{W_1}^2 E_{W_1} = \cdots = \Delta_{W_L}^2 E_{W_L},$$
where the quantization noise gains are given by:
$$E_{A_l} = \mathbb{E}\left[\sum_{i \neq \hat{y}} \frac{\left\|\partial (z_i - z_{\hat{y}}) / \partial A_l\right\|_2^2}{(z_i - z_{\hat{y}})^2}\right], \qquad E_{W_l} = \mathbb{E}\left[\sum_{i \neq \hat{y}} \frac{\left\|\partial (z_i - z_{\hat{y}}) / \partial W_l\right\|_2^2}{(z_i - z_{\hat{y}})^2}\right] \tag{6}$$
for $l = 1, \ldots, L$, where $\{z_i\}$ are the soft outputs and $z_{\hat{y}}$ is the soft output corresponding to the predicted label $\hat{y}$. The expressions for these quantization noise gains are obtained by linearly expanding (across layers) those used in (Sakr et al., 2017). Note that a second-order upper bound is used as a surrogate expression for the mismatch probability.
From the definition of the quantization step size ($\Delta = 2^{-(B-1)}$ for a normalized format), the above is equivalent to:
$$2^{-2(B_{A_1}-1)} E_{A_1} = \cdots = 2^{-2(B_{A_L}-1)} E_{A_L} = 2^{-2(B_{W_1}-1)} E_{W_1} = \cdots = 2^{-2(B_{W_L}-1)} E_{W_L}.$$
Let $E_{\min}$ be as defined in (5). We can divide each term by $E_{\min}$:
$$2^{-2(B_{A_1}-1)} \frac{E_{A_1}}{E_{\min}} = \cdots = 2^{-2(B_{W_L}-1)} \frac{E_{W_L}}{E_{\min}},$$
where each term is positive, so that we can take square roots and logarithms such that the quantities
$$B_{A_l} - 1 - \log_2\sqrt{E_{A_l}/E_{\min}} \quad \text{and} \quad B_{W_l} - 1 - \log_2\sqrt{E_{W_l}/E_{\min}}$$
are all equal. Thus we equate all of the above to a reference precision $B_{\min} - 1$, yielding:
$$B_{A_l} = B_{\min} + \log_2\sqrt{E_{A_l}/E_{\min}}, \qquad B_{W_l} = B_{\min} + \log_2\sqrt{E_{W_l}/E_{\min}}$$
for $l = 1, \ldots, L$. Note that because $E_{\min}$ is the least quantization noise gain, it is equal to one of the above quantization noise gains, so that the corresponding precision actually equals $B_{\min}$. As precisions must be integer valued, each of $B_{\min}$, $B_{A_l}$, and $B_{W_l}$ has to be an integer, and thus a rounding operation is applied to all logarithm terms. Doing so results in (1) from Lemma 1, which completes the proof. ∎
Lemma 2.
The GC criterion holds for a gradient clipping rate of at most $\eta$ provided the weight and activation gradient predefined dynamic ranges (PDRs) are lower bounded as follows:
$$\alpha_{G_W} \geq \hat{\sigma}_{G_W}\, Q^{-1}(\eta/2), \qquad \alpha_{G_A} \geq c\, \hat{\sigma}_{G_A}\, Q^{-1}(\eta/2) \quad (c > 1), \tag{2}$$
where $\hat{\sigma}_{G_W}$ and $\hat{\sigma}_{G_A}$ are the largest ever recorded estimates of the weight and activation gradient standard deviations $\sigma_{G_W}$ and $\sigma_{G_A}$, respectively.
Proof.
Let us consider the case of weight gradients. The GC criterion, by definition, requires:
$$\Pr\left(\left|G^{(W)}\right| > \alpha_{G_W}\right) \leq \eta.$$
Typically, weight gradients are obtained by computing the derivatives of a loss function over a minibatch. By linearity of derivatives, weight gradients are themselves averages of instantaneous derivatives and are hence expected to follow a Gaussian distribution by application of the Central Limit Theorem. Furthermore, the gradient mean was estimated during baseline training and was found to oscillate around zero. Thus
$$\Pr\left(\left|G^{(W)}\right| > \alpha_{G_W}\right) = 2Q\!\left(\frac{\alpha_{G_W}}{\sigma_{G_W}}\right),$$
where we used the fact that a Gaussian distribution is symmetric and $Q(\cdot)$ is the elementary Q-function, which is a decreasing function. Thus, in the worst case, we have:
$$\Pr\left(\left|G^{(W)}\right| > \alpha_{G_W}\right) \leq 2Q\!\left(\frac{\alpha_{G_W}}{\hat{\sigma}_{G_W}}\right).$$
Hence, for a PDR as suggested by the lower bound in (2), $\alpha_{G_W} \geq \hat{\sigma}_{G_W} Q^{-1}(\eta/2)$, we obtain the upper bound:
$$\Pr\left(\left|G^{(W)}\right| > \alpha_{G_W}\right) \leq \eta,$$
which means the GC criterion holds and completes the proof.
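The tail bound is straightforward to check numerically. The sketch below compares the analytical clipping rate $2Q(\alpha/\sigma)$, computed via the complementary error function, against a Monte Carlo estimate; the PDR of $3\sigma$ is an arbitrary choice for illustration.

```python
import math
import numpy as np

def clip_rate(alpha_over_sigma):
    """Analytical clipping rate 2*Q(alpha/sigma) for a zero-mean
    Gaussian, using Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return math.erfc(alpha_over_sigma / math.sqrt(2.0))

rng = np.random.default_rng(1)
g = rng.standard_normal(1_000_000)   # zero-mean gradients, sigma = 1
alpha = 3.0                          # hypothetical PDR, in units of sigma
empirical = float(np.mean(np.abs(g) > alpha))
analytical = clip_rate(alpha)        # 2*Q(3), about 0.27%
```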
For activation gradients, the same reasoning applies, but a larger PDR is chosen in (2) than for weight gradients because the true dynamic range of the activation gradients is larger than the value indicated by the second moment. This stems from the use of activation functions such as ReLU, which make the activation gradients sparse. We also recommend increasing the PDR even further when using regularizers that sparsify gradients, such as Dropout (Srivastava et al., 2014) or Maxout (Goodfellow et al., 2013). ∎
Lemma 3.
The RQB criterion holds, with a relative quantization bias of about $2\%$, provided the weight gradient quantization step size is upper bounded as follows:
$$\Delta_{G_W} \leq \frac{\hat{\sigma}^{(\min)}_{G_W}}{2}, \tag{3}$$
where $\hat{\sigma}^{(\min)}_{G_W}$ is the smallest ever recorded estimate of $\sigma_{G_W}$.
Proof.
For the Gaussian distributed (see proof of Lemma 2) weight gradient at layer $l$, the true mean conditioned on the first non-zero quantization region $[\Delta_{G_W}/2,\, 3\Delta_{G_W}/2]$ is given by:
$$m = \sigma_{G_W} \cdot \frac{\phi\!\left(\frac{\Delta_{G_W}}{2\sigma_{G_W}}\right) - \phi\!\left(\frac{3\Delta_{G_W}}{2\sigma_{G_W}}\right)}{Q\!\left(\frac{\Delta_{G_W}}{2\sigma_{G_W}}\right) - Q\!\left(\frac{3\Delta_{G_W}}{2\sigma_{G_W}}\right)},$$
where $\sigma_{G_W}$ is the standard deviation of $G^{(W)}_l$ and $\phi(\cdot)$ is the standard normal density. By substituting $\Delta_{G_W} = \sigma_{G_W}/2$ into the above expression for $m$ and plugging it into the definition of the relative quantization bias $|\Delta_{G_W} - m|/m$, we obtain a value of about $2\%$. Hence, this choice of the quantization step satisfies the RQB. In order to ensure the RQB holds throughout training, $\hat{\sigma}^{(\min)}_{G_W}$ is used in Lemma 3. This completes the proof. ∎
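The conditional-mean expression is easy to verify numerically. The sketch below evaluates it at the $\Delta_{G_W} = \sigma_{G_W}/2$ operating point and cross-checks it with a Monte Carlo estimate; all concrete values are illustrative.

```python
import math
import numpy as np

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # normal pdf
Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))                # Gaussian tail

def first_bin_mean(step, sigma):
    """Mean of a zero-mean Gaussian conditioned on the first non-zero
    quantization region [step/2, 3*step/2] (values rounding to `step`)."""
    a, b = step / (2 * sigma), 3 * step / (2 * sigma)
    return sigma * (phi(a) - phi(b)) / (Q(a) - Q(b))

sigma, step = 1.0, 0.5                    # step = sigma / 2
m = first_bin_mean(step, sigma)           # conditional mean, slightly below step
rqb = abs(step - m) / m                   # relative quantization bias, ~2%
# Monte Carlo check of the conditional mean:
rng = np.random.default_rng(0)
g = rng.standard_normal(2_000_000) * sigma
sel = g[(g >= step / 2) & (g < 3 * step / 2)]   # samples in the first bin
```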
Lemma 4.
The BQN criterion holds provided the activation gradient quantization step size is upper bounded as follows:
$$\Delta_{G_A} \leq \Delta_{G_W} \min_{l = 1, \ldots, L} \left(\frac{N_{W_l}}{N_{A_{l+1}}}\right)^{1/4} \frac{1}{\sqrt{\lambda_l}}, \tag{4}$$
where $\lambda_l$ is the largest singular value of the square-Jacobian (Jacobian matrix with squared entries) of $G^{(W)}_l$ with respect to $G^{(A)}_{l+1}$, and $N_{W_l}$ and $N_{A_{l+1}}$ are the sizes of these two tensors.
Proof.
Let us unroll $G^{(W)}_l$ and $G^{(A)}_{l+1}$ into vectors of size $N_{W_l}$ and $N_{A_{l+1}}$, respectively. The elementwise quantization noise variance of each weight gradient is $\Delta_{G_W}^2/12$. Therefore we have the vector of direct quantization noise variances:
$$v_W = \frac{\Delta_{G_W}^2}{12}\, \mathbf{1}_{N_{W_l}}.$$
The reflected quantization noise variance from an activation gradient onto a weight gradient is obtained by propagating the elementwise noise variance through the Jacobian, where cross products of quantization noise are neglected (Sakr et al., 2017). Hence, the reflected quantization noise variance, elementwise, from $G^{(A)}_{l+1}$ onto $G^{(W)}_l$ is given by:
$$v_{A \to W} = \frac{\Delta_{G_A}^2}{12}\, J^{(sq)}_l\, \mathbf{1}_{N_{A_{l+1}}},$$
where $J^{(sq)}_l$ is the square-Jacobian of $G^{(W)}_l$ with respect to $G^{(A)}_{l+1}$ and $\mathbf{1}$ denotes the all-one vector with size denoted by its subscript. Hence, we have:
$$\left\|v_{A \to W}\right\|_2 \leq \frac{\Delta_{G_A}^2}{12}\, \lambda_l \sqrt{N_{A_{l+1}}}.$$
Requiring this reflected noise not to exceed $\left\|v_W\right\|_2 = \frac{\Delta_{G_W}^2}{12}\sqrt{N_{W_l}}$ at every layer yields the upper bound (4) on $\Delta_{G_A}$, which completes the proof. ∎
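The square-Jacobian bookkeeping can be sanity-checked on a toy linear map (the dimensions and step size below are arbitrary): pushing i.i.d. uniform quantization noise through a Jacobian $J$ yields elementwise variances $(\Delta^2/12)(J * J)\mathbf{1}$, whose norm is bounded via the largest singular value of the square-Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((5, 8))   # toy Jacobian of G_W w.r.t. G_A
step = 2.0 ** -6                  # activation-gradient quantization step
# Predicted elementwise reflected variances: (step^2 / 12) * (J*J) @ 1
predicted = (step ** 2 / 12) * (J * J) @ np.ones(8)
# Monte Carlo: propagate uniform quantization noise through J
noise = rng.uniform(-step / 2, step / 2, size=(200_000, 8))
empirical = (noise @ J.T).var(axis=0)
# Bound the norm of the reflected variances using the largest singular
# value of the square-Jacobian times the norm of the all-one vector
lam = np.linalg.svd(J * J, compute_uv=False)[0]
bound = (step ** 2 / 12) * lam * np.sqrt(8)
```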