Though deep neural networks (DNNs) have established themselves as powerful predictive models achieving human-level accuracy on many machine learning tasks (He et al., 2016), their excellent performance has been achieved at the expense of very high computational and parameter complexity. For instance, AlexNet (Krizhevsky et al., 2012) requires over multiply-accumulates (MACs) per image and has 60 million parameters, while DeepFace (Taigman et al., 2014) requires over MACs/image and involves more than 120 million parameters. DNNs’ enormous computational and parameter complexity leads to high energy consumption (Chen et al., 2017), makes their training via the stochastic gradient descent (SGD) algorithm very slow, often requiring hours or days (Goyal et al., 2017), and inhibits their deployment on energy- and resource-constrained platforms such as mobile devices and autonomous agents.
A fundamental problem contributing to the high computational and parameter complexity of DNNs is their realization using 32-b floating-point (FL) arithmetic in GPUs and CPUs. Reduced-precision representations such as quantized FL (QFL) and fixed-point (FX) have been employed in various combinations for both training and inference. Many works employ FX during inference but train in FL, e.g., fully binarized neural networks (Hubara et al., 2016) use 1-b FX in the forward inference path but the network is trained in 32-b FL. Similarly, Gupta et al. (2015) employ 16-b FX for all tensors except for the internal accumulators, which use 32-b FL, and 3-level QFL gradients were employed (Wen et al., 2017; Alistarh et al., 2017) to accelerate training in a distributed setting. Note that while QFL reduces storage and communication costs, it does not reduce the computational complexity as the arithmetic remains in 32-b FL.
Thus, none of the previous works address the fundamental problem of realizing true fixed-point DNN training, i.e., an SGD algorithm in which all parameters/variables and all computations are implemented in FX with minimum precision required to guarantee the network’s inference/prediction accuracy and training convergence. The reasons for this gap are numerous including: 1) quantization errors propagate to the network output thereby directly affecting its accuracy (Lin et al., 2016); 2) precision requirements of different variables in a network are interdependent and involve hard-to-quantify trade-offs (Sakr et al., 2017); 3) proper quantization requires the knowledge of the dynamic range which may not be available (Pascanu et al., 2013); and 4) quantization errors may accumulate during training and can lead to stability issues (Gupta et al., 2015).
Our work makes a major advance in closing this gap by proposing a systematic methodology to obtain close-to-minimum per-layer precision requirements of an FX network that guarantees statistical similarity with full precision training. In particular, we jointly address the challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability. As in (Sakr et al., 2017), we do assume that a fully-trained baseline FL network exists and one can observe its learning behavior. While, in principle, such an assumption requires extra FL computation prior to FX training, it is to be noted that much of training is done in FL anyway. For instance, FL training is used in order to establish benchmarking baselines such as AlexNet (Krizhevsky et al., 2012), VGG-Net (Simonyan and Zisserman, 2014), and ResNet (He et al., 2016), to name a few. Even if that is not the case, in practice, this assumption can be accounted for via a warm-up FL training on a small held-out portion of the dataset (Dwork et al., 2015).
Applying our methodology to three benchmarks reveals several lessons. First and foremost, our work shows that it is possible to FX quantize all variables including back-propagated gradients even though their dynamic range is unknown (Köster et al., 2017). Second, we find that the per-layer weight precision requirements decrease from the input to the output while those of the activation gradients and weight accumulators increase. Furthermore, the precision requirements for residual networks are found to be uniform across layers. Finally, hyper-precision reduction techniques such as weight and activation binarization (Hubara et al., 2016) or gradient ternarization (Wen et al., 2017) are not as efficient as our methodology since these do not address the fundamental problem of realizing true fixed-point DNN training.
We demonstrate FX training on three deep learning benchmarks (CIFAR-10, CIFAR-100, SVHN), achieving high fidelity to our FL baseline in that we observe no loss of accuracy higher than 0.56% in all of our experiments. Our precision assignment is further shown to be within 1-b per-tensor of the minimum. We show that our precision assignment methodology reduces representational, computational, and communication costs of training by factors of up to 6, 8, and 4, respectively, compared to the FL baseline and related works.
2 Problem Setup, Notation, and Metrics
We consider a -layer DNN deployed on a -class classification task using the setup in Figure 1. We denote the precision configuration as the matrix whose row consists of the precision (in bits) of weight (), activation (), weight gradient (), activation gradient (), and internal weight accumulator () tensors at layer . This DNN quantization setup is summarized in Appendix A.
2.1 Fixed-point Constraints & Definitions
We present definitions/constraints related to fixed-point arithmetic based on the design of fixed-point adaptive filters and signal processing systems (Parhi, 2007):
A signed fixed-point scalar with precision and binary representation is equal to: , where is the predetermined dynamic range (PDR) of . The PDR is constrained to be a constant power of 2 to minimize hardware overhead.
An unsigned fixed-point scalar with precision and binary representation is equal to:
A fixed-point scalar is called normalized if .
The precision is determined as: , where is the quantization step size which is the value of the least significant bit (LSB).
An additive model for quantization is assumed: , where is the fixed-point number obtained by quantizing the floating-point scalar , , and the quantization noise variance is . The notion of quantization noise is most useful when there is limited knowledge of the distribution of .
The relative quantization bias is the offset: , where the first unbiased quantization level and . The notion of quantization bias is useful when there is some knowledge of the distribution of .
The reflected quantization noise variance from a tensor to a scalar , for an arbitrary function , is : , where is the quantization step of and is the quantization noise gain from to .
The clipping rate of a tensor is the probability: , where is the PDR of .
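The definitions above can be illustrated with a small sketch. Because the formulas did not survive in this copy of the text, the conventions used here are assumptions based on standard signed fixed-point arithmetic: for a PDR `pdr` and precision `bits`, the step size is taken as `pdr * 2**-(bits - 1)`, values are rounded to the nearest level and clipped to the representable range, and the uniform-noise model predicts a variance of `delta**2 / 12`. All function names are hypothetical.

```python
import math
import random


def step_size(pdr: float, bits: int) -> float:
    """LSB of a signed format (assumed identity: bits = log2(pdr / delta) + 1)."""
    return pdr * 2.0 ** (-(bits - 1))


def quantize_signed(x: float, pdr: float, bits: int) -> float:
    """Round x to the nearest level, then clip to [-pdr, pdr - delta]."""
    delta = step_size(pdr, bits)
    level = math.floor(x / delta + 0.5)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return delta * max(lo, min(hi, level))


def clipping_rate(samples, pdr: float) -> float:
    """Empirical fraction of samples whose magnitude reaches the PDR."""
    return sum(abs(s) >= pdr for s in samples) / len(samples)


if __name__ == "__main__":
    random.seed(0)
    pdr, bits = 1.0, 8  # the PDR is a power of 2, as required above
    delta = step_size(pdr, bits)
    xs = [random.uniform(-0.9, 0.9) for _ in range(20000)]
    errs = [quantize_signed(x, pdr, bits) - x for x in xs]
    measured = sum(e * e for e in errs) / len(errs)
    print("step size:", delta)
    print("measured noise variance:", measured)
    print("uniform-noise model delta^2 / 12:", delta ** 2 / 12)
    print("clipping rate:", clipping_rate(xs, pdr))
```

When the input stays inside the PDR, the measured error variance should be close to the uniform-noise prediction; once samples exceed the PDR, clipping errors dominate and the additive model no longer applies.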
2.2 Complexity Metrics
We use a set of metrics inspired by those introduced by Sakr et al. (2017) which have also been used by Wu et al. (2018a). These metrics are algorithmic in nature which makes them easily reproducible.
Representational Cost for weights () and activations ():
which equals the total number of bits needed to represent the weights, weight gradients, and internal weight accumulators (), and those for activations and activation gradients (). (Footnote 1: We use the notation to denote the number of elements in tensor .) Unquantized tensors are assumed to have a 32-b FL representation, which is the single-precision format in a GPU.
Computational Cost of training: where is the dimensionality of the dot product needed to compute one output activation at layer . This cost is a measure of the number of 1-b full adders (FAs) utilized for all multiplications in one back-prop iteration. (Footnote 2: When considering 32-b FL multiplications, we ignore the cost of exponent addition, thereby favoring the FL (conventional) implementation. Boundary effects (in convolutions) are neglected.)
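As an illustration of the representational cost, the sketch below simply sums (number of elements) x (bits per element) over the tensors on the weight side (weights, weight gradients, accumulators) and on the activation side (activations, activation gradients), and compares against a 32-b FL baseline. The two-layer shapes and precisions are made up for illustration, and the function names are hypothetical; the computational cost is not sketched because its expression is not reproduced in this copy.

```python
from math import prod

FL_BITS = 32  # unquantized tensors use 32-b floating-point, as stated above


def total_bits(tensors):
    """Sum of (number of elements) x (bits) over a list of (shape, bits) pairs."""
    return sum(prod(shape) * bits for shape, bits in tensors)


def representational_costs(weight_side, activation_side):
    """Return (weight-side bits, activation-side bits)."""
    return total_bits(weight_side), total_bits(activation_side)


if __name__ == "__main__":
    # Toy two-layer network; all shapes and precisions are illustrative only.
    w_shapes = [(16, 3, 3, 3), (10, 16)]
    a_shapes = [(16, 32, 32), (10,)]
    weight_side = []
    for shape, (b_w, b_gw, b_acc) in zip(w_shapes, [(8, 10, 16), (6, 10, 18)]):
        # weights, weight gradients, and accumulators share the same shape
        weight_side += [(shape, b_w), (shape, b_gw), (shape, b_acc)]
    activation_side = []
    for shape, (b_a, b_ga) in zip(a_shapes, [(6, 8), (6, 10)]):
        activation_side += [(shape, b_a), (shape, b_ga)]
    fx_w, fx_a = representational_costs(weight_side, activation_side)
    fl_w = total_bits([(s, FL_BITS) for s, _ in weight_side])
    fl_a = total_bits([(s, FL_BITS) for s, _ in activation_side])
    print("weight-side reduction vs FL:", fl_w / fx_w)
    print("activation-side reduction vs FL:", fl_a / fx_a)
```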
3 Precision Assignment Methodology and Analysis
We aim to obtain a minimal or close-to-minimal precision configuration of an FX network such that the mismatch probability between its predicted label () and that of an associated FL network () is bounded, and the convergence behavior of the two networks is similar.
Hence, we require that: (1) all quantization noise sources in the forward path contribute identically to the mismatch budget (Sakr et al., 2017), (2) the gradients be properly clipped in order to limit the dynamic range (Pascanu et al., 2013), (3) the accumulation of quantization noise bias in the weight updates be limited (Gupta et al., 2015), (4) the quantization noise in activation gradients be limited as these are back-propagated to calculate the weight gradients, and (5) the precision of weight accumulators should be set so as to avoid premature stoppage of convergence (Goel and Shanbhag, 1998). The above insights can be formally described via the following five quantization criteria.
Equalizing Feedforward Quantization Noise (EFQN) Criterion. The reflected quantization noise variances onto the mismatch probability from all feedforward weights () and activations () should be equal:
Gradient Clipping (GC) Criterion. The clipping rates of weight () and activation () gradients should be less than a maximum value :
Relative Quantization Bias (RQB) Criterion. The relative quantization bias of weight gradients () should be less than a maximum value :
Back-propagated Quantization Noise (BQN) Criterion. The reflected quantization noise variance , i.e., the total sum of element-wise variances of reflected from quantizing , should be less than :
where is the total sum of element-wise variances of .
Accumulator Stopping (AS) Criterion. The quantization noise of the internal accumulator should be zero, equivalently:
where is the reflected quantization noise variance from to , its total sum of element-wise variances.
Further explanations and motivations behind the above criteria are presented in Appendix B. The following claim ensures the satisfiability of the above criteria. This leads to closed form expressions for the precision requirements we are seeking and completes our methodology. The validity of the claim is proved in Appendix C.
Satisfiability of Quantization Criteria. The five quantization criteria (EFQN, GC, RQB, BQN, AS) are satisfied if:
The precisions and are set as follows:
for , where denotes the rounding operation, and are the weight and activation quantization noise gains at layer , respectively, is a reference minimum precision, and .
The weight and activation gradients quantization step sizes are upper bounded as follows:
where is the smallest recorded estimate of and is the largest singular value of the square-Jacobian (Jacobian matrix with squared entries) of with respect to .
The accumulator PDR and step size satisfy:
where is the smallest value of the learning rate used during training.
Practical considerations: Note that one of the feedforward precisions will equal . The formulas to compute the quantization noise gains are given in Appendix C and require only one forward-backward pass on an estimation set. We would like the EFQN criterion to hold upon convergence; hence, (1) is computed using the converged model from the FL baseline. For backward signals, setting the values of PDR and LSB is sufficient to determine the precision using the identity , as explained in Section 2.1. As per Claim 1, estimates of the second order statistics, e.g., and , of the gradient tensors, are required. These are obtained via tensor spatial averaging, so that one estimate per tensor is required, and updated in a moving window fashion, as is done for normalization parameters in BatchNorm (Ioffe and Szegedy, 2015). Furthermore, it might seem that computing the Jacobian in (3) is a difficult task; however, the values of its elements are already computed by the back-prop algorithm, requiring no additional computations (see Appendix C). Thus, the Jacobians (at different layers) are also estimated during training. Because modern neural networks are typically very large, we average the Jacobians spatially, i.e., the activations are aggregated across channels and mini-batches while weights are aggregated across filters. This is again inspired by the work on Batch Normalization (Ioffe and Szegedy, 2015) and makes the probed Jacobians much smaller.
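To make the feedforward part of the claim more concrete, the following sketch shows one assignment rule that is consistent with the description above; it is not a reproduction of expression (1), which is missing from this copy. It assumes that the reflected noise variance of a tensor scales as its noise gain times the square of its step size, so each extra bit divides the variance by 4. Equalizing the variances then gives each tensor `round(log2(sqrt(gain / min_gain)))` bits on top of the reference precision, and the tensor with the smallest gain receives exactly the reference precision. The function name and the example gains are hypothetical.

```python
import math


def efqn_precisions(noise_gains, b_min):
    """Assign bits so that gain * step^2 is roughly equal across tensors.

    Assumes step size halves with each extra bit; Python's round() is used,
    which rounds exact halves to the nearest even integer.
    """
    e_min = min(noise_gains)
    return [b_min + round(math.log2(math.sqrt(e / e_min))) for e in noise_gains]


if __name__ == "__main__":
    gains = [1.0, 4.0, 16.0, 70.0]  # illustrative noise gains, one per tensor
    print(efqn_precisions(gains, b_min=4))
```

A gain that is 4 times larger costs one additional bit, which is why tensors with large noise gains end up with the highest precision.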
4 Numerical Results
We conduct numerical simulations in order to illustrate the validity of the predicted precision configuration and investigate its minimality and benefits. We employ three deep learning benchmarking datasets: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton, 2009), and SVHN (Netzer et al., 2011). All experiments were done using a Pascal P100 NVIDIA GPU. We train the following networks:
SVHN ConvNet: the same network as the CIFAR-10 ConvNet, but trained on the SVHN dataset.
CIFAR-100 ResNet: same network as CIFAR-10 ResNet save for the last layer to match the number of classes (100) in CIFAR-100.
A step by step description of the application of our method to the above four networks is provided in Appendix E. We hope the inclusion of these steps would: (1) clarify any ambiguity the reader may have from the previous section and (2) facilitate the reproduction of our results.
4.1 Precision Configuration & Convergence
represents the average number of bits per tensor type. For the ResNets, layer depths 21 and 22 correspond to the strided convolutions in the shortcut connections of residual blocks 4 and 7, respectively. Activation gradients go from layer 2 to and are “shifted to the left” in order to be aligned with the other tensors.
The precision configuration , with target , , and , via our proposed method is depicted in Figure 2 for each of the four networks considered. We observe that is dependent on the network type. Indeed, the precisions of the two ConvNets follow similar trends, as do those of the two ResNets. Furthermore, the following observations are made for the ConvNets:
weight precision decreases as depth increases. This is consistent with the observation that weight perturbations in the earlier layers are the most destructive (Raghu et al., 2017).
the precisions of activation gradients () and internal weight accumulators () increase as depth increases, which we interpret as follows: (1) the back-propagation of gradients is the dual of the forward-propagation of activations, and (2) accumulators store the most information as their precision is the highest.
the precisions of the weight gradients () and activations () are relatively constant across layers.
Interestingly, for ResNets, the precision is mostly uniform across the layers. Furthermore, the gap between and the other precisions is not as pronounced as in the case of ConvNets. This suggests that information is spread equally among all signals which we speculate is due to the shortcut connections preventing the shattering of information (Balduzzi et al., 2017).
FX training curves in Figure 3 indicate that leads to convergence and consistently track FL curves with close fidelity. This validates our analysis and justifies the choice of .
4.2 Near Minimality of
To determine that is a close-to-minimal precision assignment, we compare it with: (a) and (b) where is an matrix with each entry equal to 1 (Footnote 3: PDRs are unchanged across configurations, except for as per (4)), i.e., we perturb by 1-b in either direction. Figure 3 also contains the convergence curves for the two new configurations. As shown, always results in a noticeable gap compared to for both the loss function (except for the CIFAR-10 ResNet) and the test error. Furthermore, offers no observable improvements over (except for the test error of CIFAR-10 ConvNet). These results support our contention that is close-to-minimal in that increasing the precision above leads to diminishing returns while reducing precision below leads to a noticeable degradation in accuracy. Additional experimental results provided in Appendix D support our contention regarding the near minimality of . Furthermore, by studying the impact of quantizing specific tensors, we determine that the accuracy is most sensitive to the precision assigned to weights and activation gradients.
[Figure panels: CIFAR-10 ConvNet, SVHN ConvNet, CIFAR-10 ResNet, CIFAR-100 ResNet.]
4.3 Complexity vs. Accuracy
We would like to quantify the reduction in training cost and expense in terms of accuracy resulting from our proposed method and compare them with those of other methods. Importantly, for a fair comparison, the same network architecture and training procedure are used. We report , , , , and test error, for each of the four networks considered for the following training methods:
baseline FL training and FX training using ,
fixed-point training with stochastic quantization (SQ). As was done in (Gupta et al., 2015), we quantize feedforward weights and activations as well as all gradients, but accumulators are kept in floating-point. The precision configuration (excluding accumulators) is inherited from (hence we determine exactly how much stochastic quantization helps),
training with ternarized gradients (TG) as was done in TernGrad (Wen et al., 2017). All computations are done in floating-point but weight gradients are ternarized according to the instantaneous tensor spatial standard deviations as was suggested by Wen et al. (2017). To compute costs, we assume all weight gradients use two bits although they are not really fixed-point and do require computation of 32-b floating-point scalars for every tensor.
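For readers unfamiliar with stochastic quantization, the sketch below shows the generic stochastic-rounding idea: a value is rounded up with probability equal to its fractional position between two adjacent levels, which makes the quantizer unbiased on average. This is a minimal illustration of the general technique and not the exact procedure of Gupta et al. (2015); clipping to a PDR is omitted, and the function name and parameters are hypothetical.

```python
import math
import random


def stochastic_round(x: float, delta: float, rng: random.Random) -> float:
    """Round x to a multiple of delta, rounding up with probability frac."""
    lower = math.floor(x / delta)
    frac = x / delta - lower          # position between the two levels, in [0, 1)
    return delta * (lower + (1 if rng.random() < frac else 0))


if __name__ == "__main__":
    rng = random.Random(0)
    delta, x = 0.25, 0.075            # x sits 30% of the way from 0 to delta
    draws = [stochastic_round(x, delta, rng) for _ in range(20000)]
    nearest = delta * math.floor(x / delta + 0.5)
    print("round-to-nearest result:", nearest)         # biased toward 0 here
    print("mean of stochastic draws:", sum(draws) / len(draws))
```

Round-to-nearest maps every such input to the same level, so its error accumulates over iterations, whereas the stochastic draws average out to the input value at the cost of generating a random number per quantized element.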
The comparison is presented in Table 1. The first observation is a massive complexity reduction compared to FL. For instance, for the CIFAR-10 ConvNet, the complexity reduction is (), (), (), and () for , , , and , respectively. Similar trends are observed for the other three networks. Such complexity reduction comes at the expense of no more than a 0.56% increase in test error. For the CIFAR-100 network, the accuracy when training in fixed-point is even better than that of the baseline.
The representational and communication costs of binarized network (BN) training are significantly greater than those of FX because the gradients and accumulators are kept in full precision, which masks the benefits of binarizing feedforward tensors. However, benefits are noticeable when considering the computational cost, which is lowest as binarization eliminates multiplications. Furthermore, binarization causes a severe accuracy drop for the ConvNets but surprisingly not for the ResNets. We speculate that this is due to the high dimensional geometry of ResNets (Anderson and Berg, 2017).
As for SQ, since was inherited, all costs are identical to FX, save for which is larger due to full precision accumulators. Furthermore, SQ has a positive effect only on the CIFAR-10 ConvNet where it clearly acted as a regularizer.
TG does not provide complexity reductions in terms of representational and computational costs which is expected as it only compresses weight gradients. Additionally, the resulting accuracy is slightly worse than that of all other considered schemes, including FX. Naturally, it has the lowest communication cost as weight gradients are quantized to just 2-b.
5.1 Related Works
Many works have addressed the general problem of reduced precision/complexity deep learning.
Reducing the complexity of inference (forward path): several research efforts have addressed the problem of realizing a DNN’s inference path in FX. For instance, the works in (Lin et al., 2016; Sakr et al., 2017) address the problem of precision assignment. While Lin et al. (2016) proposed a non-uniform precision assignment using the signal-to-quantization-noise ratio (SQNR) metric, Sakr et al. (2017) analytically quantified the trade-off between activation and weight precisions while providing minimal precision requirements of the inference path computations that bounds the probability of a mismatch between predicted labels of the FX and its FL counterpart. An orthogonal approach which can be applied on top of quantization is pruning (Han et al., 2015). While significant inference efficiency can be achieved, this approach incurs a substantial training overhead. A subset of the FX training problem was addressed in binary weighted neural networks (Courbariaux et al., 2015; Rastegari et al., 2016) and fully binarized neural networks (Hubara et al., 2016), where direct training of neural networks with pre-determined precisions in the inference path was explored with the feedback path computations being done in 32-b FL.
Reducing the complexity of training (backward path): finite-precision training was explored in (Gupta et al., 2015), which employed stochastic quantization in order to counter quantization bias accumulation in the weight updates. This was done by quantizing all tensors to 16-b FX, except for the internal accumulators which were stored in a 32-b floating-point format. An important distinction our work makes is the circumvention of the overhead of implementing stochastic quantization (Hubara et al., 2016). Similarly, DoReFa-Net (Zhou et al., 2016) stores internal weight representations in 32-b FL, but quantizes the remaining tensors more aggressively. Thus arises the need to re-scale and re-compute in floating-point format, which our work avoids. Finally, Köster et al. (2017) suggest a new number format, Flexpoint, and were able to train neural networks using slightly more than 16-b per tensor element, with 5 shared exponent bits and a per-tensor dynamic range tracking algorithm. Such tracking causes a hardware overhead bypassed by our work since the arithmetic is purely FX. Augmenting Flexpoint with stochastic quantization effectively results in WAGE (Wu et al., 2018b), and enables integer quantization of each tensor.
As seen above, none of the prior works address the problem of predicting precision requirements of all training signals. Furthermore, the choice of precision is made in an ad-hoc manner. In contrast, we propose a systematic methodology to determine close-to-minimal precision requirements for FX-only training of deep neural networks.
In this paper, we have presented a study of precision requirements in a typical back-propagation based training procedure of neural networks. Using a set of quantization criteria, we have presented a precision assignment methodology for which FX training is made statistically similar to the FL baseline, known to converge a priori. We realized FX training of four networks on the CIFAR-10, CIFAR-100, and SVHN datasets and quantified the associated complexity reduction gains in terms of training costs. We also showed that our precision assignment is nearly minimal.
The presented work relies on the statistics of all tensors being quantized during training. This necessitates an initial baseline run in floating-point which can be costly. An open problem is to predict a suitable precision configuration by only observing the data statistics and the network architecture. Future work can leverage the analysis presented in this paper to enhance the effectiveness of other network complexity reduction approaches. For instance, weight pruning can be viewed as a coarse quantization process (quantize to zero) and thus can potentially be done in a targeted manner by leveraging the information provided by noise gains. Furthermore, parameter sharing and clustering can be viewed as a form of vector quantization which presents yet another opportunity to leverage our method for complexity reduction.
This work was supported in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.
- Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718.
- Anderson and Berg (2017) Anderson, A. G. and Berg, C. P. (2017). The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199.
- Balduzzi et al. (2017) Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W.-D., and McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, pages 342–350.
- Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- Chen et al. (2017) Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. (2017). Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138.
- Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131.
- Dwork et al. (2015) Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. (2015). The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638.
- Goel and Shanbhag (1998) Goel, M. and Shanbhag, N. (1998). Finite-precision analysis of the pipelined strength-reduced adaptive filter. Signal Processing, IEEE Transactions on, 46(6):1763–1769.
- Goodfellow et al. (2013) Goodfellow, I. J. et al. (2013). Maxout networks. ICML (3), 28:1319–1327.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning with limited numerical precision. In Proceedings of The 32nd International Conference on Machine Learning, pages 1737–1746.
- Han et al. (2015) Han, S., Mao, H., and Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
- Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115.
- Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
- Köster et al. (2017) Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K., Constable, W., Elibol, O., Hall, S., Hornof, L., Khosrowshahi, A., et al. (2017). Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pages 1740–1750.
- Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
- Lin et al. (2016) Lin, D., Talathi, S., and Annapureddy, S. (2016). Fixed point quantization of deep convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning, pages 2849–2858.
- Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5.
- Parhi (2007) Parhi, K. (2007). VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons.
- Pascanu et al. (2013) Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
- Raghu et al. (2017) Raghu, M. et al. (2017). On the expressive power of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2847–2854.
- Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer.
- Sakr et al. (2017) Sakr, C., Kim, Y., and Shanbhag, N. (2017). Analytical guarantees on numerical precision of deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3007–3016.
- Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Srivastava et al. (2014) Srivastava, N. et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
- Taigman et al. (2014) Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708.
- Tyurin (2010) Tyurin, I. S. (2010). Refinement of the upper bounds of the constants in lyapunov’s theorem. Russian Mathematical Surveys, 65(3):586–588.
- Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. (2017). TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518.
- Wu et al. (2018a) Wu, J., Wang, Y., Wu, Z., Wang, Z., Veeraraghavan, A., and Lin, Y. (2018a). Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In International Conference on Machine Learning, pages 5359–5368.
- Wu et al. (2018b) Wu, S., Li, G., Chen, F., and Shi, L. (2018b). Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.
- Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
- Zhou et al. (2016) Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. (2016). DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
Appendix A Summary of Quantization Setup
The quantization setup depicted in Figure 1 is summarized as follows:
Feedforward computation at layer :
where is the function implemented at layer , () is the activation tensor at layer () quantized to a normalized unsigned fixed-point format with precision (), and is the weight tensor at layer quantized to a normalized signed fixed-point format with precision at every iteration.
Back-propagation of activation gradients at layer :
where is the function that back-propagates the activation gradients at layer , () is the activation gradient tensor at layer () quantized to a signed fixed-point format with precision ().
Back-propagation of weight gradient tensor at layer :
where is the function that back-propagates the weight gradients at layer , and is quantized to a signed fixed-point format with precision .
Internal weight accumulator update at layer :
where is the update function, is the learning rate, and is the internal weight accumulator tensor at layer quantized to signed fixed-point with precision . Note that, for the next iteration, is directly obtained from via quantization to bits.
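The accumulator update and the re-quantization of the weights can be sketched for a single scalar weight as follows. This is a minimal sketch under stated assumptions: the update function is taken to be plain SGD (the actual update function may differ), the step size follows the signed convention `pdr * 2**-(bits - 1)`, and all names and numeric values are hypothetical. It also shows why the accumulator precision matters: when the update is smaller than half the accumulator LSB, round-to-nearest discards it and training stalls.

```python
import math


def quantize(x: float, pdr: float, bits: int) -> float:
    """Round-to-nearest signed fixed-point quantizer with clipping."""
    delta = pdr * 2.0 ** (-(bits - 1))
    level = math.floor(x / delta + 0.5)
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return delta * max(lo, min(hi, level))


def sgd_step(w_acc, grad, lr, pdr, acc_bits, w_bits):
    """Update the accumulator, then derive the feedforward weight from it."""
    w_acc = quantize(w_acc - lr * grad, pdr, acc_bits)  # accumulator precision
    w = quantize(w_acc, pdr, w_bits)                    # coarser weight precision
    return w_acc, w


if __name__ == "__main__":
    for acc_bits in (8, 16):
        w_acc = 0.25
        for _ in range(10):
            w_acc, w = sgd_step(w_acc, grad=0.05, lr=0.01,
                                pdr=1.0, acc_bits=acc_bits, w_bits=8)
        print(f"acc_bits={acc_bits}: accumulator={w_acc}, weight={w}")
```

With an 8-b accumulator the update of 0.0005 never registers, whereas the 16-b accumulator collects the small updates until they move the 8-b feedforward weight by one level.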
Appendix B Further Explanations and Motivations behind Quantization Criteria
Criterion 1 (EFQN) is used to ensure that all feedforward quantization noise sources contribute equally to the budget. Indeed, if one of the reflected quantization noise variances from the feedforward tensors onto , say for , largely dominates all others, it would imply that all tensors but are overly quantized. It would therefore be necessary to either increase the precision of or decrease the precisions of all other tensors. The application of Criterion 1 (EFQN) through the closed form expression (1) in Claim 1 solves this issue avoiding the need for a trial-and-error approach.
Because FX numbers require a constant PDR, gradients must be clipped since their dynamic range is arbitrary. Ideally, a very small PDR would be preferred in order to obtain quantization steps of small magnitude, and hence less quantization noise. We can draw parallels from signal processing theory, where it is known that for a given quantizer with $B$ bits, the signal-to-quantization-noise ratio (SQNR) in dB is approximately $6B + 4.78 - \mathrm{PAR}$, where PAR is the peak-to-average ratio (in dB), which increases with the PDR. Thus, we would like to reduce the PDR as much as possible in order to increase the SQNR for a given precision. However, this comes at the risk of overflows (due to clipping). Criterion 2 (GC) addresses this trade-off between quantization noise and overflow errors.
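The dependence of the SQNR on the PDR can be checked with the standard uniform-noise model, in which a quantizer with step size $\Delta$ has noise power $\Delta^2/12$. The sketch below assumes a signed $B$-bit quantizer spanning [-PDR, PDR) and ignores the noise caused by clipping; the function name `sqnr_db` and the sample values are illustrative.

```python
import math

def sqnr_db(bits, pdr, sigma):
    """SQNR in dB of a signed uniform quantizer over [-pdr, pdr) for a signal of std sigma."""
    step = 2.0 * pdr / 2 ** bits        # quantization step size
    noise_power = step ** 2 / 12.0      # uniform quantization noise model
    return 10.0 * math.log10(sigma ** 2 / noise_power)

# Doubling the PDR at a fixed precision costs about 6 dB of SQNR.
for ratio in (2.0, 4.0, 8.0):
    print(f"PDR = {ratio:.0f} sigma: SQNR = {sqnr_db(8, ratio, 1.0):.2f} dB")
```

Under this model, each doubling of the PDR has the same effect on the SQNR as removing one bit of precision, which is why a tight PDR is attractive as long as clipping stays rare.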
Since the back-propagation training procedure is iterative, it is important to ensure that no form of bias corrupts the weight update accumulation in a positive feedback manner. FX quantization, being uniform, is likely to induce such bias when the quantized quantities, most notably gradients, are not uniformly distributed. Criterion 3 (RQB) addresses this issue by using a function of the quantization step size as a proxy for this bias accumulation and ensuring that its worst-case value is small in magnitude.
Criterion 4 (BQN) is in fact an extension of Criterion 1 (EFQN), but for the back-propagation phase. Indeed, once the precision (and hence quantization noise) of weight gradients is set as per Criterion 3 (RQB), it is necessary to ensure that the quantization noise source at the activation gradients does not contribute more noise to the updates. This criterion sets the quantization step of the activation gradients.
Criterion 5 (AS) ties together feedforward and gradient precisions through the weight accumulators. The feedforward weights must be incremented or decremented whenever the accumulated updates cross over the weight quantization threshold. This requirement is used to set the PDR of the weight accumulators. Furthermore, since the precision of weight gradients has already been designed to account for quantization noise (through Criteria 2-4), the criterion requires that the accumulators do not cause additional noise.
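The role of the accumulators can be illustrated with a toy experiment in which a constant small update is applied repeatedly. This is only a sketch under assumed values: the step sizes, the update magnitude, and the helper names are hypothetical and not taken from the paper.

```python
import math

def quantize(x, step):
    """Round x to the nearest multiple of step."""
    return math.floor(x / step + 0.5) * step

def weight_trajectory(acc_step, w_step=2.0 ** -3, update=2.0 ** -6, n_steps=8):
    """Apply a constant update to an accumulator and record the quantized weight."""
    acc, weights = 0.0, []
    for _ in range(n_steps):
        acc = quantize(acc + update, acc_step)   # accumulator keeps its own precision
        weights.append(quantize(acc, w_step))    # feedforward weight seen by the network
    return weights

fine = weight_trajectory(acc_step=2.0 ** -8)    # accumulator finer than the update
coarse = weight_trajectory(acc_step=2.0 ** -3)  # accumulator as coarse as the weight
print(fine)
print(coarse)
```

With the fine accumulator, the feedforward weight changes once the accumulated updates cross the weight quantization threshold. With the coarse accumulator, every update is rounded away and the weight never moves, which is the failure mode the criterion is meant to rule out.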
Appendix C Proof of Claim 1
The validity of Claim 1 is derived from the following five lemmas. Note that each lemma addresses the satisfiability of one of the five quantization criteria presented in the main text and corresponds to part of Claim 1.
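The EFQN criterion holds if the precisions $B_{W_l}$ and $B_{A_l}$ are set as follows:
$$B_{W_l} = \mathrm{rnd}\left(\log_2 \sqrt{\frac{E_{W_l \to p_m}}{E^{(\min)}}}\right) + B^{(\min)}, \qquad B_{A_l} = \mathrm{rnd}\left(\log_2 \sqrt{\frac{E_{A_l \to p_m}}{E^{(\min)}}}\right) + B^{(\min)} \tag{1}$$
for $l = 1, \ldots, L$, where $\mathrm{rnd}(\cdot)$ denotes the rounding operation, $B^{(\min)}$ is a reference minimum precision, and $E^{(\min)}$ is given by:
$$E^{(\min)} = \min\left(\left\{E_{W_l \to p_m}\right\}_{l=1}^{L}, \left\{E_{A_l \to p_m}\right\}_{l=1}^{L}\right) \tag{5}$$
By definition of the reflected quantization noise variance, the EFQN criterion is satisfied if:
$$\Delta_{W_1}^2 E_{W_1 \to p_m} = \ldots = \Delta_{W_L}^2 E_{W_L \to p_m} = \Delta_{A_1}^2 E_{A_1 \to p_m} = \ldots = \Delta_{A_L}^2 E_{A_L \to p_m}$$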
where the quantization noise gains are given by:
for , where are the soft outputs and is the soft output corresponding to . The expressions for these quantization gains are obtained by linearly expanding (across layers) those used in (Sakr et al., 2017). Note that a second order upper bound is used as a surrogate expression for .
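From the definition of quantization step size, the above is equivalent to:
$$2^{-2B_{W_1}} E_{W_1 \to p_m} = \ldots = 2^{-2B_{W_L}} E_{W_L \to p_m} = 2^{-2B_{A_1}} E_{A_1 \to p_m} = \ldots = 2^{-2B_{A_L}} E_{A_L \to p_m}$$
Let $E^{(\min)}$ be as defined in (5):
$$E^{(\min)} = \min\left(\left\{E_{W_l \to p_m}\right\}_{l=1}^{L}, \left\{E_{A_l \to p_m}\right\}_{l=1}^{L}\right)$$
We can divide each term by $E^{(\min)}$:
$$2^{-2B_{W_1}} \frac{E_{W_1 \to p_m}}{E^{(\min)}} = \ldots = 2^{-2B_{W_L}} \frac{E_{W_L \to p_m}}{E^{(\min)}} = 2^{-2B_{A_1}} \frac{E_{A_1 \to p_m}}{E^{(\min)}} = \ldots = 2^{-2B_{A_L}} \frac{E_{A_L \to p_m}}{E^{(\min)}}$$
where each term is positive, so that we can take square roots and logarithms such that:
$$B_{W_1} - \log_2 \sqrt{\frac{E_{W_1 \to p_m}}{E^{(\min)}}} = \ldots = B_{W_L} - \log_2 \sqrt{\frac{E_{W_L \to p_m}}{E^{(\min)}}} = B_{A_1} - \log_2 \sqrt{\frac{E_{A_1 \to p_m}}{E^{(\min)}}} = \ldots = B_{A_L} - \log_2 \sqrt{\frac{E_{A_L \to p_m}}{E^{(\min)}}}$$
Thus we equate all of the above to a reference precision $B^{(\min)}$, yielding:
$$B_{W_l} = \log_2 \sqrt{\frac{E_{W_l \to p_m}}{E^{(\min)}}} + B^{(\min)}, \qquad B_{A_l} = \log_2 \sqrt{\frac{E_{A_l \to p_m}}{E^{(\min)}}} + B^{(\min)}$$
for $l = 1, \ldots, L$. Note that because $E^{(\min)}$ is the least quantization noise gain, it is equal to one of the above quantization noise gains, so that the corresponding precision actually equals $B^{(\min)}$. As precisions must be integer valued, each of $B_{W_l}$, $B_{A_l}$, and $B^{(\min)}$ has to be an integer, and thus a rounding operation is applied to all logarithm terms. Doing so results in (1) from Lemma 1, which completes the proof. ∎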
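The steps of this proof (normalize each quantization noise gain by the least one, take the square root and the base-2 logarithm, round, and add the reference precision) translate directly into a few lines of code. The sketch below is an illustration under assumptions: the function name, the tensor labels, and the noise gain values are hypothetical, and Python's built-in `round` (which rounds ties to even) stands in for the rounding operation, whose tie-breaking rule the proof does not specify.

```python
import math

def assign_precisions(noise_gains, b_min):
    """Map each tensor's quantization noise gain to an integer precision."""
    e_min = min(noise_gains.values())  # least quantization noise gain
    return {
        name: b_min + round(math.log2(math.sqrt(gain / e_min)))
        for name, gain in noise_gains.items()
    }

# Hypothetical noise gains for a two-layer network (weights W1, W2; activations A1, A2).
gains = {"W1": 64.0, "A1": 4.0, "W2": 1.0, "A2": 16.0}
precisions = assign_precisions(gains, b_min=4)
print(precisions)
```

The tensor with the least noise gain receives exactly the reference precision, and every fourfold increase in noise gain adds one bit, so the reflected noise contributions stay balanced up to rounding.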
The GC criterion holds for provided the weight and activation gradients pre-defined dynamic ranges (PDRs) are lower bounded as follows:
where and are the largest ever recorded estimates of the weight and activation gradients standard deviations and , respectively.
Let us consider the case of weight gradients. The GC criterion, by definition requires:
Typically, weight gradients are obtained by differentiating a loss function averaged over a mini-batch. By linearity of differentiation, weight gradients are themselves averages of instantaneous derivatives and are hence expected to approximately follow a Gaussian distribution by application of the Central Limit Theorem. Furthermore, the gradient mean was estimated during baseline training and was found to oscillate around zero.
where we used the fact that a Gaussian distribution is symmetric and is the elementary Q-function, which is a decreasing function. Thus, in the worst case, we have:
Hence, for a PDR as suggested by the lower bound in (2):
in Lemma 2, we obtain the upper bound:
which means the GC criterion holds and completes the proof.
For activation gradients, the same reasoning applies, but the choice of a larger PDR in (2):
than for weight gradients is due to the fact that the true dynamic range of the activation gradients is larger than the value indicated by the second moment. This stems from the use of activation functions such as ReLU, which make the activation gradients sparse. We also recommend increasing the PDR further when using regularizers that sparsify gradients, such as Dropout (Srivastava et al., 2014) or Maxout (Goodfellow et al., 2013). ∎
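Under the zero-mean Gaussian model used in this proof, the probability that a gradient falls outside a symmetric range $[-r, r]$ is $2Q(r/\sigma)$, which can be evaluated with the complementary error function. The sketch below is illustrative only: the helper names and the PDR multiples are assumptions, and the model does not capture the sparsity effects discussed for activation gradients.

```python
import math

def q_function(x):
    """Tail probability of a standard Gaussian, Q(x) = P(X > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def clipping_rate(pdr, sigma):
    """Probability that a zero-mean Gaussian with std sigma falls outside [-pdr, pdr]."""
    return 2.0 * q_function(pdr / sigma)

for multiple in (1.0, 2.0, 4.0):
    print(f"PDR = {multiple:.0f} sigma: clipping rate = {clipping_rate(multiple, 1.0):.2e}")
```

Because the Q-function is decreasing, evaluating the clipping rate at the largest recorded standard deviation gives the worst case for a fixed PDR.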
The RQB criterion holds for provided the weight gradient quantization step size is upper bounded as follows:
where is the smallest ever recorded estimate of .
For the Gaussian distributed (see proof of Lemma 2) weight gradient at layer , the true mean conditioned on the first non-zero quantization region is given by:
where is the standard deviation of . By substituting into the above expression of and plugging in the definition of relative quantization bias, we obtain:
Hence, this choice of the quantization step satisfies the RQB. In order to ensure the RQB holds throughout training, is used in Lemma 3. This completes the proof. ∎
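The conditional mean used in this proof can be evaluated numerically for a zero-mean Gaussian. The sketch below rests on assumptions that are not spelled out in the text above: a round-to-nearest quantizer whose first non-zero region is $[\Delta/2, 3\Delta/2]$ and is represented by the value $\Delta$, and a relative bias defined as the distance between that value and the conditional mean, normalized by the conditional mean. The step-to-sigma ratios are illustrative.

```python
import math

def pdf(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def relative_bias(step, sigma):
    """Relative gap between the quantized value and the true mean of the first non-zero bin."""
    lo, hi = 0.5 * step / sigma, 1.5 * step / sigma
    cond_mean = sigma * (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
    return abs(step - cond_mean) / cond_mean

for ratio in (1.0, 0.5, 0.25):
    print(f"step = {ratio} sigma: relative bias = {relative_bias(ratio, 1.0):.4f}")
```

The bias shrinks as the step size becomes small relative to the standard deviation, since the Gaussian density is then nearly flat across the bin. This is why the bound is enforced against the smallest recorded standard deviation.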
The BQN criterion holds provided the activation gradient quantization step size is upper bounded as follows:
where , the largest singular value of the square-Jacobian (Jacobian matrix with squared entries) of with respect to .
Let us unroll and to vectors of size and , respectively. The element-wise quantization noise variance of each weight gradient is . Therefore we have:
The reflected quantization noise variance from an activation gradient onto a weight gradient is
where cross products of quantization noise are neglected (Sakr et al., 2017). Hence, the reflected quantization noise variance element-wise from onto is given by:
where is the square-Jacobian of with respect to and denotes the all one vector with size denoted by its subscript. Hence, we have: