1 Introduction
Recent advances in deep learning, and convolutional neural networks (CNNs) in particular, have led to well-publicized breakthroughs in computer vision
(Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and natural language processing (NLP)
(Bahdanau et al., 2014). Modern CNNs, however, have increasingly large storage and computational requirements (Canziani et al., 2016). This has largely limited their application scope to data centres that can accommodate clusters of massively parallel hardware accelerators, such as graphics processing units (GPUs). Still, GPU training of CNNs on large datasets like ImageNet
(Deng et al., 2009) can take hours, even on clusters of hundreds of GPUs, and achieving linear scaling beyond these sizes is difficult (Goyal et al., 2017). As such, there is growing interest in more fine-grained optimizations, especially since deployment on embedded devices with limited power, compute, and memory budgets remains an imposing challenge.

Research efforts to reduce model size and speed up inference have shown that training networks with binary or ternary weights and activations (Courbariaux and Bengio, 2016; Rastegari et al., 2016; Li et al., 2016a) can achieve comparable accuracy to full precision networks, while benefiting from reduced memory requirements and improved computational efficiency using bit operations. They may even confer additional robustness to adversarial attacks (Galloway et al., 2018). More recently, the DoReFa-Net model has generalized this finding to include different precision settings for weights vs. activations, and demonstrated how low precision gradients can also be employed at training time (Zhou et al., 2016).
These findings suggest that precision in deep learning is not an arbitrary design choice, but rather a dial that controls the trade-off between model complexity and accuracy. However, precision is typically set at the design level of an entire model, making it difficult to treat as a tunable hyperparameter. We posit that considering precision at a finer granularity, such as per-layer or even per-example, could grant models more flexibility to find optimal configurations that maximize accuracy and minimize computational cost. To remain deterministic about hardware efficiency, we aim to do this under fixed budgets of precision, which have predictable acceleration properties.
In this work we consider learning an optimal precision configuration across the layers of a deep neural network, where the precision assigned to each layer may be different. We propose a stochastic regularization technique akin to Dropout (Srivastava et al., 2014)
where a network explores a different precision configuration per example. This introduces non-differentiable elements into the computational graph, which we circumvent using recently proposed gradient estimation techniques.
2 Related Work
Recent work related to efficient learning has explored a number of different approaches to reducing the effective parameter count or memory footprint of CNN architectures. Network compression techniques (Han et al., 2015a; Wang and Liang, 2016; Choi et al., 2016; Agustsson et al., 2017) typically compress a pre-trained network while minimizing the degradation of network accuracy. However, these methods are decoupled from learning, and are only suitable for efficient deployment. Network pruning techniques (Wan et al., 2013; Han et al., 2015b; Jin et al., 2016; Li et al., 2016b; Anwar et al., 2017; Molchanov et al., 2016; Tung et al., 2017) take a more iterative approach, often using regularized training and retraining to rank and modify network parameters based on their magnitude. Though coupled with the learning process, iterative pruning techniques tend to slow learning, and result in sparse connections, which are not hardware-efficient.
Our work relates primarily to low precision techniques, which have tended to focus on reducing the precision of weights and activations used for deployment while maintaining dense connectivity. Courbariaux et al. were among the first to explore binary weights and activations (Courbariaux et al., 2015; Courbariaux and Bengio, 2016), demonstrating state-of-the-art results for smaller datasets (MNIST, CIFAR-10, and SVHN). This idea was then extended to CNNs on larger datasets like ImageNet with binary weights and activations, while approximating convolutions using binary operations (Rastegari et al., 2016). Related work (Kim and Smaragdis, 2016) has validated these results and shown neural networks to be remarkably robust to an even wider class of nonlinear projections (Merolla et al., 2016). Ternary quantization strategies (Li et al., 2016a)
have been shown to outperform their binary counterparts, more so when parameters of the quantization module are learned through backpropagation
(Zhu et al., 2016). Cai et al. have investigated how to improve the gradient quality of quantization operations (Cai et al., 2017), which is complementary to our work, which relies on these gradients to learn precision. Zhou et al. further explored this idea of variable precision (i.e. heterogeneity across weights and activations) and discussed the general trade-off of precision and accuracy, exploring strategies for training with low precision gradients (Zhou et al., 2016).

Our approach for learning precision closely resembles BitNet (Raghavan et al., 2017), where the optimal precision for each network layer is learned through gradient descent, and network parameter encodings act as regularizers. While BitNet uses the Lagrangian approach of adding constraints on quantization error and precision to the objective function, we allocate bits to layers through sampling from a Gumbel-Softmax distribution constructed over the network layers. This has the advantage of accommodating a defined precision budget, which allows more deterministic hardware constraints, as well as a wider range of quantization encodings through the use of non-integer quantization values early in training. In the allocation of bits on a budget, our work resembles Wang and Liang (2016), though we allow more fine-grained control over precision, and prefer a gradient-based approach over clustering techniques for learning optimal precision configurations.
To the best of our knowledge, our work is the first to explore learning precision in deep networks through a continuous-to-discrete annealed quantization strategy. Our contributions are as follows:

We experimentally confirm a linear relationship between total number of bits and speedup for low precision arithmetic, motivating the use of precision budgets.

We introduce a gradient-based approach to learning precision through sampling from a Gumbel-Softmax distribution constructed over the network layers, constrained by a precision budget.

We empirically demonstrate the advantage of our end-to-end training strategy, as it improves model performance over simple uniform bit allocations.
3 Efficient Low Precision Networks
Low precision learning describes a set of techniques that take network parameters, typically stored at native 32-bit floating point (FP32) precision, and quantize them to a much smaller range of representation, typically 1-bit (binary) or 2-bit (ternary) integer values. While low precision learning could refer to any combination of quantizing weights (W), activations (A), and gradients (G), most relevant work investigates the effects of quantizing weights and activations on model performance. The benefits of quantization are seen in both computational and memory efficiency, though generally speaking, quantization leads to a decrease in model accuracy (Zhou et al., 2016). However, in some cases, the effects of quantization can be lossless or even slightly improve accuracy by behaving as a type of noisy regularization (Zhu et al., 2016; Yin et al., 2016). In this work, we adopt the DoReFa-Net model (Zhou et al., 2016) of quantizing all network parameters (W, A, G), albeit at different precisions. Table 1 demonstrates this trade-off of precision and accuracy for some common low precision configurations.
Model | W | A | G | Top-1 Validation Error
AlexNet (Krizhevsky et al., 2012) | 32 | 32 | 32 | 41.4%
BWN (Courbariaux and Bengio, 2016) | 1 | 32 | 32 | 44.3%
DoReFa-Net (Zhou et al., 2016) | 1 | 2 | 6 | 47.6%
DoReFa-Net (Zhou et al., 2016) | 1 | 2 | 4 | 58.4%
The justification for this loss in accuracy is the efficiency gain of storing and computing low precision values. The computational benefits of using binary values come from approximating expensive full precision matrix operations with binary operations (Rastegari et al., 2016), as well as from reducing memory requirements by packing many low precision values into full precision data types. For other low precision configurations that fall between binary and full precision, a similar formulation is used. The bit dot product equation (Equation 1) shows how the logical and and bitcount operations are used to compute the dot product of two low-bitwidth fixed-point integers (bit vectors) (Zhou et al., 2016). Assume $x = \sum_{m=0}^{M-1} c_m(x)\, 2^m$ is a place-wise bit vector formed from a sequence of $M$-bit fixed-point integers and $y = \sum_{k=0}^{K-1} c_k(y)\, 2^k$ is a place-wise bit vector formed from a sequence of $K$-bit fixed-point integers; then

$$x \cdot y = \sum_{m=0}^{M-1} \sum_{k=0}^{K-1} 2^{m+k}\, \mathrm{bitcount}\left[\mathrm{and}\left(c_m(x),\, c_k(y)\right)\right] \quad (1)$$
Since the computational complexity of the operation is $O(MK)$, the speedup is a function of the total number of bits used to quantize the inputs. Since matrix multiplications are simply sequences of dot products computed over the rows and columns of matrices, this is also true of matrix multiplication operations. As a demonstration, we implement this variable precision bit general matrix multiplication (bitGEMM) using CUDA, and show the GPU speedup for several configurations in Figure 1.
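To make Equation 1 concrete, here is a minimal Python sketch of the bit dot product. It is illustrative only: a real bitGEMM kernel packs bits into machine words and uses a hardware popcount instruction, and the function name and packing scheme here are our own.

```python
def bit_dot(xs, ys, M, K):
    """Dot product of two low-bitwidth fixed-point vectors via Equation 1.

    xs, ys: sequences of non-negative integers representable in M and K bits.
    For each bit position m, packs the m-th bit of every element of xs into a
    single integer (a "place-wise bit vector" c_m(x)), likewise for ys, then
    combines the packed vectors with logical AND and a bitcount.
    """
    acc = 0
    for m in range(M):
        # c_m(x): bit m of each x_i, packed into one n-bit integer
        cm = 0
        for i, x in enumerate(xs):
            cm |= ((x >> m) & 1) << i
        for k in range(K):
            ck = 0
            for i, y in enumerate(ys):
                ck |= ((y >> k) & 1) << i
            # 2^(m+k) * bitcount(and(c_m(x), c_k(y)))
            acc += (1 << (m + k)) * bin(cm & ck).count("1")
    return acc

# Matches the ordinary dot product for 2-bit xs and 3-bit ys:
xs, ys = [1, 2, 3, 0], [5, 1, 7, 4]
assert bit_dot(xs, ys, M=2, K=3) == sum(x * y for x, y in zip(xs, ys))
```

The nested loop over bit planes makes the $O(MK)$ complexity discussed above explicit: halving the bits of either operand halves the number of AND-plus-bitcount passes.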
As seen in Figure 1, there is a correlation between the total number of bits used in each bitGEMM (shade) and the resulting speedup (point size). As such, from a hardware perspective, it is important to know how many total bits are used for bitGEMM operations to allow for budgeting computation and memory. It should also be noted that, in our experiments, operations with over 16 total bits of precision were slower than the full precision equivalent: in the worst case, the $O(MK)$ cost of the bit operations outweighs that of the equivalent full precision operation. We therefore focus on operations with 16 or fewer total bits of precision.
A natural question to follow is then, for a given budget of precision (total number of bits), how do we most efficiently allocate precision to maximize model performance? We seek to answer this question by parametrizing the precision at each layer and learning these additional parameters by gradient descent.
3.1 Learning Precision
The selection of precision for variables in a model can have a significant impact on performance. Consider the DoReFa-Net model: if we decide on a budget of the total number of bits to assign to the weights layer-wise, and train the model under a number of different manual allocations, we obtain the training curves in Figure 2. For each training curve, the number of bits assigned to each layer's weights is indicated by the integer at the appropriate position (e.g. 444444 indicates 6 quantized layers, all assigned 4 bits).
Varying the number of bits assigned to each layer can cause the error to change by up to several percent, but the best configuration of these bits is unclear without exhaustively testing all possible configurations. This motivates learning the most efficient allocation of precision to each layer. However, this is a difficult task for two important reasons:

unconstrained parameters that control precision will only grow, as higher precision leads to a reduction of loss; and

quantization involves discrete operations, which are nondifferentiable and therefore unsuitable for naïve backpropagation.
The first issue is easily addressed by fixing the total network precision (i.e. the sum of the bits of precision at each layer) to a budget $B$, similar to the configurations seen in Figure 2. The task is then to learn the allocation of precision across layers. The second issue, the non-differentiable nature of quantization operations, is unavoidable, as transforming continuous values into discrete values must apply some kind of discrete operator. We avoid this issue by employing a kind of stochastic allocation of precision, and rely on recently developed techniques from the deep learning community to backpropagate gradients through discrete stochastic operations.
Intuitively, we can view the precision allocation procedure as sequentially allocating one bit of precision to one of $N$ layers. Each time we allocate, we draw from a categorical variable with $N$ outcomes, and allocate that bit to the corresponding layer. This is repeated $B$ times to match the precision budget. An allocation of bits corresponds to a particular precision configuration, and we sample a new configuration for each input example. The idea of stochastically sampling architectural configurations is akin to Dropout (Srivastava et al., 2014), where each example is processed by a different architecture with tied parameters.

Different from Dropout, which uses a fixed dropout probability, we would like to parametrize the categorical distribution across layers such that we can learn to prefer allocating precision to certain layers. Learning these parameters by gradient descent requires backpropagating through an operator that samples from a discrete distribution. To deal with its non-differentiability, we use the Gumbel-Softmax, also known as the Concrete distribution (Jang et al., 2016; Maddison et al., 2016), which, using a temperature parameter, can be smoothly annealed from a uniform continuous distribution on the simplex to a discrete categorical distribution from which we sample precision allocations. This allows us to use a high temperature at the beginning of training to stochastically explore different precision configurations, and a low temperature at the end of training to discretely allocate precision to network layers according to the learned distribution.

Though non-integer bits of precision can be implemented (detailed below), integer bits are more amenable to hardware implementations, which is why we aim to converge toward discrete samples. Table 2 shows examples of sampling from this distribution at different temperatures for a three-class distribution. Note that in order to perform unconstrained optimization we parametrize the unnormalized logits (also known as log-odds) instead of the probabilities themselves.
Temperature | Class 1 | Class 2 | Class 3
100.0 | 0.33 | 0.33 | 0.33
10.0 | 0.33 | 0.41 | 0.26
1.00 | 0.31 | 0.60 | 0.09
0.10 | 0.00 | 1.00 | 0.00

Each class is parametrized by an unnormalized logit, where higher values correspond to higher probabilities. The Gumbel-Softmax interpolates between continuous densities on the simplex (at high temperature) and discrete one-hot-encoded categorical distributions (at low temperature).
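The budgeted allocation procedure can be sketched in a few lines of NumPy. This is an illustrative snippet under our own naming, not a reference implementation: it draws one Gumbel-Softmax sample per bit in the budget and sums the samples into a per-layer allocation.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """One sample from the Gumbel-Softmax (Concrete) distribution."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    z = (logits + g) / tau
    e = np.exp(z - z.max())  # numerically stable softmax
    return e / e.sum()

def allocate_bits(logits, budget, tau, rng):
    """Allocate `budget` bits across layers: each of the `budget` draws
    distributes one (possibly fractional) bit according to its sample."""
    return sum(gumbel_softmax_sample(logits, tau, rng) for _ in range(budget))

rng = np.random.default_rng(0)
logits = np.zeros(3)                      # hypothetical three-layer network
soft = allocate_bits(logits, budget=9, tau=100.0, rng=rng)  # near-uniform split
hard = allocate_bits(logits, budget=9, tau=0.1, rng=rng)    # near-integer split
assert np.isclose(soft.sum(), 9) and np.isclose(hard.sum(), 9)
```

Because every sample lies on the simplex, the allocation always sums exactly to the budget; annealing the temperature only changes how concentrated each per-bit draw is.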
Since the class logits control the probability of allocating a bit of precision to a given network layer, at low temperatures the one-hot samples will allocate bits of precision to the network according to the learned parameters. However, at high temperatures we allocate partial bits to layers. This is possible due to our quantization straight-through estimator (STE), quantize, adopted from (Zhou et al., 2016):

Forward: $r_o = \frac{1}{2^b - 1}\,\mathrm{round}\!\left((2^b - 1)\, r_i\right)$ (2)

Backward: $\frac{\partial c}{\partial r_i} = \frac{\partial c}{\partial r_o}$ (3)

where $r_i$ is the real number input, $r_o$ is the $b$-bit output, and $c$ is the objective function. Since quantize produces a $b$-bit value on $[0, 1]$, quantizing to non-integer values of $b$ simply produces a more fine-grained range of representation than integer values of $b$. This is demonstrated in Table 3.
b = 1 | b = 1.5 | b = 2 | b = 2.25 | b = 2.5 | b = 2.75 | b = 3
0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
1.00 | 0.55 | 0.33 | 0.26 | 0.21 | 0.17 | 0.14
 | 1.00 | 0.66 | 0.53 | 0.42 | 0.34 | 0.28
 | | 1.00 | 0.80 | 0.64 | 0.52 | 0.42
 | | | 1.00 | 0.85 | 0.69 | 0.57
 | | | | 1.00 | 0.87 | 0.71
 | | | | | 1.00 | 0.85
 | | | | | | 1.00

Each column lists the representable quantization levels $k / (2^b - 1)$ (clamped at 1.00) for the given number of bits $b$.
Using real values for quantization also provides useful gradients for backpropagation, whereas the small finite set of possible integer values would yield zero gradient almost everywhere.
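The STE above admits a very small NumPy sketch (function names are ours): the forward pass follows Equation 2, the backward pass simply passes gradients through per Equation 3, and passing a non-integer $b$ reproduces the fractional-bit grids of Table 3.

```python
import numpy as np

def quantize(r, b):
    """Forward pass (Equation 2): map r in [0, 1] onto the grid k / (2**b - 1).
    Non-integer b produces a finer, intermediate grid, as in Table 3."""
    s = 2.0 ** b - 1.0
    return np.round(r * s) / s

def quantize_grad(grad_out):
    """Backward pass (Equation 3): the straight-through estimator passes the
    gradient through the rounding step unchanged."""
    return grad_out

r = np.linspace(0.0, 1.0, 5)
two_bit = quantize(r, 2)      # levels at multiples of 1/3
frac_bit = quantize(r, 1.5)   # fractional-bit grid, multiples of 1/(2**1.5 - 1)
```

Note how the gradient of the rounding step would be zero almost everywhere; the identity backward pass is exactly what makes the allocation parameters trainable.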
3.2 Precision Allocation Layer
We introduce a new layer type, the precision allocation layer, to implement our precision learning technique. This layer is inserted as a leaf in the computational graph, and is executed on each forward pass of training. The precision allocation layer is parametrized by the learnable class probabilities, which define the Gumbel-Softmax distribution. Each class probability is associated with a network layer, so the samples assigned to each class are allocated as bits to the appropriate layer. This is illustrated in Figure 3 and stepped through in Example 1.
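A schematic implementation of such a layer might look as follows. This is a NumPy sketch under our own naming, not the actual implementation: in practice the logits are trained by backpropagation through the Gumbel-Softmax, and the hard-assignment helper mirrors the sample-averaging procedure described in the text.

```python
import numpy as np

class PrecisionAllocationLayer:
    """Sketch of the precision allocation layer (names are illustrative).

    Holds per-layer logits (learned by backpropagation in a real model) and,
    on each forward pass, samples a bit allocation for the current example."""

    def __init__(self, n_layers, budget, seed=0):
        self.logits = np.zeros(n_layers)   # unnormalized log-odds per layer
        self.budget = budget               # total bits B to allocate
        self.rng = np.random.default_rng(seed)

    def sample_allocation(self, tau):
        """Allocate the budget one bit at a time via Gumbel-Softmax samples."""
        bits = np.zeros_like(self.logits)
        for _ in range(self.budget):
            g = -np.log(-np.log(self.rng.uniform(size=self.logits.shape)))
            z = (self.logits + g) / tau
            e = np.exp(z - z.max())        # numerically stable softmax
            bits += e / e.sum()
        return bits  # per-layer (possibly fractional) bits for this example

    def hard_assignment(self, tau, n_samples=10000):
        """Average many samples and round: a fixed integer allocation, used
        once the temperature has dropped and the logits have stabilized."""
        avg = np.mean([self.sample_allocation(tau)
                       for _ in range(n_samples)], axis=0)
        return np.round(avg).astype(int)

layer = PrecisionAllocationLayer(n_layers=5, budget=10)
soft = layer.sample_allocation(tau=10.0)                 # stochastic, per example
fixed = layer.hard_assignment(tau=1.0, n_samples=2000)   # integer bits per layer
```

Every sampled allocation sums exactly to the budget, which is what preserves the deterministic hardware cost discussed in Section 3.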
It should be noted that during the early stages of training, before network performance has converged, allowing the temperature to drop too low results in high-variance gradients while also encouraging largely uneven allocations of bits. This severely hurts the generalization of the network. To deal with this, we empirically observe a temperature at which the class probabilities have sufficiently stabilized, and perform hard assignments of bits of precision based on these stabilized class probabilities. To do this, we sample from the Gumbel-Softmax a large number of times and average the results, in order to converge on the expected class sample assignments. Once we have these precision values, we fix the layers at these precision values for the remainder of training. We observe that the regularization effects of stochastic bit allocations are most useful during early training, and performing hard assignments greatly improves generalization performance. For all experiments considered, we implement a hard assignment of bits after the temperature drops below a fixed threshold.

4 Experiments
We evaluate the effects of our precision layer on two common image classification benchmarks, MNIST and ILSVRC12. We consider two separate CNN architectures: a 5-layer network similar to LeNet-5 (Lecun et al., 1998) trained on MNIST, and the AlexNet network (Krizhevsky et al., 2012) trained on ILSVRC12. We use the top-1 single-crop error as a measure of performance, and quantize both the weights and activations in all experiments considered. As in previous works (Zhou et al., 2016), the first layer of AlexNet was not quantized. Initial experiments showed that the effects of learning precision were less beneficial for gradients, so we leave gradients at full precision in all reported experiments.
Since our motivation is to show the precision layer as an improvement over uniformly quantized low precision models, we compare our results to networks with evenly distributed precision over all layers. We consider three common low precision budgets as powers of 2, which would provide efficient hardware acceleration, where each layer is allocated 2, 4, or 8 bits. Baseline networks with uniform precision allocation are denoted with the allocation for each layer and the budget (e.g. budget=10_22222 denotes a baseline network with 2 bits manually allocated to each of 5 layers for a budget of 10 bits), while the networks with learned precision are denoted with the precision budget and learn (e.g. budget=10_learn denotes a learned precision network with a budget of 10 total bits, averaging 2 bits per layer).
4.1 MNIST
The training curves for the MNIST-trained models are shown in Figure 4. The learned Gumbel-Softmax class logits are shown in Figure 5.
We observe that the models with learned precision converge faster and reach a lower test error compared to the baseline models across all precision budgets considered, and the relative improvement is more substantial for lower precision budgets. From Figure 5, we observe that the models learn to assign fewer bits to early layers of the network (conv0, conv1) while assigning more bits to later layers of the network (conv2, conv3), as well as preferring smoother (i.e. more uniform) allocations. This result agrees with the empirical observations of (Raghavan et al., 2017). The results on MNIST are summarized in Table 4. Uncertainty is calculated by averaging over 10 runs for each network with different random initializations of the parameters.
Network | Precision Budget | Val. Error | Final Bit Allocation
budget=10_22222 | 10 bits | |
budget=10_learn | 10 bits | |
budget=20_44444 | 20 bits | |
budget=20_learn | 20 bits | |
budget=40_88888 | 40 bits | |
budget=40_learn | 40 bits | |

Network | Precision Budget | Val. Error | Final Bit Allocation
budget=14_2222222 | 14 bits | |
budget=14_learn | 14 bits | |
budget=28_4444444 | 28 bits | |
budget=28_learn | 28 bits | |
budget=56_8888888 | 56 bits | |
budget=56_learn | 56 bits | |
4.2 ILSVRC12
The results for ILSVRC12 are summarized in Table 5. Again, models that employ stochastic precision allocation converge faster and ultimately reach a lower test error than their fixed-precision counterparts on the same budget. We observe that networks trained with stochastic precision learn to take bits from early layers and assign them to later layers, similar to the MNIST results. While the trends in the class logits for the MNIST network were similar to the ILSVRC12 results, they were not substantial enough to cause changes in bit allocation during our hard assignments. The ILSVRC12-trained networks, however, do make non-uniform hard assignments. This suggests that the precision layer has a larger effect on more complex networks.
5 Conclusion and Future Work
We introduced a precision allocation layer for DNNs, and proposed a stochastic allocation scheme for learning precision on a fixed budget. We have shown that learned precision models outperform uniformly allocated low precision models. This effect is due both to learning the optimal layer-wise configuration of precision, and to the regularization effects of stochastically exploring different precision configurations during training. Moreover, the use of precision budgets allows a high level of hardware acceleration determinism, which has practical implications.
While the present experiments were focused on accuracy rather than computational efficiency, future work will examine using GPU bit kernels in place of the full precision kernels we used in our experiments. We also intend to investigate stochastic precision in the adversarial setting. This is inspired by Galloway et al. (2018), who report that stochastic quantization at test time yields robustness towards iterative attacks.
Finally, we are interested in a variant of the model where rather than directly parametrizing precision, precision is conditioned on the input. While this reduces hardware acceleration determinism in realtime or memoryconstrained settings, it would enable a DNN to dynamically adapt its precision configuration to individual examples.
References
 Agustsson et al. [2017] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Softtohard vector quantization for endtoend learned compression of images and neural networks. arXiv preprint arXiv:1704.00648, 2017.
 Anwar et al. [2017] S. Anwar, K. Hwang, and W. Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.
 Bahdanau et al. [2014] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Cai et al. [2017] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by halfwave gaussian quantization. arXiv preprint arXiv:1702.00953, 2017.
 Canziani et al. [2016] A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
 Choi et al. [2016] Y. Choi, M. ElKhamy, and J. Lee. Towards the limit of network quantization. arXiv preprint arXiv:1612.01543, 2016.
 Courbariaux and Bengio [2016] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.
 Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J. David. Binaryconnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.

 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
 Galloway et al. [2018] A. Galloway, G. W. Taylor, and M. Moussa. Attacking binarized neural networks. In International Conference on Learning Representations (ICLR), 2018.
 Goyal et al. [2017] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Han et al. [2015a] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. [2015b] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015b.
 Hinton et al. [2012] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.r. Mohamed, N. Jaitly, A. Senior, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Jang et al. [2016] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with GumbelSoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Jin et al. [2016] X. Jin, X. Yuan, J. Feng, and S. Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
 Kim and Smaragdis [2016] M. Kim and P. Smaragdis. Bitwise neural networks. arXiv preprint arXiv:1601.06071, 2016.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
 Lecun et al. [1998] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 Li et al. [2016a] F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016a.
 Li et al. [2016b] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016b.
 Maddison et al. [2016] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Merolla et al. [2016] P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha. Deep neural networks are robust to weight binarization and other nonlinear distortions. arXiv preprint arXiv:1606.01981, 2016.
 Molchanov et al. [2016] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440, 2016.
 Raghavan et al. [2017] A. Raghavan, M. Amer, S. Chai, and G. W. Taylor. BitNet: Bitregularized deep neural networks. arXiv preprint arXiv:1708.04788, 2017.
 Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
 Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
 Tung et al. [2017] F. Tung, S. Muralidharan, and G. Mori. FinePruning: Joint FineTuning and compression of a convolutional network with bayesian optimization. In British Machine Vision Conference (BMVC), 2017.
 Wan et al. [2013] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Learning Representations (ICLR), pages 1058–1066, 2013.
 Wang and Liang [2016] X. Wang and J. Liang. Scalable compression of deep neural networks. In Proceedings of the ACM on Multimedia Conference, pages 511–515, 2016.
 Yin et al. [2016] P. Yin, S. Zhang, Y. Qi, and J. Xin. Quantization and training of low bit-width convolutional neural networks for object detection. arXiv preprint arXiv:1612.06052, 2016.
 Zhou et al. [2016] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bit-width convolutional neural networks with low bit-width gradients. arXiv preprint arXiv:1606.06160, 2016.
 Zhu et al. [2016] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.