Stochastic Layer-Wise Precision in Deep Neural Networks

07/03/2018 ∙ by Griffin Lacey, et al.

Low precision weights, activations, and gradients have been proposed as a way to improve the computational efficiency and memory footprint of deep neural networks. Recently, low precision networks have even been shown to be more robust to adversarial attacks. However, typical implementations of low precision DNNs use uniform precision across all layers of the network. In this work, we explore whether a heterogeneous allocation of precision across a network leads to improved performance, and introduce a learning scheme in which a DNN stochastically explores multiple precision configurations during training. This permits a network to learn an optimal precision configuration. We show on convolutional neural networks trained on MNIST and ILSVRC12 that, even though these networks learn a uniform or near-uniform allocation strategy respectively, stochastic precision leads to a favourable regularization effect that improves generalization.


1 Introduction

Recent advances in deep learning, and convolutional neural networks (CNNs) in particular, have led to well-publicized breakthroughs in computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), and natural language processing (NLP) (Bahdanau et al., 2014).

Modern CNNs, however, have increasingly large storage and computational requirements (Canziani et al., 2016). This has limited their application scope to data centres that can accommodate clusters of massively parallel hardware accelerators, such as graphics processing units (GPUs). Even so, GPU training of CNNs on large datasets like ImageNet (Deng et al., 2009) can take hours, even with hundreds of GPUs, and achieving linear scaling beyond these sizes is difficult (Goyal et al., 2017). As such, there is growing interest in more fine-grained optimizations, especially since deployment on embedded devices with limited power, compute, and memory budgets remains an imposing challenge.

Research efforts to reduce model size and speed up inference have shown that training networks with binary or ternary weights and activations (Courbariaux and Bengio, 2016; Rastegari et al., 2016; Li et al., 2016a) can achieve accuracy comparable to full precision networks, while benefiting from reduced memory requirements and improved computational efficiency through bit operations. They may even confer additional robustness to adversarial attacks (Galloway et al., 2018). More recently, the DoReFa-Net model generalized this finding to include different precision settings for weights and activations, and demonstrated how low precision gradients can also be employed at training time (Zhou et al., 2016).

These findings suggest that precision in deep learning is not an arbitrary design choice, but rather a dial that controls the trade-off between model complexity and accuracy. However, precision is typically set at the design level for an entire model, making it difficult to treat as a tunable hyperparameter. We posit that considering precision at a finer granularity, such as per layer or even per example, could grant models more flexibility in finding configurations that maximize accuracy and minimize computational cost. To remain deterministic about hardware efficiency, we aim to do this under fixed budgets of precision, which have predictable acceleration properties.

In this work we consider learning an optimal precision configuration across the layers of a deep neural network, where the precision assigned to each layer may be different. We propose a stochastic regularization technique akin to Dropout (Srivastava et al., 2014) where a network explores a different precision configuration per example. This introduces non-differentiable elements in the computational graph, which we circumvent using recently proposed gradient estimation techniques.

2 Related Work

Recent work related to efficient learning has explored a number of different approaches to reducing the effective parameter count or memory footprint of CNN architectures. Network compression techniques (Han et al., 2015a; Wang and Liang, 2016; Choi et al., 2016; Agustsson et al., 2017) typically compress a pre-trained network while minimizing the degradation of network accuracy. However, these methods are decoupled from learning, and are only suitable for efficient deployment. Network pruning techniques (Wan et al., 2013; Han et al., 2015b; Jin et al., 2016; Li et al., 2016b; Anwar et al., 2017; Molchanov et al., 2016; Tung et al., 2017) take a more iterative approach, often using regularized training and retraining to rank and modify network parameters based on their magnitude. Though coupled with the learning process, iterative pruning techniques tend to contribute to slower learning, and result in sparse connections, which are not hardware efficient.

Our work relates primarily to low precision techniques, which have tended to focus on reducing the precision of weights and activations used for deployment while maintaining dense connectivity. Courbariaux et al. were among the first to explore binary weights and activations (Courbariaux et al., 2015; Courbariaux and Bengio, 2016), demonstrating state-of-the-art results on smaller datasets (MNIST, CIFAR-10, and SVHN). This idea was then extended to CNNs on larger datasets like ImageNet, using binary weights and activations and approximating convolutions with binary operations (Rastegari et al., 2016). Related work (Kim and Smaragdis, 2016) has validated these results and shown neural networks to be remarkably robust to an even wider class of non-linear projections (Merolla et al., 2016). Ternary quantization strategies (Li et al., 2016a) have been shown to outperform their binary counterparts, more so when parameters of the quantization module are learned through backpropagation (Zhu et al., 2016). Cai et al. have investigated how to improve the gradient quality of quantization operations (Cai et al., 2017), which is complementary to our work, since we rely on these gradients to learn precision. Zhou et al. further explored this idea of variable precision (i.e. heterogeneity across weights and activations) and discussed the general trade-off of precision and accuracy, exploring strategies for training with low precision gradients (Zhou et al., 2016).

Our approach for learning precision closely resembles BitNet (Raghavan et al., 2017), where the optimal precision for each network layer is learned through gradient descent, and network parameter encodings act as regularizers. While BitNet uses the Lagrangian approach of adding constraints on quantization error and precision to the objective function, we allocate bits to layers through sampling from a Gumbel-Softmax distribution constructed over the network layers. This has the advantage of accommodating a defined precision budget, which allows more deterministic hardware constraints, as well as a wider range of quantization encodings through the use of non-integer quantization values early in training. In the allocation of bits on a budget, our work resembles (Wang and Liang, 2016), though we allow more fine-grained control over precision, and prefer a gradient-based approach over clustering techniques for learning optimal precision configurations.

To the best of our knowledge, our work is the first to explore learning precision in deep networks through a continuous-to-discrete annealed quantization strategy. Our contributions are as follows:

  • We experimentally confirm a linear relationship between total number of bits and speedup for low precision arithmetic, motivating the use of precision budgets.

  • We introduce a gradient-based approach to learning precision through sampling from a Gumbel-Softmax distribution constructed over the network layers, constrained by a precision budget.

  • We empirically demonstrate the advantage of our end-to-end training strategy as it improves model performance over simple uniform bit allocations.

3 Efficient Low Precision Networks

Low precision learning describes a set of techniques that take network parameters, typically stored at native 32-bit floating point (FP32) precision, and quantize them to a much smaller range of representation, typically 1-bit (binary) or 2-bit (ternary) integer values. While low precision learning could refer to any combination of quantizing weights (W), activations (A), and gradients (G), most relevant work investigates the effects of quantizing weights and activations on model performance. The benefits of quantization are seen in both computational and memory efficiency, though generally speaking, quantization leads to a decrease in model accuracy (Zhou et al., 2016). However, in some cases, the effects of quantization can be lossless or even slightly improve accuracy by behaving as a type of noisy regularization (Zhu et al., 2016; Yin et al., 2016). In this work, we adopt the DoReFa-Net model (Zhou et al., 2016) of quantizing all network parameters (W,A,G) albeit at different precision. Table 1 demonstrates this trade-off of precision and accuracy for some common low precision configurations.

Model W A G Top-1 Validation Error
AlexNet (Krizhevsky et al., 2012) 32 32 32 41.4%
BWN (Courbariaux and Bengio, 2016) 1 32 32 44.3%
DoReFa-Net (Zhou et al., 2016) 1 2 6 47.6%
DoReFa-Net (Zhou et al., 2016) 1 2 4 58.4%
Table 1: DoReFa-Net single-crop top-1 validation error for common weight (W), activation (A), and gradient (G) quantization configurations on the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC12) dataset. The results are slightly improved over the results originally reported and were reproduced by us based on the public DoReFa-Net (Zhou et al., 2016) codebase.

The justification for this loss in accuracy is the efficiency gain of storing and computing low precision values. The computational benefits of using binary values come from approximating expensive full precision matrix operations with binary operations (Rastegari et al., 2016), as well as from reducing memory requirements by packing many low precision values into full precision data types. For other low precision configurations that fall between binary and full precision, a similar formulation is used. The bit dot product equation (Equation 1) shows how both the logical "and" and "bitcount" operations are used to compute the dot product of two low-bitwidth fixed-point integers (bit vectors) (Zhou et al., 2016). Assume x is a placewise bit vector formed from a sequence of M-bit fixed-point integers, x = Σ_{m=0}^{M−1} c_m(x) 2^m, and y is a placewise bit vector formed from a sequence of K-bit fixed-point integers, y = Σ_{k=0}^{K−1} c_k(y) 2^k; then

x · y = Σ_{m=0}^{M−1} Σ_{k=0}^{K−1} 2^{m+k} · bitcount(and(c_m(x), c_k(y)))    (1)

Since the computational complexity of the operation is O(MK), the speedup is a function of the total number of bits used to quantize the inputs. Since matrix multiplications are simply sequences of dot products computed over the rows and columns of matrices, this is also true of matrix multiplication operations. As a demonstration, we implement this variable precision bit general matrix multiplication (bit-GEMM) using CUDA, and show the GPU speedup for several configurations in Figure 1.
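As an illustration of Equation 1, here is a minimal NumPy sketch of ours (not the paper's CUDA bit-GEMM kernel) that computes the dot product of two low-bitwidth integer vectors from their bit planes using AND and popcount; the helper name and interface are hypothetical:

```python
import numpy as np

def bit_dot(x, y, M, K):
    """Dot product of two low-bitwidth fixed-point integer vectors via
    bitwise AND and popcount, following Equation (1)."""
    acc = 0
    for m in range(M):                      # m-th bit plane of x
        cm = (x >> m) & 1
        for k in range(K):                  # k-th bit plane of y
            ck = (y >> k) & 1
            # bitcount(and(c_m(x), c_k(y))) over the whole vector
            acc += (1 << (m + k)) * int(np.sum(cm & ck))
    return acc

# Example: 4-bit x, 2-bit y
x = np.array([3, 7, 12, 5], dtype=np.uint8)
y = np.array([1, 2, 3, 0], dtype=np.uint8)
assert bit_dot(x, y, M=4, K=2) == int(np.dot(x.astype(int), y.astype(int)))
```

The double loop makes the O(MK) scaling explicit: doubling the bitwidth of either operand doubles the number of AND/popcount passes.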

Figure 1: GPU-based bit-GEMM speedup for low precision matrices A and B. Results are compared with a similarly optimized 32-bit GEMM kernel, and run on an NVIDIA Tesla V100 GPU.

As seen in Figure 1, there is a correlation between the total number of bits used in each bit-GEMM (shade) and the resulting speedup (point size). As such, from a hardware perspective, it is important to know how many total bits are used for bit-GEMM operations to allow for budgeting computation and memory. It should also be noted that, in our experiments, operations with over 16 total bits of precision were slower than the full precision equivalent: because the bit operations scale as O(MK), the worst cases are slower than the equivalent full precision operation. We therefore focus on operations with 16 or fewer total bits of precision.

A natural question to follow is then, for a given budget of precision (total number of bits), how do we most efficiently allocate precision to maximize model performance? We seek to answer this question by parametrizing the precision at each layer and learning these additional parameters by gradient descent.

3.1 Learning Precision

The selection of precision for variables in a model can have a significant impact on performance. Consider the DoReFa-Net model: if we fix a budget for the total number of bits assigned to the weights layer-wise and train the model under a number of different manual allocations, we obtain the training curves in Figure 2. For each training curve, the number of bits assigned to each layer's weights is indicated by the integer at the corresponding position (e.g. 444444 indicates 6 quantized layers, all assigned 4 bits).

Figure 2: The training error for a variety of DoReFa-Net manual precision allocations is plotted for each weight update. The uniform distribution of bits (i.e. 444444) leads to the lowest error, while the least uniform configurations (e.g. 222288, 882222) lead to the highest error.

Varying the number of bits assigned to each layer can cause the error to change by up to several percent, but the best configuration of these bits is unclear without exhaustively testing all possible configurations. This motivates learning the most efficient allocation of precision to each layer. However, this is a difficult task for two important reasons:

  • unconstrained parameters that control precision will only grow, as higher precision leads to a reduction of loss; and

  • quantization involves discrete operations, which are non-differentiable and therefore unsuitable for naïve backpropagation.

The first issue is easily addressed by fixing the total network precision (i.e. the sum of the bits of precision across layers) to a budget, similar to the configurations seen in Figure 2. The task is then to learn the allocation of precision across layers. The second issue, the non-differentiable nature of quantization operations, is unavoidable, as transforming continuous values into discrete values must apply some kind of discrete operator. We sidestep it by employing a stochastic allocation of precision and rely on recently developed techniques from the deep learning community to backpropagate gradients through discrete stochastic operations.

Intuitively, we can view the precision allocation procedure as sequentially allocating one bit of precision to one of the network's layers. Each time we allocate, we draw from a categorical variable with one outcome per layer, and allocate that bit to the corresponding layer. This is repeated until the precision budget is exhausted. An allocation of bits corresponds to a particular precision configuration, and we sample a new configuration for each input example. The idea of stochastically sampling architectural configurations is akin to Dropout (Srivastava et al., 2014), where each example is processed by a different architecture with tied parameters.

Different from Dropout, which uses a fixed dropout probability, we would like to parametrize the categorical distribution across layers such that we can learn to prefer to allocate precision to certain layers. Learning these parameters by gradient descent requires backprop through an operator that samples from a discrete distribution. To deal with its non-differentiability, we use the Gumbel-Softmax, also known as the Concrete distribution (Jang et al., 2016; Maddison et al., 2016), which, using a temperature parameter, can be smoothly annealed from a uniform continuous distribution on the simplex to a discrete categorical distribution from which we sample precision allocations. This allows us to use a high temperature at the beginning of training to stochastically explore different precision configurations and a low temperature at the end of training to discretely allocate precision to network layers according to the learned distribution.
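As a concrete illustration (our own sketch, not the paper's codebase), a single Gumbel-Softmax sample over layer-wise logits can be drawn as follows; the function name and interface are hypothetical:

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng=np.random.default_rng()):
    """Draw one relaxed sample from a categorical distribution over layers.

    logits: unnormalized log-odds, one per layer.
    temperature: high values give near-uniform continuous samples,
    low values give near-one-hot (discrete) samples.
    """
    gumbel_noise = -np.log(-np.log(rng.uniform(size=len(logits))))
    scaled = (np.asarray(logits, dtype=float) + gumbel_noise) / temperature
    scaled -= scaled.max()                  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# High temperature: roughly uniform; low temperature: close to one-hot.
logits = np.array([0.0, 0.5, -0.5])
print(gumbel_softmax_sample(logits, temperature=100.0))
print(gumbel_softmax_sample(logits, temperature=0.1))
```

Accumulating one such sample per bit in the budget yields the per-layer (possibly fractional) bit counts used below.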

Though non-integer bits of precision can be implemented (detailed below), integer bits are more amenable to hardware implementations, which is why we aim to converge toward discrete samples. Table 2 shows examples of sampling from this distribution at different temperatures for a three-class distribution. It should be noted that in order to perform unconstrained optimization we parametrize the unnormalized logits (also known as log-odds) instead of the probabilities themselves.

Temperature   Class 1   Class 2   Class 3
100.0         0.33      0.33      0.33
10.0          0.33      0.41      0.26
1.00          0.31      0.60      0.09
0.10          0.00      1.00      0.00
Table 2: Examples of single samples drawn from a Gumbel-Softmax distribution, parametrized by logits for three classes, at a variety of temperatures. The probability of allocating a bit to a layer is governed by the corresponding logit, where higher values correspond to higher probabilities. The Gumbel-Softmax interpolates between continuous densities on the simplex (at high temperature) and discrete one-hot-encoded categorical distributions (at low temperature).

Since the class logits control the probability of allocating a bit of precision to a given network layer, at low temperatures the one-hot samples will allocate bits of precision to the network according to the learned logits. However, at high temperatures we allocate partial bits to layers. This is possible due to our quantization straight-through estimator (STE), quantize, adopted from (Zhou et al., 2016):

Forward:   q = round((2^k − 1) · r) / (2^k − 1)    (2)
Backward:  ∂c/∂r = ∂c/∂q    (3)

where r is the real number input, q is the k-bit output, and c is the objective function. Since quantize produces a k-bit value on [0, 1], quantizing with non-integer values of k simply produces a more fine-grained range of representation compared to integer values of k; Table 3 demonstrates this.
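For illustration, a minimal PyTorch sketch of such a straight-through quantizer (our own rendering of Equations 2 and 3, not the authors' code) might look like this, assuming inputs already lie in [0, 1]:

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """k-bit uniform quantizer with a straight-through estimator.

    Forward:  q = round((2**k - 1) * r) / (2**k - 1), r assumed in [0, 1].
    Backward: the gradient passes through unchanged (dc/dr = dc/dq).
    k may be non-integer, which only changes the quantization grid.
    """

    @staticmethod
    def forward(ctx, r, k):
        n = 2.0 ** k - 1.0
        return torch.round(r * n) / n

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient to r; k gets no gradient here.
        return grad_output, None

def quantize(r, k):
    return QuantizeSTE.apply(r, k)

# Example: quantize activations to 2 bits and to a fractional 2.5 bits.
x = torch.rand(4, requires_grad=True)
print(quantize(x, 2.0), quantize(x, 2.5))
```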

k    1      1.5    2      2.25   2.5    2.75   3
     0.00   0.00   0.00   0.00   0.00   0.00   0.00
     1.00   0.55   0.33   0.26   0.21   0.17   0.14
            1.00   0.66   0.53   0.42   0.34   0.28
                   1.00   0.80   0.64   0.52   0.42
                          1.00   0.85   0.69   0.57
                                 1.00   0.87   0.71
                                        1.00   0.85
                                               1.00
Table 3: The possible output values of the quantize operation are shown for a variety of values of k (columns), clipped between 0 and 1. Non-integer values of k provide a more fine-grained range of representation between the values available at the successive integer bitwidths k = 1, 2, 3.

Using real values for quantization also provides useful gradients for backpropagation, whereas the small finite set of possible integer values would yield zero gradient almost everywhere.
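The representable values in Table 3 follow directly from the quantizer definition; the following NumPy snippet (a hypothetical helper of ours) prints the grid for each bitwidth shown in the table, up to rounding differences in the displayed digits:

```python
import numpy as np

def quantize_grid(k):
    """Possible outputs of a k-bit quantizer, i / (2**k - 1), clipped to [0, 1].

    k may be non-integer, giving a finer grid than the neighbouring integers.
    """
    n = 2.0 ** k - 1.0
    levels = np.arange(int(np.round(n)) + 1)
    return np.clip(levels / n, 0.0, 1.0)

for k in (1, 1.5, 2, 2.25, 2.5, 2.75, 3):
    print(k, np.round(quantize_grid(k), 2))
```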

3.2 Precision Allocation Layer

We introduce a new layer type, the precision allocation layer, to implement our precision learning technique. This layer is inserted as a leaf in the computational graph, and is executed on each forward pass of training. The precision allocation layer is parametrized by learnable class logits, which define the Gumbel-Softmax distribution. Each class is associated with a network layer, so the samples assigned to each class are allocated as bits to the appropriate layer. This is illustrated in Figure 3 and stepped through in Example 1.

Figure 3: Early in training, the Gumbel-Softmax class logits are initialized to 0 and the temperature is high, generally resulting in uniform allocations of real-valued bits of precision. Later in training, with a low temperature, the learned class logits usually result in some layers being allocated more or fewer discrete bits of precision than others. In practice, the computational overhead of the precision layer is not noticeable.


Example 1.
Consider a network with 4 layers. To learn the precision of these layers, we add a precision layer which constructs a Gumbel-Softmax distribution with 4 classes, where each class is assigned to a layer. Each class logit is initialized to 0.0, and the temperature is initialized to 50.0. For a budget of 16 total bits, two example iterations representative of early training (first iteration of epoch 0) and late training (first iteration of epoch 50) are shown below; a code sketch of this procedure follows the example.

Epoch = 0, class logits all 0.0, temperature 50.0:

  • The precision layer is executed first: we sample from the Gumbel-Softmax 16 times, because our budget is 16 bits, and accumulate the results of the 16 samples. Each individual sample gives us 4 class outputs which sum to 1 (e.g. [0.25, 0.25, 0.25, 0.25]), so sampling 16 times and accumulating the results for each class means our final results will add to 16. Since the class logits are initialized to 0.0 and the temperature is high, the samples will be continuous and relatively uniform across all classes.

    • Samples: 4.1 for layer 1, 3.9 for layer 2, 4.2 for layer 3, 3.8 for layer 4.

    • We now assign 4.1 bits to layer 1, 3.9 bits to layer 2, 4.2 bits to layer 3, and 3.8 bits to layer 4. These bits are assigned to the appropriate layer by applying the quantize operation (Equation 2) to the desired parameters (e.g. weights) with k set to the corresponding bit assignment (e.g. k = 4.1 for layer 1). The quantize operation transforms the layer parameters to one of several discrete positive quantities between 0 and 1 (see Table 3). Though these are fractional bit assignments, the quantize operation works the same as if they were integer bit assignments.

    • The iteration then proceeds as normal, with the quantized parameters and class logits updated during back-propagation.

Epoch = 50, learned class logits, low temperature:

  • Similar to before, we begin by sampling from the Gumbel-Softmax 16 times because our budget is 16 bits, and accumulate the results. However, the class logits have now changed such that layer 1 corresponds to the most likely class and layer 4 to the least likely class, so the distribution is no longer uniform. As well, since the temperature is now low, the samples will approach discrete one-hot vectors.

    • Samples: 5.0 for layer 1, 4.0 for layer 2, 4.0 for layer 3, 3.0 for layer 4.

    • We now assign 5.0 bits to layer 1, 4.0 bits to layer 2, 4.0 bits to layer 3, and 3.0 bits to layer 4, similar to before. Since these bit assignments are integer values, the activity of the quantize operation is more intuitive.

    • Similar to before, the quantize operation will transform the layer parameters to one of several discrete positive quantities between 0 and 1 (see Table 3).

    • The forward-pass then proceeds as normal, with the quantized parameters and class probabilities updated during back-propagation. Since this is late training, parameter updates will be smaller in magnitude than in early training.
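To make the procedure in Example 1 concrete, here is a minimal NumPy sketch of the allocation step, including the hard assignment described in the next paragraph. The function names are ours and this is an illustration of the scheme under stated assumptions, not the authors' implementation:

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """One relaxed categorical sample over the network layers."""
    gumbel_noise = -np.log(-np.log(rng.uniform(size=len(logits))))
    scaled = (logits + gumbel_noise) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()

def allocate_bits(logits, temperature, budget, rng):
    """Stochastic per-example allocation: draw `budget` samples and accumulate.

    Returns one (possibly fractional) bit count per layer, summing to `budget`.
    """
    return sum(gumbel_softmax_sample(logits, temperature, rng)
               for _ in range(budget))

def hard_assignment(logits, temperature, budget, rng, n_samples=10000):
    """Freeze the allocation once the logits have stabilized.

    Averages many stochastic allocations to estimate the expected bit counts,
    then rounds to integers while preserving the total budget (one reasonable
    rounding rule; the paper does not specify it).
    """
    expected = np.mean([allocate_bits(logits, temperature, budget, rng)
                        for _ in range(n_samples)], axis=0)
    bits = np.floor(expected).astype(int)
    leftover = budget - bits.sum()
    for i in np.argsort(bits - expected)[:leftover]:   # largest remainders first
        bits[i] += 1
    return bits

rng = np.random.default_rng(0)
logits = np.zeros(4)                          # 4 layers, uniform at initialization
print(allocate_bits(logits, 50.0, 16, rng))   # early training: roughly [4, 4, 4, 4]
print(hard_assignment(np.array([0.4, 0.0, 0.0, -0.4]), 0.5, 16, rng, 2000))
```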

It should be noted that during the early stages of training, before the network performance has converged, allowing the temperature to drop too low results in high-variance gradients while also encouraging largely uneven allocations of bits. This severely hurts the generalization of the network. To deal with this, we empirically observe a temperature at which the class logits have sufficiently stabilized, and perform hard assignments of bits of precision based on these stabilized logits. To do this, we sample from the Gumbel-Softmax a large number of times and average the results, in order to converge on the expected class sample assignments. Once we have these precision values, we fix the layers at these precisions for the remainder of training. We observe that the regularization effects of stochastic bit allocations are most useful during early training, and performing hard assignments greatly improves generalization performance. For all experiments considered, we implement the hard assignment of bits after the temperature drops below a fixed threshold.

4 Experiments

We evaluate the effects of our precision layer on two common image classification benchmarks, MNIST and ILSVRC12. We consider two separate CNN architectures, a 5-layer network similar to LeNet-5 (Lecun et al., 1998) trained on MNIST, and the AlexNet network (Krizhevsky et al., 2012) trained on ILSVRC12. We use the top-1 single-crop error as a measure of performance, and quantize both the weights and activations in all experiments considered. As in previous works (Zhou et al., 2016), the first layer of AlexNet was not quantized. Initial experiments showed that learning precision was less beneficial for gradients, so we leave gradients at full precision in all reported experiments.

Since our motivation is to show the precision layer as an improvement over uniformly quantized low precision models, we compare our results to networks with evenly distributed precision over all layers. We consider three common low precision budgets as powers of 2, which would provide efficient hardware acceleration, where each layer is allocated 2, 4, or 8 bits. Baseline networks with uniform precision allocation are denoted with the allocation for each layer and the budget (e.g. budget=10_22222 denotes a baseline network with 2 bits manually allocated to each of 5 layers for a budget of 10 bits), while the networks with learned precision are denoted with the precision budget and learn (e.g. budget=10_learn denotes a learned precision network with a budget of 10 total bits, averaging 2 bits per layer).

4.1 MNIST

The training curves for the MNIST-trained models are shown in Figure 4. The learned Gumbel-Softmax class logits are shown in Figure 5.

Figure 4: The top-1 errors for the MNIST networks are shown. Both baseline networks with uniform precision allocation (i.e. budget=10_22222) and learned precision networks (i.e. budget=10_learn) are assigned the same total precision budget, indicated by the prefix.
Figure 5: The learned Gumbel-Softmax class logits for different precision budgets.

We observe that the models with learned precision converge faster and reach a lower test error compared to the baseline models across all precision budgets considered, and the relative improvement is more substantial for lower precision budgets. From Figure 5, we observe that the models learn to assign fewer bits to early layers of the network (conv0, conv1) while assigning more bits to later layers of the network (conv2, conv3), as well as preferring smoother (i.e. more uniform) allocations. This result agrees with the empirical observations of (Raghavan et al., 2017). The results on MNIST are summarized in Table 4. Uncertainty is calculated by averaging over 10 runs for each network with different random initializations of the parameters.

Network Precision Budget Val. Error Final Bit Allocation
budget=10_22222 10 Bits
budget=10_learn 10 Bits
budget=20_44444 20 Bits
budget=20_learn 20 Bits
budget=40_88888 40 Bits
budget=40_learn 40 Bits
Table 4: Final training results and top-1 error for the MNIST baseline and learned precision models.
Network Precision Budget Val. Error Final Bit Allocation
budget=14_2222222 14 Bits
budget=14_learn 14 Bits
budget=28_4444444 28 Bits
budget=28_learn 28 Bits
budget=56_8888888 56 Bits
budget=56_learn 56 Bits
Table 5: Final training results and top-1 error for the ILSVRC12 baseline and learned precision models.

4.2 ILSVRC12

The results for ILSVRC12 are summarized in Table 5. Again, models that employ stochastic precision allocation converge faster and ultimately reach a lower test error than their fixed-precision counterparts on the same budget. We observe that networks trained with stochastic precision learn to take bits from early layers and assign them to later layers, similar to the MNIST results. While the class logits learned by the MNIST networks showed trends similar to the ILSVRC12 results, the differences were not substantial enough to change the bit allocation during our hard assignments. The ILSVRC12-trained networks, however, do make non-uniform hard assignments. This suggests that the precision layer has a larger effect on more complex networks.

5 Conclusion and Future Work

We introduced a precision allocation layer for DNNs, and proposed a stochastic allocation scheme for learning precision on a fixed budget. We have shown that learned precision models outperform uniformly-allocated low precision models. This effect is due both to learning the optimal layer-wise configuration of precision and to the regularization effect of stochastically exploring different precision configurations during training. Moreover, the use of precision budgets allows a high degree of hardware acceleration determinism, which has practical implications.

While the present experiments were focused on accuracy rather than computational efficiency, future work will examine using GPU bit kernels in place of the full precision kernels we used in our experiments. We also intend to investigate stochastic precision in the adversarial setting. This is inspired by Galloway et al. (2018), who report that stochastic quantization at test time yields robustness towards iterative attacks.

Finally, we are interested in a variant of the model where rather than directly parametrizing precision, precision is conditioned on the input. While this reduces hardware acceleration determinism in real-time or memory-constrained settings, it would enable a DNN to dynamically adapt its precision configuration to individual examples.

References