Learning Low Precision Deep Neural Networks through Regularization

09/01/2018 ∙ by Yoojin Choi, et al. ∙ SAMSUNG

We consider the quantization of deep neural networks (DNNs) to produce low-precision models for efficient fixed-point inference. Compared to previous approaches that train quantized DNNs directly under the constraints of low-precision weights and activations, we learn the quantization of DNNs with minimal quantization loss through regularization. In particular, we introduce a learnable regularization coefficient to find accurate low-precision models efficiently in training. In our experiments, the proposed scheme yields state-of-the-art low-precision models of AlexNet and ResNet-18, which achieve better accuracy than their previously available low-precision counterparts. We also apply our quantization method to produce low-precision DNNs for image super resolution, and observe only 0.5 dB peak signal-to-noise ratio (PSNR) loss when using binary weights and 8-bit activations. The proposed scheme can be used to train low-precision models from scratch or to fine-tune a well-trained high-precision model to converge to a low-precision model. Finally, we discuss how a similar regularization method can be adopted for DNN weight pruning and compression, and show that 401× compression is achieved for LeNet-5.


I Introduction

Deep neural networks (DNNs) have recently achieved performance breakthroughs in many computer vision tasks [1]. The state-of-the-art performance of modern DNNs comes with over-parametrized, complex structures; nowadays, millions or tens of millions of parameters spread over more than one hundred layers are no longer exceptional. DNN quantization for efficient inference is of great interest, particularly for the deployment of large DNNs on resource-limited platforms such as battery-powered mobile devices (e.g., see [2, 3]). On such hardware platforms, not only are memory and power limited, but basic floating-point arithmetic operations are in some cases not even supported. Hence, it is preferred, and sometimes necessary, to deliver fixed-point DNN models with low-precision weights and activations (feature maps) to curtail memory requirements and reduce computational costs.

In this paper, we propose a DNN quantization method by regularization to generate quantized models of low-precision weights and activations with minimal quantization loss.

  • For weight quantization, we propose training DNNs with the regularization term of the mean-squared-quantization-error (MSQE) for weights. The loss due to quantization is explicitly reduced by the MSQE regularizer. In particular, we define the regularization coefficient as a learnable parameter to derive accurate low-precision models efficiently in training. The common scaling factor (i.e., quantization cell size) for weights in each layer is set to be a learnable parameter as well and is optimized in training to minimize the MSQE.

  • We quantize the activation output of each layer and pass quantized activations as the input to the following layer. Similar to weight quantization, the common scaling factor for quantized activations of each layer is optimized while training to minimize the MSQE.

  • We furthermore present a method of regularizing power-of-two scaling factors for weights and activations, which can be added optionally to the proposed regularization method for DNN quantization. Power-of-two scaling can be computationally advantageous when implemented by simple bit-shift rather than scalar multiplication.

Using the proposed DNN quantization scheme, we obtain low-precision AlexNet [4] and ResNet-18 [5] models that achieve higher accuracy in ImageNet classification [6] than their previously available low-precision counterparts from XNOR-Net [7], DoReFa-Net [8] and HWGQ [9]. We also utilize our quantization method to produce low-precision (CT-)SRCNN [10, 11] models with binary weights and 8-bit activations for image super resolution, and observe only 0.5 dB peak signal-to-noise ratio (PSNR) loss. We finally discuss how our regularization method can be altered for DNN weight pruning and compression, where a compression ratio of 401× is achieved for LeNet-5.

II Previous work and our contributions

Low-precision DNNs have been studied extensively in deep learning recently [12, 13, 14, 15, 16, 17, 18, 19]. Some extremes of low-precision DNNs with binary or ternary weights can be found in [20, 21, 22, 7, 8, 23, 24, 25, 9, 26, 27]. The previous work either focused on the quantization of pre-trained models, with or without re-training, or considered training low-precision models from scratch.

To train low-precision DNNs, a series of papers on binary neural networks [20, 21, 22] suggests utilizing high-precision shadow weights to accumulate high-precision gradient values in training. The high-precision weights are binarized (or quantized) after being updated in every training iteration, and the gradients are then computed from the network loss function evaluated with the binarized (or quantized) weights. In this framework, stochastic rounding [28, 13] is examined for better convergence instead of deterministic rounding, and the straight-through estimator [29] is employed for the backpropagation of gradients through the quantization function. These training techniques are further optimized and enhanced in subsequent work [7, 8, 25, 9, 26, 27]. Scaling factors for quantized weights and activations are either fixed before training or heuristically updated during training, e.g., by the overflow rate [12, 18].

In this paper, we propose a method for learning quantized low-precision models through regularization. The quantization loss is captured explicitly by the MSQE regularization term, and the network is trained to reduce it as much as possible along with the main target (e.g., classification) loss function. In particular, the high-precision weights gradually converge towards the quantization centroids as we minimize the MSQE in training, and therefore the gradient descent becomes more accurate than in conventional methods without regularization. Moreover, using the learnable regularization coefficient, the network is guided to an accurate quantized model with smooth convergence. The scaling factors for quantized weights and activations are also optimized systematically in training to minimize the MSQE.

We emphasize that our DNN quantization scheme through regularization is different from the loss-aware weight quantization in [26, 27], where approximate solutions using the proximal Newton algorithm are presented to minimize the network loss function under the constraints of low-precision weights. No regularization is considered in [26, 27].

Weight sharing is another DNN compression scheme, studied in [30, 31, 32, 33, 34, 35, 36, 37, 38]. It reduces the number of distinct weight values in a DNN by quantization. Contrary to low-precision weights from linear quantization, weight sharing allows non-linear quantization, where the quantization output levels do not have to be evenly spaced. Hence, quantized weights from weight sharing are still represented in high precision, implying that high-precision arithmetic operations are needed in inference, even though the model size can be compressed by lossless source coding.

Weight pruning is a special case of weight sharing where the shared value is zero. It removes redundant weights completely from DNNs so that one can even skip computations for the pruned ones. Some successful pruning algorithms can be found in [39, 40, 41, 42, 43]. In this paper, we discuss how regularization can be used for weight pruning and show that we achieve 401× compression for the exemplary LeNet-5 model by combining our weight pruning and quantization schemes.

To the best of our knowledge, we are also the first to evaluate low-precision DNNs for a regression problem, i.e., image super resolution. The image super resolution problem is to synthesize a high-resolution image from a low-resolution one. The DNN output is the high-resolution image corresponding to the input low-resolution image, and thus the loss due to quantization is more prominent. Using the proposed quantization method, we show by experiments that we can quantize super resolution DNNs successfully with binary weights and 8-bit activations at marginal accuracy loss, in both the objective image quality metric measured by the peak signal-to-noise ratio (PSNR) and the perceptual score measured by the structural similarity index (SSIM) [44].

III Low-precision DNN model

We consider low-precision DNNs that are capable of efficient processing in the inference stage by using fixed-point arithmetic operations. In particular, we focus on the fixed-point implementation of convolutional and fully-connected layers, since they are the dominant parts of computational costs and memory requirements in DNNs [2, Table II].

The major bottleneck of efficient DNN processing is known to be memory access [2, Section V-B]. Horowitz provides rough energy costs of various arithmetic and memory access operations for 45 nm technology in [45, Figure 1.1.9], where we can find that memory accesses typically consume more energy than arithmetic operations, and that the memory access cost increases with the read size. Hence, for example, deploying binary models instead of 32-bit models is expected to reduce their energy consumption considerably, owing to 32 times fewer bits being read from memory.

Low-precision weights and activations basically stem from linear quantization (e.g., see [46, Section 5.4]), where quantization cell boundaries are uniformly spaced and quantization output levels are the midpoints of cell intervals. Quantized weights and activations are represented by fixed-point numbers of small bit-width. Common scaling factors (i.e., quantization cell sizes) are defined in each layer for fixed-point weights and activations, respectively, to alter their dynamic ranges.

Fig. 1: Low-precision convolutional layer using fixed-point (FXP) convolution and bias addition.

Figure 1 shows the fixed-point design of a general convolutional layer consisting of convolution, bias addition and non-linear activation. Fixed-point weights and input feature maps are given with their respective per-layer common scaling factors. The convolution operation can then be implemented with fixed-point multipliers and accumulators. Biases, if present, are added after the convolution, and the output is scaled by the product of the scaling factors for the weights and the input feature maps, as shown in the figure. The scaling factor for the biases is set to this same product so that fixed-point bias addition can be performed without another scaling. A non-linear activation function then follows, and the output activations are finally fed into the next layer as its input.

With rectified linear unit (ReLU) activation, the two scaling operations across two consecutive layers, i.e., the output scaling of one layer and the input scaling of the next, can be combined into a single scaling operation before (or after) the ReLU activation. Furthermore, if the scaling factors are power-of-two numbers, scaling can be implemented by bit-shift. Similarly, low-precision fully-connected layers can be implemented by replacing the convolution in the figure with a matrix multiplication.
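As an illustration of this data flow (not part of the original design), the following sketch emulates such a layer in software, assuming a fully-connected (matrix-multiplication) layer for brevity; the function name, argument names, and numeric types are our own choices.

```python
import numpy as np

def fxp_layer(x_q, s_in, w_q, s_w, b_q, out_bits, s_out):
    """Emulate one low-precision layer in the style of Figure 1.

    x_q, w_q, b_q : integer codes for inputs, weights, and biases
    s_in, s_w     : per-layer scaling factors for inputs and weights
    s_out         : scaling factor used to re-quantize the output activations
    The bias shares the scale s_w * s_in, so it is added before rescaling.
    """
    acc = x_q.astype(np.int64) @ w_q.astype(np.int64)   # fixed-point MAC
    acc = acc + b_q                                      # bias at scale s_w * s_in
    y = (s_w * s_in) * acc                               # single scaling per layer
    y = np.maximum(y, 0.0)                               # ReLU activation
    # unsigned re-quantization of the activations for the next layer
    y_q = np.clip(np.round(y / s_out), 0, 2 ** out_bits - 1).astype(np.int64)
    return y_q
```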

IV Regularization for quantization

In this section, we present the regularizers that are utilized to learn quantized DNNs with low-precision weights and activations. We first define the quantization function. Given the number of bits, i.e., bit-width $n$, the quantization function yields

$Q_n(x; \delta) = \delta \cdot \mathrm{clip}\big(\mathrm{round}(x/\delta),\, -2^{n-1},\, 2^{n-1}-1\big)$,    (1)

where $x$ is the input and $\delta$ is the scaling factor (quantization cell size); the rounding and clipping functions satisfy

$\mathrm{round}(x) = \lfloor x + 1/2 \rfloor, \qquad \mathrm{clip}(x; a, b) = \min(\max(x, a), b)$,

where $\lfloor x \rfloor$ is the largest integer smaller than or equal to $x$. For ReLU activation, the output is always non-negative, and thus we use the unsigned quantization function given by

$Q_n^{u}(x; \delta) = \delta \cdot \mathrm{clip}\big(\mathrm{round}(x/\delta),\, 0,\, 2^{n}-1\big)$,    (2)

for $x \ge 0$.
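The two quantizers can be written in a few lines; the sketch below (PyTorch, with function names of our own choosing) follows the clip-and-round form of the standard linear quantizer described above.

```python
import torch

def quantize_signed(x, delta, n):
    """Linear quantization with cell size delta and bit-width n (cf. (1))."""
    levels = torch.clamp(torch.round(x / delta), -2 ** (n - 1), 2 ** (n - 1) - 1)
    return delta * levels

def quantize_unsigned(x, delta, n):
    """Unsigned variant for non-negative (ReLU) outputs (cf. (2))."""
    levels = torch.clamp(torch.round(x / delta), 0, 2 ** n - 1)
    return delta * levels
```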

IV-A Regularization for weight quantization

Consider a general non-linear DNN model consisting of $L$ layers. Let $\mathcal{W}_1, \dots, \mathcal{W}_L$ be the sets of weights in layers $1$ to $L$, respectively. For notational simplicity, we let $A_{1:L} = (A_1, \dots, A_L)$ for any symbol $A$. For weight quantization, we define the MSQE regularizer for the weights of all layers as

$R_n(\mathcal{W}_{1:L}; \delta_{1:L}) = \frac{1}{N} \sum_{l=1}^{L} \sum_{w \in \mathcal{W}_l} \big(w - Q_n(w; \delta_l)\big)^2$,    (3)

where $n$ is the bit-width for quantized weights, $\delta_l$ is the scaling factor (i.e., quantization cell size) for quantized weights in layer $l$, and $N$ is the total number of weights in all layers, i.e., $N = \sum_{l=1}^{L} |\mathcal{W}_l|$. We assume that the bit-width $n$ is the same for all layers just for notational simplicity; this can be easily extended to more general cases in which each layer has a different bit-width.

Including the MSQE regularizer in (3), the cost function to optimize in training is given by

$C(\mathcal{X}; \mathcal{W}_{1:L}, \delta_{1:L}) = \mathcal{L}\big(\mathcal{X}; Q_n(\mathcal{W}_{1:L}; \delta_{1:L})\big) + \alpha\, R_n(\mathcal{W}_{1:L}; \delta_{1:L})$,    (4)

where, with a slight abuse of notation, $Q_n(\mathcal{W}_{1:L}; \delta_{1:L})$ denotes the set of quantized weights of all layers, $\mathcal{L}$ is the target loss function evaluated on the training dataset $\mathcal{X}$ using all the quantized weights, and $\alpha > 0$ is the regularization coefficient. We set the scaling factors $\delta_{1:L}$ to be learnable parameters and optimize them along with the weights $\mathcal{W}_{1:L}$.
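A minimal sketch of the MSQE term, reusing quantize_signed from the sketch above; the list-based interface is our own choice, and the commented cost line only illustrates how (4) combines the two terms.

```python
def msqe_regularizer(weights, deltas, n):
    """MSQE of the weights over all layers (cf. (3)).

    weights : list of per-layer weight tensors
    deltas  : list of per-layer scaling factors (quantization cell sizes)
    """
    total_err, total_count = 0.0, 0
    for w, delta in zip(weights, deltas):
        w_q = quantize_signed(w, delta, n)
        total_err = total_err + ((w - w_q) ** 2).sum()
        total_count += w.numel()
    return total_err / total_count

# Training cost in the spirit of (4): task loss on the quantized forward pass
# plus the weighted MSQE term, e.g.
#   cost = loss_fn(quantized_model(x), y) + alpha * msqe_regularizer(weights, deltas, n)
```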

Remark 1.

We clarify that our training uses high precision for its backward passes and gradient descent. However, its forward passes use quantized low-precision weights and activations, and the main target network loss function is also calculated with the quantized low-precision weights and activations to mimic the low-precision inference-stage loss. Hence, the final trained models are low-precision models, which can be operated on low-precision fixed-point hardware in inference.

Remark 2.

The high-precision weights accumulate the gradients that are evaluated with their quantized values, and thus we still have the gradient mismatch problem, similar to the existing approaches (see Section II). However, by adding the MSQE regularizer, we encourage the high-precision weights to converge to their quantized values, and we make the gradient descent more accurate.

Learnable regularization coefficient: The regularization coefficient  in (4) is a hyper-parameter that controls the trade-off between the loss and the regularization. It is conventionally fixed ahead of training. However, searching for a good hyper-parameter value is usually time-consuming. Hence, we propose the learnable regularization coefficient, i.e., we let the regularization coefficient be another learnable parameter.

We start training with a small initial value of the regularization coefficient, i.e., with little regularization. However, we promote its increase during training by adding a penalty term to the cost function that is large when the regularization coefficient is small (see (5)). The increasing coefficient reinforces the convergence of high-precision weights to their quantized values, reducing the MSQE regularization term (see Remark 4). In this way, we gradually boost the regularization and encourage a soft transition of high-precision weights to their quantized values. It consequently alleviates the gradient mismatch problem that we mentioned in Remark 2.

The cost function in (4) is altered into

(5)

where we introduce a new hyper-parameter while making the regularization coefficient learnable. We note that the trade-off between the loss and the regularization is now effectively controlled by this new hyper-parameter rather than by the regularization coefficient itself, i.e., the larger its value, the more regularization we eventually obtain. This transfer is beneficial, since the new hyper-parameter does not directly impact either the loss or the regularization, and we can induce a smooth transition of the high-precision weights to their quantized values.
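The text states only that the coefficient starts small, stays positive, and is pushed upward by a penalty on small values; the sketch below fills in those mechanics with assumptions of ours (the exponential parameterization and the lam / alpha penalty are placeholders, not the paper's exact formulas).

```python
import torch

# alpha is kept positive by learning its logarithm; it starts small (e^-4 here).
log_alpha = torch.full((1,), -4.0, requires_grad=True)

def regularized_cost(task_loss, msqe, lam):
    """Cost with a learnable regularization coefficient (cf. (5)).

    The penalty lam / alpha is only a placeholder for "a penalty term that is
    large when the coefficient is small"; the paper's exact form is not shown.
    """
    alpha = torch.exp(log_alpha)
    return task_loss + alpha * msqe + lam / alpha
```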

Fig. 2: Weight histogram snapshots of the LeNet-5 second convolutional layer, captured at increasing training batch iteration counts (panels (a)–(d), left to right), while a pre-trained model is quantized to 4-bit weights and activations with the proposed regularization method.

Figure 2 presents an example of how high-precision weights are gradually quantized by our regularization scheme. We plot weight histogram snapshots captured at the second convolutional layer of the LeNet-5 model (https://github.com/BVLC/caffe/tree/master/examples/mnist) while a pre-trained model is quantized to a 4-bit fixed-point model. The histograms in the figure, from left to right, correspond to increasing numbers of training batch iterations. Observe that the weight distribution gradually converges to a sum of uniformly spaced delta functions, and that all high-precision weights eventually converge to quantized values.

Fig. 3: LeNet-5 4-bit fixed-point model convergence curves when trained from scratch. The learning rate is reduced from 0.01 to 0.001 and then to 0.0001 at 42 and 85 epochs, respectively.

The proposed weight quantization method is also applicable to learning a quantized DNN from scratch. In Figure 3, we plot the convergence curves of a 4-bit fixed-point LeNet-5 model trained from scratch. Observe that the regularization term decreases smoothly, while the cross-entropy network loss decreases with some jitter due to stochastic gradient descent. The cross-entropy loss and the top-1 accuracy on the test dataset converge and remain stable.

We note that biases are treated similarly to weights. However, for the fixed-point design presented in Section III, the scaling factor used for the biases in (3) is the product of the weight scaling factor and the scaling factor for the input feature maps (i.e., the activations from the previous layer), the latter being determined by the activation quantization procedure described next.

IV-B Regularization for quantization of activations

We quantize the output activation (feature map) of each layer $l$ using the unsigned quantization function in (2) with bit-width $m$ and a learnable scaling factor $\Delta_l$ for the quantized activations of layer $l$, and pass the quantized activations to the following layer as its input. (Note that $\Delta_l$ here denotes the scaling factor for the output activations of layer $l$, whereas in Section III, see Figure 1, the same quantity was indexed as the scaling factor for the input feature maps of the next layer; this is just an index shift in the notation, adopted for simplicity, since the output of one layer is the input to the next.) Similar to (3), we assume that the activation bit-width $m$ is the same for all layers, but this constraint can be easily relaxed to cover more general cases where each layer has a different bit-width. We also assume ReLU activation and use the unsigned quantization function; for a general non-linear activation, it can be replaced with the signed quantization function in (1).

We optimize $\Delta_l$ by minimizing the MSQE for the activations of layer $l$, i.e., we minimize

$R_m(\mathcal{A}_l; \Delta_l) = \frac{1}{|\mathcal{A}_l|} \sum_{x \in \mathcal{A}_l} \big(x - Q_m^{u}(x; \Delta_l)\big)^2$,    (6)

where $\mathcal{A}_l$ is the set of activations of layer $l$, for $1 \le l \le L$.
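As an illustration of how activation quantization can be wired into a network, here is a sketch of a quantized ReLU module (names, initial scale, and module structure are ours): the activation MSQE is stored so it can be added to the training cost, the straight-through estimator passes the task-loss gradient inside the clipping range, and the scale receives gradients only through the MSQE term.

```python
import torch

class QuantReLU(torch.nn.Module):
    """ReLU followed by unsigned quantization with a learnable per-layer scale."""

    def __init__(self, bits, init_scale=0.1):
        super().__init__()
        self.bits = bits
        self.scale = torch.nn.Parameter(torch.tensor(init_scale))
        self.msqe = torch.tensor(0.0)

    def forward(self, x):
        x = torch.relu(x)
        levels = torch.clamp(torch.round(x / self.scale), 0, 2 ** self.bits - 1)
        x_q = self.scale * levels
        self.msqe = ((x - x_q) ** 2).mean()     # regularizer that trains the scale
        # straight-through estimator, zeroed outside the clipping range
        upper = self.scale.detach() * (2 ** self.bits - 1)
        ste = torch.where(x < upper, x, x.detach())
        return ste + (x_q - ste).detach()
```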

IV-C Regularization for power-of-two scaling

In fixed-point computations, it is more appealing for the scaling factors to be powers of two, so that scaling can be implemented by simple bit-shift rather than scalar multiplication. To obtain power-of-two scaling factors for weights, we introduce an additional regularizer on the weight scaling factors $\delta_{1:L}$, given by

(7)

where the rounding function in (7) maps its argument to the closest power-of-two number. The scaling factors for activations, $\Delta_{1:L}$, can be regularized similarly by

(8)
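A small sketch of the power-of-two rounding and a regularizer in the spirit of (7) and (8); the squared-error form and the log-domain rounding are assumptions of ours for illustration.

```python
import math

def round_to_power_of_two(value):
    """Round a positive scalar to a power of two (nearest in the log domain)."""
    return 2.0 ** round(math.log2(value))

def pow2_scale_regularizer(scales):
    """Penalize the distance of each scaling factor from a power of two."""
    return sum((s - round_to_power_of_two(float(s))) ** 2 for s in scales)
```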

V Training with regularization

In this section, we derive the gradients for the learnable parameters, i.e., weights, regularization coefficients and scaling factors, when the regularizers for quantization are included in the cost function. The regularized gradients are then used to update the parameters by gradient descent.

V-A Cost function

We first define the cost function combining (5) and (6) for both weight and activation quantization as follows:

(9)

where the weights $\mathcal{W}_{1:L}$, the weight regularization coefficient $\alpha$, and the scaling factors $\delta_{1:L}$ and $\Delta_{1:L}$ are all learnable parameters, while two hyper-parameters are fixed in training.

For power-of-two scaling, we have additional regularizers provided in Section IV-C as follows:

(10)

where, similar to the weight case, the regularization coefficients for the power-of-two scaling terms are learnable, and two new hyper-parameters are fixed in training.

V-B Gradients for weights

The gradient of the cost function in (9) with respect to a weight $w$ of layer $l$, $1 \le l \le L$, satisfies

(11)

The first partial derivative on the right-hand side of (11) can be obtained efficiently by the DNN backpropagation algorithm. For backpropagation through the weight quantization function, we use the following approximation, similar to the straight-through estimator [29]:

(12)

where the indicator function is one if its argument is true and zero otherwise. Namely, we pass the gradient through the quantization function when the weight is within the clipping boundary. Moreover, to give the weight some room to move around the boundary in stochastic gradient descent, we additionally allow a small margin beyond the lower and upper clipping boundaries. Outside the clipping boundary extended by this margin, we pass zero.
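A sketch of the resulting gradient mask; the exact margin values are elided in the text, so the margin is left here as a free parameter (in units of quantization cells), which is our assumption.

```python
def weight_grad_mask(w, delta, n, margin):
    """Indicator used in the straight-through approximation (cf. (12)): 1 where
    the task-loss gradient is passed through the weight quantizer, 0 outside
    the clipping range extended by the assumed margin."""
    lo = -delta * (2 ** (n - 1) + margin)
    hi = delta * (2 ** (n - 1) - 1 + margin)
    return ((w >= lo) & (w <= hi)).float()
```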

For a weight $w$ of layer $l$, $1 \le l \le L$, the partial derivative of the regularizer in (3) satisfies

(13)

almost everywhere, except at the non-differentiable points located at the quantization cell boundaries given by

(14)

If the weight is located at one of these boundaries, updating it towards either of the two neighboring quantization output levels makes no difference in terms of its quantization error. Thus, we let

(15)

From (11)–(15), we finally have

(16)
Remark 3.

If the weight is located at one of the cell boundaries, the weight gradient is solely determined by the derivative of the network loss function, and thus the weight is updated in the direction that minimizes the network loss. Otherwise, the regularization term impacts the gradient as well and encourages the weight to converge to the closest cell center, as long as the resulting change in the loss function is small. The regularization coefficient trades off these two contributions of the network loss function and the regularization term.
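A scalar sketch of the regularizer's contribution to this combined rule, under the assumptions made earlier (the 1/N-averaged MSQE and the convention that the contribution vanishes exactly at a cell boundary):

```python
def msqe_weight_grad(w, delta, n, num_weights):
    """Derivative of the MSQE regularizer w.r.t. a single weight (cf. (13)-(16)).
    At a quantization cell boundary the contribution is set to zero, so the
    update there is driven by the task loss alone (Remark 3)."""
    level = min(max(round(w / delta), -2 ** (n - 1)), 2 ** (n - 1) - 1)
    w_q = delta * level
    x = w / delta
    at_boundary = abs(abs(x - round(x)) - 0.5) < 1e-12   # halfway between levels
    return 0.0 if at_boundary else 2.0 * (w - w_q) / num_weights
```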

V-C Gradient for the regularization coefficient

The gradient of the cost function with respect to the regularization coefficient is given by

(17)

Observe that, under gradient descent with (17), the regularization coefficient tends to grow as the MSQE regularization term shrinks.

Remark 4.

Recall that weights gradually tend to their closest quantization output levels to reduce the regularizer (see Remark 3). As the regularizer decreases, the regularization coefficient gets larger by gradient descent using (17). A larger regularization coefficient then further forces the weights to move towards quantized values in the following update. In this manner, the weights gradually converge to quantized values.

V-D Gradients for scaling factors

For scaling factor optimization, we approximately consider only the MSQE regularization term, for simplicity. Using the chain rule for (3), it follows that

(18)

for $1 \le l \le L$. Moreover, it can be shown that

(19)

almost everywhere, except at the non-differentiable points satisfying

(20)

Similar to (15), we let

(21)

so that the scaling factor is not impacted by the weights located at the cell boundaries. Combining (18)–(21) yields the overall gradient of the MSQE regularizer with respect to each weight scaling factor.

Similarly, one can derive the gradients for the activation scaling factors from (6) and (9), which we omit here.
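A scalar sketch of that scaling-factor gradient under the same assumptions (averaged MSQE, integer levels treated as piecewise constant in the cell size, boundary weights skipped):

```python
def msqe_scale_grad(layer_weights, delta, n, num_weights):
    """Approximate derivative of the MSQE regularizer w.r.t. one layer's cell
    size (cf. (18)-(21))."""
    grad = 0.0
    for w in layer_weights:
        x = w / delta
        if abs(abs(x - round(x)) - 0.5) < 1e-12:
            continue  # weights at cell boundaries do not affect the scaling factor
        level = min(max(round(x), -2 ** (n - 1)), 2 ** (n - 1) - 1)
        grad += -2.0 * level * (w - delta * level) / num_weights
    return grad
```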

V-E Gradients for power-of-two scaling

For power-of-two scaling, the cost function in (10) has the additional regularization terms from Section IV-C. Similar to (16), the gradient with respect to each weight scaling factor acquires an extra term that vanishes at the power-of-two rounding boundaries, i.e., the points at which the power-of-two rounding function switches between two consecutive powers of two. Moreover, as in (17), the corresponding regularization coefficient for power-of-two scaling is updated by gradient descent. In a similar manner, one can obtain the additional gradients for the activation scaling factors and their regularization coefficient from (8) and (10), respectively, which we do not repeat here.

V-F Implementation details

For low-precision DNNs, we define the cost function as (9) and update learnable parameters, i.e., weights, regularization coefficients and scaling factors, by gradient descent. If power-of-two scaling is needed, we use the cost function (10) instead.

Initialization: Provided a pre-trained high-precision model, the weight scaling factors $\delta_{1:L}$ are initialized to cover the dynamic range of the pre-trained weights, e.g., based on a high percentile of the weight magnitudes in each layer. Similarly, the activation scaling factors $\Delta_{1:L}$ are set to cover the dynamic range of the activations in each layer, which is obtained by feeding a small number of training data to the pre-trained model. The regularization coefficients are initially set to small values, and gradient descent then boosts them gradually (see Remark 4).
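A minimal sketch of such an initialization; the exact percentile is not specified in the text, so the 99th percentile below is an assumption.

```python
import numpy as np

def init_weight_scale(layer_weights, n, percentile=99.0):
    """Initialize a layer's cell size so that the clipping range covers roughly
    the dynamic range of the pre-trained weights (percentile is an assumption)."""
    w_max = np.percentile(np.abs(layer_weights), percentile)
    return w_max / (2 ** (n - 1) - 1)   # largest positive level reaches w_max
```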

Backpropagation through activation quantization: Backpropagation is not feasible through the activation quantization function analytically since the gradient is zero almost everywhere. For backpropagation through the quantization function, we adopt the straight-through estimator [29]. In particular, we pass the gradient through the quantization function when the input is within the clipping boundary. If the input is outside the clipping boundary, we pass zero.

Complexity: The additional computational cost of the regularized gradients is modest; it scales only linearly with the number of weights $N$. Hence, the proposed algorithm is easily applicable to state-of-the-art DNNs with millions or tens of millions of weights.

Remark 5 (Comparison to soft weight sharing).

In soft weight sharing [47, 33], a Gaussian mixture prior is assumed, and the model is regularized to form groups of weights with similar values around the Gaussian component centers (e.g., see [48, Section 5.5.7]). Our weight regularization method differs from soft weight sharing in that we consider linear quantization and optimize common scaling factors, instead of optimizing individual Gaussian component centers for non-linear quantization. We furthermore employ the simple MSQE regularization term for quantization, so that our method scales to large DNNs. Note that soft weight sharing yields a regularization term involving the logarithm of a sum of Gaussian probability density functions (i.e., exponential functions), which is sometimes too costly to evaluate for modern DNNs with millions or tens of millions of weights.

VI Experiments

We evaluate the proposed low-precision DNN quantization method on ImageNet classification and image super resolution. Image super resolution is included in our experiments as a regression problem, since its accuracy is more sensitive to quantization than classification accuracy. To the best of our knowledge, we are the first to evaluate DNN quantization for regression problems. We compare our approach to the current state-of-the-art techniques, XNOR-Net [7], DoReFa-Net [8], and HWGQ [9].

VI-A ImageNet classification

We first evaluate our quantization scheme on AlexNet [4] and ResNet-18 [5] with the ImageNet ILSVRC 2012 dataset [6]. For the AlexNet model, similar to the previous methods in [7, 8, 9], we add batch normalization to the convolutional and fully-connected layers before applying non-linear activations.

In training, we use the Adam optimizer [49] with a fixed initial learning rate and per-model batch sizes for AlexNet and ResNet-18; the learning rate is then decreased once and training continues for a further number of batches. The hyper-parameters in (9) are fixed. For the learnable regularization coefficient, we reparameterize it and learn the new parameter instead, in order to keep the coefficient always positive during training; it is initialized to a small value and updated with the Adam optimizer using its own learning rate.

Model | Ours | XNOR-Net [7] | DoReFa-Net [8] | HWGQ [9]
AlexNet | 53.0 / 76.8 | 44.2 / 69.2 | 49.3 / 74.1* | 52.7 / 76.3
ResNet-18 | 60.4 / 83.3 | 51.2 / 73.2 | N/A | 59.6 / 82.2
* from our experiments
TABLE I: Top-1 / Top-5 accuracy (%) comparison of the low-precision DNN quantization methods for binary weights and 2-bit activations.

We summarize the accuracy of our low-precision AlexNet and ResNet-18 models with binary weights and 2-bit activations in Table I and compare them to the existing low-precision models from XNOR-Net [7], DoReFa-Net [8] and HWGQ [9]. The results show that our method yields state-of-the-art low-precision models that achieve higher accuracy than the previously available ones. More experimental results for various bit-widths can be found below in Table II and Table III.

Quantized layers | Weights | Activations | Ours (Top-1 / Top-5, %) | DoReFa-Net [8]* (Top-1 / Top-5, %)
- | 32-bit FLP | 32-bit FLP | 58.0 / 80.8 | -
(1) All layers | 8-bit FXP | 8-bit FXP | 57.0 / 79.9 | 57.6 / 80.8
(1) All layers | 4-bit FXP | 4-bit FXP | 56.5 / 79.4 | 56.9 / 80.3
(1) All layers | 2-bit FXP | 2-bit FXP | 53.5 / 77.3 | 43.0 / 68.1
(1) All layers | 1-bit FXP | 8-bit FXP | 52.2 / 75.8 | 47.5 / 72.1
(1) All layers | 1-bit FXP | 4-bit FXP | 52.0 / 75.7 | 45.1 / 69.7
(1) All layers | 1-bit FXP | 2-bit FXP | 50.5 / 74.6 | 43.6 / 68.3
(1) All layers | 1-bit FXP | 1-bit FXP | 41.1 / 66.6 | 19.3 / 38.2
(2) Except the first and the last layers | 8-bit FXP | 8-bit FXP | 57.2 / 79.9 | 57.5 / 80.7
(2) Except the first and the last layers | 4-bit FXP | 4-bit FXP | 56.6 / 79.8 | 56.9 / 80.1
(2) Except the first and the last layers | 2-bit FXP | 2-bit FXP | 54.1 / 77.9 | 53.1 / 77.3
(2) Except the first and the last layers | 1-bit FXP | 8-bit FXP | 54.8 / 78.1 | 51.2 / 75.5
(2) Except the first and the last layers | 1-bit FXP | 4-bit FXP | 54.8 / 78.2 | 51.9 / 75.9
(2) Except the first and the last layers | 1-bit FXP | 2-bit FXP | 53.0 / 76.8 | 49.3 / 74.1
(2) Except the first and the last layers | 1-bit FXP | 1-bit FXP | 43.9 / 69.0 | 40.2 / 65.5
* from our experiments
TABLE II: AlexNet quantization results for ImageNet classification.
Quantized layers | Weights | Activations | Top-1 / Top-5 accuracy (%)
- | 32-bit FLP | 32-bit FLP | 68.1 / 88.4
(1) All layers | 8-bit FXP | 8-bit FXP | 68.1 / 88.3
(1) All layers | 4-bit FXP | 4-bit FXP | 67.4 / 87.9
(1) All layers | 2-bit FXP | 2-bit FXP | 60.6 / 83.7
(1) All layers | 1-bit FXP | 8-bit FXP | 61.3 / 83.7
(1) All layers | 1-bit FXP | 4-bit FXP | 60.2 / 83.2
(1) All layers | 1-bit FXP | 2-bit FXP | 55.6 / 79.6
(1) All layers | 1-bit FXP | 1-bit FXP | 38.9 / 65.4
(2) Except the first and the last layers | 8-bit FXP | 8-bit FXP | 68.1 / 88.2
(2) Except the first and the last layers | 4-bit FXP | 4-bit FXP | 67.3 / 87.9
(2) Except the first and the last layers | 2-bit FXP | 2-bit FXP | 61.7 / 84.4
(2) Except the first and the last layers | 1-bit FXP | 8-bit FXP | 64.3 / 86.1
(2) Except the first and the last layers | 1-bit FXP | 4-bit FXP | 63.9 / 85.6
(2) Except the first and the last layers | 1-bit FXP | 2-bit FXP | 60.4 / 83.3
(2) Except the first and the last layers | 1-bit FXP | 1-bit FXP | 47.2 / 73.0
TABLE III: ResNet-18 quantization results for ImageNet classification.

In Table II, we compare the performance of our quantization method to DoReFa-Net [8] for the AlexNet model (the DoReFa-Net results in Table II are (re-)produced by us from the code at https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DoReFa-Net). Table III shows the accuracy of the low-precision ResNet-18 models obtained with our quantization method. We evaluate two cases: (1) all layers are quantized, and (2) all layers except the first and the last are quantized. We note that the previous work [7, 8, 9] evaluates case (2) only. The results in Table II and Table III show that 4-bit quantization keeps the accuracy loss small, whereas binary weights incur a more noticeable accuracy loss. Nevertheless, our quantization scheme performs better than DoReFa-Net [8], in particular in the low-precision cases.

Figure 4 shows the convergence curves of the low-precision AlexNet and ResNet-18 models with binary weights and 2-bit activations. The regularization term consistently decreases up to some level while the weights converge to quantized values. Beyond that level, the regularization term does not decrease further, since more regularization would require a considerable accuracy loss. After reaching this regularization level, the network loss function is further optimized to find the best quantized model. Observe that the cross-entropy loss keeps improving and the test accuracy also increases while the regularization term saturates.

(a) AlexNet
(b) ResNet-18
Fig. 4: Binary weight and 2-bit activation fixed-point model training convergence curves for AlexNet and ResNet-18.

VI-B Image super resolution

Model | Method | Weights | Activations | Set-14 PSNR (dB) | Set-14 SSIM | PSNR loss (dB) | SSIM loss
SRCNN 3-layer | Original model | 32-bit FLP | 32-bit FLP | 29.05 | 0.8161 | - | -
SRCNN 3-layer | Ours | 8-bit FXP | 8-bit FXP | 29.03 | 0.8141 | 0.02 | 0.0020
SRCNN 3-layer | Ours | 4-bit FXP | 8-bit FXP | 28.99 | 0.8133 | 0.06 | 0.0028
SRCNN 3-layer | Ours | 2-bit FXP | 8-bit FXP | 28.72 | 0.8075 | 0.33 | 0.0086
SRCNN 3-layer | Ours | 1-bit FXP | 8-bit FXP | 28.53 | 0.8000 | 0.52 | 0.0161
CT-SRCNN 5-layer | Original model | 32-bit FLP | 32-bit FLP | 29.56 | 0.8273 | - | -
CT-SRCNN 5-layer | Ours | 8-bit FXP | 8-bit FXP | 29.54 | 0.8267 | 0.02 | 0.0006
CT-SRCNN 5-layer | Ours | 4-bit FXP | 8-bit FXP | 29.48 | 0.8258 | 0.08 | 0.0015
CT-SRCNN 5-layer | Ours | 2-bit FXP | 8-bit FXP | 29.28 | 0.8201 | 0.28 | 0.0072
CT-SRCNN 5-layer | Ours | 1-bit FXP | 8-bit FXP | 29.09 | 0.8171 | 0.47 | 0.0102
CT-SRCNN 9-layer | Original model | 32-bit FLP | 32-bit FLP | 29.71 | 0.8300 | - | -
CT-SRCNN 9-layer | Ours | 8-bit FXP | 8-bit FXP | 29.67 | 0.8288 | 0.04 | 0.0012
CT-SRCNN 9-layer | Ours | 4-bit FXP | 8-bit FXP | 29.63 | 0.8285 | 0.08 | 0.0015
CT-SRCNN 9-layer | Ours | 2-bit FXP | 8-bit FXP | 29.37 | 0.8236 | 0.34 | 0.0064
CT-SRCNN 9-layer | Ours | 1-bit FXP | 8-bit FXP | 29.20 | 0.8193 | 0.51 | 0.0107
Bicubic | - | - | - | 27.54 | 0.7742 | - | -
TABLE IV: SRCNN and CT-SRCNN quantization results on the Set-14 dataset.

We next evaluate the proposed method on SRCNN [10] and the cascade-trained SRCNN (CT-SRCNN) [11] for image super resolution. Initializing their weights with pre-trained ones, we train the quantized models using the Adam optimizer with a fixed batch size, learning rate, and hyper-parameters in (9). As in Section VI-A, we reparameterize the regularization coefficient and learn the new parameter instead, in order to keep the coefficient always positive in training; it is initialized to a small value and updated by the Adam optimizer with its own learning rate.

The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [44] on the Set-14 image dataset [50] are compared in Table IV. The results show that our method successfully yields low-precision models with 8-bit weights and activations at negligible loss. It is also interesting to see that the PSNR loss when using binary weights and 8-bit activations is only about 0.5 dB.

Fig. 5: Ablation study for SRCNN quantization. We use "W:n, A:m" to denote a low-precision model with n-bit weights and m-bit activations.

Figure 5 provides the ablation study results for 9-layer CT-SRCNN when quantized from a pre-trained model. We compare the PSNR with and without re-training. Observe that re-training yields considerable gain. The figure also shows little performance loss when scaling factors are restricted to power-of-two numbers by our power-of-two scaling regularization.

VII Further discussion on DNN compression

Although our focus in this paper is mainly on the fixed-point design and the complexity reduction in inference by low-precision DNNs, we can also achieve DNN compression from our weight quantization scheme. In DNN compression, weight pruning plays an important role since the size reduction is huge when the pruning ratio is large, e.g., see [39, 42]. Hence, weight pruning is further investigated in this section.

For weight pruning, we employ a partial L2 regularization term. In particular, given a target pruning ratio, we find the corresponding percentile of the weight magnitude values and take it as the pruning threshold; assuming that the weights whose magnitudes fall below this threshold are to be pruned, we define an L2 regularizer only over those weights. Employing the learnable regularization coefficient as in (5), we obtain the cost function for weight pruning.

The partial L2 regularizer encourages the weights below the threshold to move together towards zero, while the other weights are not regularized but are updated to minimize the performance loss due to pruning. Furthermore, the threshold is recomputed every training iteration from the instantaneous weight distribution. We note that the threshold decreases as training proceeds, since the regularized weights gradually converge to zero. After finishing the regularized training, we are left with a set of weights clustered very near zero, and the loss due to pruning these small weights is negligible.
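A minimal sketch of such a partial L2 term, assuming a pruning ratio given as a fraction in [0, 1] and names of our own choosing; the threshold is recomputed from the current weights on every call.

```python
import torch

def partial_l2_regularizer(weights, prune_ratio):
    """L2 penalty applied only to the weights whose magnitudes fall below the
    current threshold set by the target pruning ratio (Section VII)."""
    all_w = torch.cat([w.reshape(-1) for w in weights])
    threshold = torch.quantile(all_w.abs(), prune_ratio)   # recomputed each iteration
    penalty = 0.0
    for w in weights:
        mask = (w.abs() < threshold).float().detach()      # only small weights are pulled to zero
        penalty = penalty + (mask * w ** 2).sum()
    return penalty, threshold
```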

(a) Deep compression [31]
(b) Ours
Fig. 6: Comparison of DNN compression schemes.

After weight pruning, the pruned model is quantized by re-training with the MSQE regularizer. In this stage, pruned weights are fixed to be zero while unpruned weights are updated for quantization. In Figure 6, we compare our DNN compression pipeline to deep compression [31]. In [31], weight pruning and quantization are performed and then fine-tuning follows separately after each of them. In our compression pipeline, no separate fine-tuning stages are needed. We directly learn the pruned and quantized models by regularization and one final low-precision conversion by linear quantization follows for fixed-point weights. Note that deep compression [31] employs non-linear quantization for size compression only.

Compression method | Unpruned weights (%) | Top-1 accuracy (%), original / compressed | Compression ratio
Han et al. [31] | 8.0 | 99.2 / 99.3 | 39
Choi et al. [32] | 9.0 | 99.3 / 99.3 | 51
Guo et al. [42] | 0.9 | 99.1 / 99.1 | 108
Ullrich et al. [33] | 0.5 | 99.1 / 99.0 | 162
Molchanov et al. [34] | 0.7 | 99.1 / 99.0 | 365
Louizos et al. [36] | 0.6 | 99.1 / 99.0 | 771
Ours | 0.7 | 99.1 / 99.0 | 401
TABLE V: LeNet-5 compression results.

LeNet-5 compression results: For LeNet-5 compression, we prune 99.0% of the weights of a pre-trained model as described above. Then, we employ our weight quantization method for 3-bit weights. Note that we do not quantize activations in this experiment, for fair comparison to the previous schemes [31, 32, 42, 33, 34, 36], which focus on DNN weight pruning and compression only. After quantization, the ratio of zero-value weights, including the already pruned ones, increases from 99.0% to 99.3%, since some unpruned weights fall to zero after quantization. For compression, the 3-bit fixed-point weights of all layers are encoded together by Huffman coding. The indexes of the unpruned weights are also counted in the model size after Huffman coding, as suggested in [31]. Compared to the previous DNN compression schemes, Table V shows that our method yields a high compression ratio of 401× for this exemplary DNN model. We emphasize that our quantization scheme is constrained to produce low-precision fixed-point weights, while the other existing compression schemes result in quantized floating-point weights.

VIII Conclusion

We proposed a method to quantize deep neural networks (DNNs) by regularization, producing low-precision DNNs for efficient fixed-point inference. We also suggested the novel learnable regularization coefficient to find the optimal quantization while minimizing the performance loss. Although our training happens in high precision, particularly for its backward passes and gradient descent, its forward passes use quantized low-precision weights and activations, and thus the resulting networks can be operated on low-precision fixed-point hardware at inference time. We showed by experiments that the proposed quantization algorithm successfully produces low-precision DNNs with binary weights for classification problems, such as ImageNet classification, as well as for regression and image synthesis problems, such as image super resolution. In particular, for the AlexNet and ResNet-18 models, our quantization method produces state-of-the-art low-precision models with binary weights and 2-bit activations, achieving top-1 accuracies of 53.0% and 60.4%, respectively. For image super resolution, we lose only about 0.5 dB PSNR when using binary weights and 8-bit activations instead of 32-bit floating-point numbers. Finally, we also discussed how similar regularization techniques can be employed for weight pruning and network compression.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [3] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision.   Springer, 2016, pp. 525–542.
  • [8] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  • [9] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave Gaussian quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5918–5926.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
  • [11] H. Ren, M. El-Khamy, and J. Lee, “CT-SRCNN: Cascade trained and trimmed deep convolutional neural networks for image super resolution,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
  • [12] M. Courbariaux, J.-P. David, and Y. Bengio, “Training deep neural networks with low precision multiplications,” in International Conference on Learning Representations (ICLR) Workshop, 2015.
  • [13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 1737–1746.
  • [14] D. Lin, S. Talathi, and S. Annapureddy, “Fixed point quantization of deep convolutional networks,” in International Conference on Machine Learning, 2016, pp. 2849–2858.
  • [15] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented approximation of convolutional neural networks,” in International Conference on Learning Representations (ICLR) Workshop, 2016.
  • [16] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional neural networks using logarithmic data representation,” arXiv preprint arXiv:1603.01025, 2016.
  • [17] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” in International Conference on Learning Representations, 2017.
  • [18] X. Chen, X. Hu, H. Zhou, and N. Xu, “FxpNet: Training a deep convolutional neural network in fixed-point representation,” in IEEE International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2494–2501.
  • [19] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein, “Training quantized nets: A deeper understanding,” in Advances in Neural Information Processing Systems, 2017, pp. 5813–5823.
  • [20] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
  • [21] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” in International Conference on Learning Representations, 2016.
  • [22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
  • [23] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
  • [24] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, “Ternary neural networks with fine-grained quantization,” arXiv preprint arXiv:1705.01462, 2017.
  • [25] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” in International Conference on Learning Representations, 2017.
  • [26] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” in International Conference on Learning Representations, 2017.
  • [27] L. Hou and J. T. Kwok, “Loss-aware weight quantization of deep networks,” in International Conference on Learning Representations, 2018.
  • [28] M. Höhfeld and S. E. Fahlman, “Probabilistic rounding in neural network learning with limited precision,” Neurocomputing, vol. 4, no. 6, pp. 291–299, 1992.
  • [29] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [30] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
  • [31] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in International Conference on Learning Representations, 2016.
  • [32] Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” in International Conference on Learning Representations, 2017.
  • [33] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” in International Conference on Learning Representations, 2017.
  • [34] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning, 2017, pp. 2498–2507.
  • [35] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
  • [36] C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3290–3300.
  • [37] E. Park, J. Ahn, and S. Yoo, “Weighted-entropy-based quantization for deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 7197–7205.
  • [38] Y. Choi, M. El-Khamy, and J. Lee, “Universal deep neural network compression,” arXiv preprint arXiv:1802.02271, 2018.
  • [39] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
  • [40] V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
  • [41] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
  • [42] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Advances In Neural Information Processing Systems, 2016, pp. 1379–1387.
  • [43] J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” in Advances in Neural Information Processing Systems, 2017, pp. 2178–2188.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [45] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in IEEE International Solid-State Circuits Conference, 2014, pp. 10–14.
  • [46] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.   Springer Science & Business Media, 2012, vol. 159.
  • [47] S. J. Nowlan and G. E. Hinton, “Simplifying neural networks by soft weight-sharing,” Neural Computation, vol. 4, no. 4, pp. 473–493, 1992.
  • [48] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.
  • [49] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [50] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces.   Springer, 2010, pp. 711–730.