I Introduction
Deep neural networks (DNNs) have recently achieved performance breakthroughs in many computer vision tasks [1]. The state-of-the-art performance of modern DNNs comes with over-parametrized complex structures, and nowadays millions or tens of millions of parameters in more than one hundred layers are no longer exceptional. DNN quantization for efficient inference is of great interest, particularly for deployment of large-size DNNs on resource-limited platforms such as battery-powered mobile devices (e.g., see [2, 3]). In such hardware platforms, not only are memory and power limited, but basic floating-point arithmetic operations are in some cases not supported. Hence, it is preferred and sometimes necessary to deliver fixed-point DNN models of low-precision weights and activations (feature maps) for curtailing memory requirements and reducing computational costs.
In this paper, we propose a DNN quantization method by regularization to generate quantized models of low-precision weights and activations with minimal quantization loss.

For weight quantization, we propose training DNNs with the regularization term of the mean-squared-quantization-error (MSQE) for weights. The loss due to quantization is explicitly reduced by the MSQE regularizer. In particular, we define the regularization coefficient as a learnable parameter to derive accurate low-precision models efficiently in training. The common scaling factor (i.e., quantization cell size) for weights in each layer is set to be a learnable parameter as well and is optimized in training to minimize the MSQE.

We quantize the activation output of each layer and pass quantized activations as the input to the following layer. Similar to weight quantization, the common scaling factor for quantized activations of each layer is optimized while training to minimize the MSQE.

We furthermore present a method of regularizing power-of-two scaling factors for weights and activations, which can be added optionally to the proposed regularization method for DNN quantization. Power-of-two scaling can be computationally advantageous when implemented by simple bit-shift rather than scalar multiplication.
Using the proposed DNN quantization scheme, we obtain low-precision AlexNet [4] and ResNet-18 [5] models that achieve higher accuracy in ImageNet classification [6] than their previously available low-precision models from XNOR-Net [7], DoReFa-Net [8] and HWGQ [9]. We utilize our quantization method to produce low-precision (CT-)SRCNN [10, 11] models of binary weights and 8-bit activations for image super resolution, and observe only dB peak signal-to-noise ratio (PSNR) loss. We finally discuss how our regularization method can be altered for DNN weight pruning and compression, and the compression ratio of is achieved for LeNet-5.
II Previous work and our contributions
Low-precision DNNs have been studied extensively in deep learning recently [12, 13, 14, 15, 16, 17, 18, 19]. Some extremes of low-precision DNNs of binary or ternary weights can be found in [20, 21, 22, 7, 8, 23, 24, 25, 9, 26, 27]. The previous work either focused on quantization of pretrained models with/without retraining or considered training low-precision models from scratch.
To train low-precision DNNs, a series of papers on binary neural networks [20, 21, 22] suggests the method of utilizing high-precision shadow weights to accumulate high-precision gradient values in training. The high-precision weights are binarized (or quantized) after being updated in every training iteration, and then the gradients are computed from the network loss function evaluated with the binarized (or quantized) weights. In this framework, stochastic rounding [28, 13] is examined for better convergence instead of deterministic rounding, and the straight-through estimator [29] is employed for the backpropagation of gradients through the quantization function. These training techniques are further optimized and enhanced in the subsequent work [7, 8, 25, 9, 26, 27]. Scaling factors for quantized weights and activations are either fixed before training or heuristically updated in training, e.g., by the overflow rate [12, 18].
In this paper, we propose a method for learning quantized low-precision models through regularization. The quantization loss is explicitly defined in the MSQE regularization term and the network is trained to reduce it as much as possible, along with the main target (e.g., classification) loss function. In particular, the high-precision weights gradually converge around the quantization centroids as we minimize the MSQE in training, and therefore the gradient descent becomes more accurate than in the conventional methods without regularization. Moreover, using the learnable regularization coefficient, the network is guided to reach an accurate quantized model with smooth convergence. The scaling factors for quantized weights and activations are also optimized systematically to minimize the MSQE in training.
We emphasize that our DNN quantization scheme through regularization is different from the loss-aware weight quantization in [26, 27], where approximate solutions using the proximal Newton algorithm are presented to minimize the network loss function under the constraints of low-precision weights. No regularization is considered in [26, 27].
Weight sharing is another DNN compression scheme studied in [30, 31, 32, 33, 34, 35, 36, 37, 38]. It reduces the number of distinct weight values in DNNs by quantization. Contrary to low-precision weights from linear quantization, weight sharing allows nonlinear quantization, where quantization output levels do not have to be evenly spaced. Hence, quantized weights from weight sharing are represented in high precision, implying that high-precision arithmetic operations are still needed in inference, although we compress them in size by lossless source coding.
Weight pruning is a special case of weight sharing where the shared value is zero. It curtails redundant weights completely from DNNs so that one can even skip computations for pruned ones. Some successful pruning algorithms can be found in [39, 40, 41, 42, 43]. In this paper, we discuss how regularization can be used for weight pruning and show that we achieve compression for the exemplary LeNet-5 model by combining our weight pruning and quantization schemes.
To the best of our knowledge, we are also the first to evaluate low-precision DNNs for a regression problem, i.e., image super resolution. The image super resolution problem is to synthesize a high-resolution image from a low-resolution one. The DNN output is the high-resolution image corresponding to the input low-resolution image, and thus the loss due to quantization is more prominent. Using the proposed quantization method, we show by experiments that we can quantize super resolution DNNs successfully with binary weights and 8-bit activations at marginal accuracy loss in both the objective image quality metric measured by the peak signal-to-noise ratio (PSNR) and the perceptual score measured by the structural similarity index (SSIM) [44].
III Low-precision DNN model
We consider low-precision DNNs that are capable of efficient processing in the inference stage by using fixed-point arithmetic operations. In particular, we focus on the fixed-point implementation of convolutional and fully-connected layers, since they are the dominant parts of computational costs and memory requirements in DNNs [2, Table II].
The major bottleneck of efficient DNN processing is known to be in memory accesses [2, Section V-B]. Horowitz provides rough energy costs of various arithmetic and memory access operations for 45 nm technology in [45, Figure 1.1.9], where we can find that memory accesses typically consume more energy than arithmetic operations, and the memory access cost increases with the read size. Hence, for example, deploying binary models instead of 32-bit models is expected to reduce their energy consumption by at least 32×, due to 32 times fewer memory accesses.
Low-precision weights and activations basically stem from linear quantization (e.g., see [46, Section 5.4]), where quantization cell boundaries are uniformly spaced and quantization output levels are the midpoints of cell intervals. Quantized weights and activations are represented by fixed-point numbers of small bit-width. Common scaling factors (i.e., quantization cell sizes) are defined in each layer for fixed-point weights and activations, respectively, to alter their dynamic ranges.
Figure 1 shows the fixed-point design of a general convolutional layer consisting of convolution, bias addition and nonlinear activation. Fixed-point weights and input feature maps are given with common scaling factors and , respectively, where is the layer index. Then, the convolution operation can be implemented by fixed-point multipliers and accumulators. Biases are added, if present, after the convolution, and then the output is scaled properly by the product of the scaling factors for weights and input feature maps, i.e., , as shown in the figure. Here, the scaling factor for the biases is specially set to be this same product so that fixed-point bias addition can be done easily without another scaling. Then, a nonlinear activation function follows. Finally, the output activations are fed into the next layer as the input.
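The data flow described above can be sketched for the fully-connected case (matrix multiplication in place of convolution). This is a minimal illustration, not the authors' implementation: the function name `fixed_point_dense`, the symbols `delta_w` and `delta_x`, and the toy values are all assumptions made for the example.

```python
import numpy as np

def fixed_point_dense(w_int, x_int, b_int, delta_w, delta_x):
    """Integer multiply-accumulate, integer bias addition, one rescale, ReLU."""
    # Integer accumulation: no floating-point arithmetic is needed here.
    acc = w_int.astype(np.int64) @ x_int.astype(np.int64)
    # The bias is assumed to share the scaling factor delta_w * delta_x,
    # so it can be added directly to the integer accumulator.
    acc = acc + b_int.astype(np.int64)
    # A single rescale by the product of the two scaling factors.
    y = acc * (delta_w * delta_x)
    return np.maximum(y, 0.0)

# Hypothetical toy values (integers are the fixed-point codes).
w_int = np.array([[1, -2], [3, 0]])
x_int = np.array([2, 1])
b_int = np.array([4, -8])
delta_w, delta_x = 0.5, 0.25

y_fixed = fixed_point_dense(w_int, x_int, b_int, delta_w, delta_x)

# Floating-point reference computed from the dequantized values.
y_float = np.maximum(
    (delta_w * w_int) @ (delta_x * x_int) + delta_w * delta_x * b_int, 0.0
)
```

Both paths give the same result, which is the point of choosing the bias scaling factor to equal the product of the weight and input scaling factors.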
Using rectified linear unit (ReLU) activation, two scaling operations across two layers, i.e., scaling operations by and , can be combined into one scaling operation by before (or after) ReLU activation. Furthermore, if the scaling factors are power-of-two numbers, then one can even implement scaling by bit-shift. Similarly, low-precision fully-connected layers can be implemented by replacing convolution with matrix multiplication in the figure.
IV Regularization for quantization
In this section, we present the regularizers that are utilized to learn the quantized DNNs of low-precision weights and activations. We first define the quantization function. Given the number of bits, i.e., bit-width , the quantization function yields
(1) 
where is the input and is the scaling factor; the rounding and clipping functions satisfy
where is the largest integer smaller than or equal to . For ReLU activation, the ReLU output is always nonnegative, and thus we use the unsigned quantization function given by
(2) 
for , where .
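A minimal sketch of the signed and unsigned linear quantizers described above, built from rounding (via the floor function) and clipping. The function names, the tie-breaking rule (ties away from zero), and the integer ranges (two's-complement style for the signed case, 0 to 2^n - 1 for the unsigned case) are assumptions for illustration and may differ from the exact definitions in (1) and (2).

```python
import numpy as np

def round_half_away(x):
    # Round to the nearest integer using floor; ties go away from zero.
    return np.sign(x) * np.floor(np.abs(x) + 0.5)

def quantize_signed(x, delta, n_bits):
    # Assumed signed integer range: [-2^(n-1), 2^(n-1) - 1].
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return delta * np.clip(round_half_away(x / delta), lo, hi)

def quantize_unsigned(x, delta, n_bits):
    # Assumed unsigned integer range for non-negative (ReLU) inputs: [0, 2^n - 1].
    return delta * np.clip(round_half_away(x / delta), 0, 2 ** n_bits - 1)

x = np.array([-1.3, -0.26, 0.1, 0.4, 2.0])
q_signed = quantize_signed(x, delta=0.25, n_bits=3)
q_unsigned = quantize_unsigned(np.maximum(x, 0.0), delta=0.25, n_bits=3)
```

With a cell size of 0.25 and 3 bits, values outside the representable range saturate at the clipping boundary, while in-range values snap to the nearest multiple of the scaling factor.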
IV-A Regularization for weight quantization
Consider a general nonlinear DNN model consisting of layers. Let be the sets of weights in layers to , respectively. For notational simplicity, we let
for any symbol . For weight quantization, we define the MSQE regularizer for weights of all layers as
(3) 
where is the bit-width for quantized weights, is the scaling factor (i.e., quantization cell size) for quantized weights in layer , and is the total number of weights in all layers, i.e., . We assume that the bit-width is the same for all layers, just for notational simplicity, but this can be easily extended to more general cases in which each layer has a different bit-width.
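The MSQE regularizer can be sketched as the squared quantization error summed over all layers and divided by the total number of weights. The helper names and the signed integer range below are assumptions for illustration, not the authors' code.

```python
import numpy as np

def quantize_signed(w, delta, n_bits):
    # Assumed signed linear quantizer with range [-2^(n-1), 2^(n-1) - 1].
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    levels = np.sign(w) * np.floor(np.abs(w / delta) + 0.5)
    return delta * np.clip(levels, lo, hi)

def msqe_regularizer(weights, deltas, n_bits):
    """Mean squared quantization error over all layers.

    weights: list of per-layer weight arrays
    deltas:  list of per-layer scaling factors (one per layer)
    """
    total_sq_err = 0.0
    total_count = 0
    for w, delta in zip(weights, deltas):
        total_sq_err += np.sum((w - quantize_signed(w, delta, n_bits)) ** 2)
        total_count += w.size
    return total_sq_err / total_count

# Hypothetical two-layer example with different scaling factors per layer.
layers = [np.array([0.1, 0.3]), np.array([0.5])]
reg = msqe_regularizer(layers, deltas=[0.25, 0.5], n_bits=3)

# Weights that already sit on the quantization grid incur no penalty.
reg_on_grid = msqe_regularizer([np.array([0.25, -0.5])], deltas=[0.25], n_bits=3)
```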
Including the MSQE regularizer in (3), the cost function to optimize in training is given by
(4) 
where, with a slight abuse of notation, denotes the set of quantized weights of all layers, is the target loss function evaluated on the training dataset using all the quantized weights, and is the regularization coefficient, for . We set the scaling factors to be learnable parameters and optimize them along with weights .
Remark 1.
We clarify that our training uses high precision for its backward passes and gradient descent. However, its forward passes use quantized low-precision weights and activations, and the main target network loss function is also calculated with the quantized low-precision weights and activations to mimic the low-precision inference-stage loss. Hence, the final trained models are low-precision models, which can be operated on low-precision fixed-point hardware in inference.
Remark 2.
The high-precision weights accumulate the gradients that are evaluated with their quantized values, and thus we still have the gradient mismatch problem, similar to the existing approaches (see Section II). However, by adding the MSQE regularizer, we encourage the high-precision weights to converge to their quantized values, and we make the gradient descent more accurate.
Learnable regularization coefficient: The regularization coefficient in (4) is a hyperparameter that controls the trade-off between the loss and the regularization. It is conventionally fixed ahead of training. However, searching for a good hyperparameter value is usually time-consuming. Hence, we propose the learnable regularization coefficient, i.e., we let the regularization coefficient be another learnable parameter.
We start training with a small initial value for , i.e., with little regularization. However, we promote the increase of in training by adding a penalty term for a small regularization coefficient, i.e., for , in the cost function (see (5)). The increasing coefficient reinforces the convergence of high-precision weights to their quantized values for reducing the MSQE regularization term (see Remark 4). In this way, we gradually boost the regularization factor and encourage the soft transition of high-precision weights to their quantized values. It consequently alleviates the gradient mismatch problem that we mentioned in Remark 2.
The cost function in (4) is altered into
(5) 
where we introduce a new hyperparameter , while making the regularization coefficient learnable. We note that the trade-off between the loss and the regularization is now actually controlled by the new parameter instead of , i.e., the larger the value of , eventually the more the regularization. This transfer is however beneficial since the new parameter does not directly impact either the loss or the regularization, and we can induce smooth transition of high-precision weights to their quantized values.
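As a hedged illustration only: the penalty on a small regularization coefficient is described above in words, and one form consistent with that description is a logarithmic penalty. The notation below ($L$ for the target loss, $R$ for the MSQE regularizer in (3), $\alpha$ for the learnable coefficient, $\lambda$ for the new hyperparameter) and the logarithmic form itself are assumptions made for this sketch.

```latex
C = L + \alpha R - \lambda \log \alpha,
\qquad
\frac{\partial C}{\partial \alpha} = R - \frac{\lambda}{\alpha}
```

Under this assumed form, the derivative vanishes at $\alpha = \lambda / R$, so a larger $\lambda$ or a smaller regularizer value pushes the coefficient upward, which matches the behavior described in the text.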
Figure 2 presents an example of how high-precision weights are gradually quantized by our regularization scheme. We plotted weight histogram snapshots captured at the second convolutional layer of the LeNet-5 model (footnote 1: https://github.com/BVLC/caffe/tree/master/examples/mnist) while a pretrained model is quantized to a 4-bit fixed-point model. The histograms in the figure from the left to the right correspond to , , , and batch iterations of training, respectively. Observe that the weight distribution gradually converges to the sum of uniformly spaced delta functions and all high-precision weights converge to quantized values completely in the end.
The proposed weight quantization method is also applicable to learning a quantized DNN from scratch. In Figure 3, we plotted the convergence curves of a 4-bit fixed-point LeNet-5 model trained from scratch. Observe that the regularization term reduces smoothly, while the cross-entropy network loss decreases with some jitters due to stochastic gradient descent. The cross-entropy loss and the top-1 accuracy for the test dataset converge and remain stable.
We note that biases are treated similarly to weights. However, for the fixed-point design presented in Section III, we use instead of as the scaling factor in (3), where is the scaling factor for input feature maps (i.e., activations from the previous layer), which is determined by the following activation quantization procedure.
IV-B Regularization for quantization of activations
We quantize the output activation (feature map) of layer for and yield , where is the quantization function in (2) for bit-width and is the learnable scaling factor for quantized activations of layer . (Footnote 2: We note that is the scaling factor for activations of layer whereas it denotes the scaling factor for input feature maps of layer in Section III (see Figure 1). This is just one index shift in the notation, since the output of layer is the input to layer . We adopt this change just for notational simplicity.) Similar to (3), we assume that the activation bit-width is the same for all layers, but this constraint can be easily relaxed to cover more general cases where each layer has a different bit-width. We also assume ReLU activation and use the unsigned quantization function, while we can replace with in the case of a general nonlinear activation.
We optimize by minimizing the MSQE for activations of layer , i.e., we minimize
(6) 
where is the set of activations of layer for .
IV-C Regularization for power-of-two scaling
In fixed-point computations, it is more appealing for the scaling factors to be powers of two so they can be implemented by simple bit-shift, rather than with scalar multiplication. To obtain power-of-two scaling factors for weights, we introduce an additional regularizer given by
(7) 
where is a rounding function towards the closest power-of-two number. The scaling factors for activations can be regularized similarly by
(8) 
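A minimal sketch of a power-of-two rounding function and the corresponding penalty on scaling factors. "Closest" is interpreted here in the linear domain, and the penalty is averaged over layers; both choices, along with the function names, are assumptions for illustration and may differ from the exact definitions in (7) and (8).

```python
import numpy as np

def round_to_power_of_two(delta):
    # Compare the two powers of two that bracket delta and pick the nearer one
    # (linear-domain distance; an assumption for this sketch).
    lower = 2.0 ** np.floor(np.log2(delta))
    upper = 2.0 * lower
    return np.where(delta - lower <= upper - delta, lower, upper)

def power_of_two_regularizer(deltas):
    # Mean squared distance between scaling factors and their nearest powers of two.
    deltas = np.asarray(deltas, dtype=float)
    return np.mean((deltas - round_to_power_of_two(deltas)) ** 2)

rounded = round_to_power_of_two(np.array([0.3, 0.7, 1.0, 0.2]))
penalty_zero = power_of_two_regularizer([0.25, 0.5])

# Once a scaling factor equals 2^-2, rescaling an integer accumulator
# reduces to a right shift by two bits.
acc = 52
shifted = acc >> 2
```

The last two lines illustrate why power-of-two scaling is attractive: multiplying by 0.25 and shifting right by two bits give the same integer result for this accumulator value.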
V Training with regularization
In this section, we derive the gradients for the learnable parameters, i.e., weights, regularization coefficients and scaling factors, when the regularizers for quantization are included in the cost function. The regularized gradients are then used to update the parameters by gradient descent.
V-A Cost function
We first define the cost function combining (5) and (6) for both weight and activation quantization as follows:
(9) 
where weights in , the weight regularization coefficient , and scaling factors in and are all learnable parameters; and are two hyperparameters fixed in training.
For power-of-two scaling, we have additional regularizers provided in Section IV-C as follows:
(10) 
similar to and for weights, are the learnable regularization coefficients for power-of-two scaling and are two new hyperparameters.
V-B Gradients for weights
The gradient of the cost function in (9) with respect to weight satisfies
(11) 
for weight of layer , . The first partial derivative on the right-hand side of (11) can be obtained efficiently by the DNN backpropagation algorithm. For backpropagation through the weight quantization function, we use the following approximation similar to the straight-through estimator [29]:
(12) 
where is an indicator function such that it is one if is true and zero otherwise. Namely, we pass the gradient through the quantization function when the weight is within the clipping boundary. Moreover, to give some room for the weight to move around the boundary in stochastic gradient descent, we additionally allow some margin of for and for . Outside the clipping boundary with some margin, we pass zero.
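The pass-through rule with a margin around the clipping boundary can be sketched as a binary mask applied to the incoming gradient. The margin value, the signed integer range, and the function names below are hypothetical choices for illustration; the margins used in (12) are not reproduced here.

```python
import numpy as np

def ste_weight_mask(w, delta, n_bits, margin=0.5):
    # Assumed clipping range in units of delta, widened by a hypothetical margin.
    lo = -(2 ** (n_bits - 1)) - margin
    hi = 2 ** (n_bits - 1) - 1 + margin
    scaled = w / delta
    return ((scaled >= lo) & (scaled <= hi)).astype(w.dtype)

def backprop_through_quantizer(grad_wrt_quantized, w, delta, n_bits, margin=0.5):
    # Pass the gradient unchanged inside the (widened) clipping range, zero outside.
    return grad_wrt_quantized * ste_weight_mask(w, delta, n_bits, margin)

w = np.array([-1.2, -1.0, 0.0, 0.8, 1.0])
upstream = np.full_like(w, 2.0)
grad_w = backprop_through_quantizer(upstream, w, delta=0.25, n_bits=3)
```

Here the two outermost weights lie beyond the widened range, so their gradients are zeroed, while the weight just past the nominal positive boundary still receives a gradient because of the margin.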
For weight of layer , , the partial derivative of the regularizer in (3) satisfies
(13) 
almost everywhere except some nondifferentiable points of at quantization cell boundaries given by
(14) 
for and . If the weight is located at one of these boundaries, it actually makes no difference to update to either direction of or , in terms of its quantization error. Thus, we let
(15) 
From (11)–(15), we finally have
(16) 
Remark 3.
If the weight is located at one of the cell boundaries, the weight gradient is solely determined by the network loss function derivative and thus the weight is updated towards the direction to minimize the network loss function. Otherwise, the regularization term impacts the gradient as well and encourages the weight to converge to the closest cell center as long as the change in the loss function is small. The regularization coefficient trades off these two contributions of the network loss function and the regularization term.
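Because the quantizer output is piecewise constant in the weight, the MSQE gradient away from cell boundaries is proportional to the quantization error itself. The sketch below checks this against a central finite difference; the helper names, the quantizer range, and the test values are assumptions for illustration.

```python
import numpy as np

def quantize_signed(w, delta, n_bits):
    # Assumed signed linear quantizer with range [-2^(n-1), 2^(n-1) - 1].
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    levels = np.sign(w) * np.floor(np.abs(w / delta) + 0.5)
    return delta * np.clip(levels, lo, hi)

def msqe(w, delta, n_bits):
    return np.mean((w - quantize_signed(w, delta, n_bits)) ** 2)

def msqe_grad_w(w, delta, n_bits):
    # Valid almost everywhere: the quantized value is locally constant in w.
    return 2.0 * (w - quantize_signed(w, delta, n_bits)) / w.size

w = np.array([0.10, -0.30, 0.62])   # chosen away from cell boundaries
delta, n_bits = 0.25, 4
analytic = msqe_grad_w(w, delta, n_bits)

eps = 1e-6
numeric = np.array([
    (msqe(w + eps * e, delta, n_bits) - msqe(w - eps * e, delta, n_bits)) / (2 * eps)
    for e in np.eye(w.size)
])
```

A gradient descent step along the negative of this gradient moves each weight toward its nearest quantization level, which is the behavior described in Remark 3.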
V-C Gradient for the regularization coefficient
The gradient of the cost function for is given by
(17) 
Observe that tends to in gradient descent.
Remark 4.
Recall that weights gradually tend to their closest quantization output levels to reduce the regularizer (see Remark 3). As the regularizer decreases, the regularization coefficient gets larger by gradient descent using (17). Then, a larger regularization coefficient further forces weights to move towards quantized values in the following update. In this manner, weights gradually converge to quantized values.
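The feedback loop in Remark 4 can be illustrated with a toy simulation. The sketch assumes a logarithmic penalty on the coefficient, so that the gradient with respect to the coefficient is the regularizer value minus the hyperparameter divided by the coefficient; this penalty form, the names `alpha`, `lam`, `reg_value`, and all numeric settings are assumptions for illustration rather than the exact expression in (17).

```python
def alpha_step(alpha, reg_value, lam, lr):
    # Assumed gradient of (alpha * R - lam * log(alpha)) with respect to alpha.
    grad = reg_value - lam / alpha
    return alpha - lr * grad

def run(alpha, reg_value, lam=1.0, lr=0.1, steps=2000):
    # Plain gradient descent on alpha with the regularizer value held fixed.
    for _ in range(steps):
        alpha = alpha_step(alpha, reg_value, lam, lr)
    return alpha

# Early in training the regularizer is relatively large (hypothetical value 0.5).
alpha_early = run(0.5, reg_value=0.5)

# Later the regularizer has shrunk (hypothetical value 0.1), so alpha grows.
alpha_late = run(alpha_early, reg_value=0.1, steps=30000)
```

Under these assumptions the coefficient settles near the ratio of the hyperparameter to the regularizer value, so it rises as the quantization error falls.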
V-D Gradients for scaling factors
For scaling factor optimization, we approximately consider the MSQE regularization term only for simplicity. Using the chain rule for (3), it follows that
(18) 
for . Moreover, it can be shown that
(19) 
almost everywhere except some nondifferentiable points of satisfying
(20) 
for . Similar to (15), we let
(21) 
so that the scaling factor is not impacted by the weights at the cell boundaries. From (18)–(21), it follows that
Similarly, one can derive the gradients for activation scaling factors from (6) and (9), which we omit here.
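For a single layer, the scaling-factor gradient of the MSQE can be sketched and checked numerically. Away from cell boundaries the integer level assigned to each weight is locally constant in the scaling factor, so the derivative of the quantized value with respect to the scaling factor is that integer level. The helper names, the quantizer range, and the test values are assumptions for illustration.

```python
import numpy as np

def integer_levels(w, delta, n_bits):
    # Assumed signed range [-2^(n-1), 2^(n-1) - 1]; rounding ties away from zero.
    lo, hi = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return np.clip(np.sign(w) * np.floor(np.abs(w / delta) + 0.5), lo, hi)

def msqe(w, delta, n_bits):
    return np.mean((w - delta * integer_levels(w, delta, n_bits)) ** 2)

def msqe_grad_delta(w, delta, n_bits):
    # Valid almost everywhere: levels are locally constant in delta.
    levels = integer_levels(w, delta, n_bits)
    return -2.0 * np.sum((w - delta * levels) * levels) / w.size

w = np.array([0.10, -0.30, 0.62])   # chosen away from cell boundaries
delta, n_bits = 0.25, 4
analytic = msqe_grad_delta(w, delta, n_bits)

eps = 1e-6
numeric = (msqe(w, delta + eps, n_bits) - msqe(w, delta - eps, n_bits)) / (2 * eps)
```

A negative gradient here means that slightly enlarging the cell size would reduce the quantization error for these particular weights.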
V-E Gradients for power-of-two scaling
For power-of-two scaling, the cost function in (10) has additional regularization terms and from Section IV-C. Similar to (16), it can be shown that
where is the set of power-of-two rounding boundaries for the set of integers . Moreover, as in (17), we update using
In a similar manner, one can obtain the additional gradients for and from (8) and (10), respectively, which we do not repeat here.
V-F Implementation details
For low-precision DNNs, we define the cost function as (9) and update learnable parameters, i.e., weights, regularization coefficients and scaling factors, by gradient descent. If power-of-two scaling is needed, we use the cost function (10) instead.
Initialization: Provided a pretrained high-precision model, weight scaling factors are initialized to cover the dynamic range of the pretrained weights, e.g., the th percentile magnitude of the weights in each layer. Similarly, activation scaling factors are set to cover the dynamic range of the activations in each layer, which are obtained by feeding a small number of training data to the pretrained model. The regularization coefficients are set to be small initially, and gradient descent boosts them gradually (see Remark 4).
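One way to turn the initialization rule into code is to divide a high percentile of the magnitudes by the largest representable integer level, so that the quantization grid spans the observed dynamic range. The percentile default, the mapping from percentile to scaling factor, and the function names are hypothetical choices for this sketch.

```python
import numpy as np

def init_weight_scale(w, n_bits, percentile=99.0):
    # Largest positive level of an assumed signed n-bit grid (at least 1).
    max_level = max(2 ** (n_bits - 1) - 1, 1)
    return np.percentile(np.abs(w), percentile) / max_level

def init_activation_scale(a, n_bits, percentile=99.0):
    # Activations after ReLU are non-negative, so an unsigned grid is assumed.
    max_level = 2 ** n_bits - 1
    return np.percentile(a, percentile) / max_level

# Hypothetical values; percentile=100 simply uses the maximum magnitude.
delta_w = init_weight_scale(np.array([-1.0, 0.5, 0.25]), n_bits=3, percentile=100.0)
delta_a = init_activation_scale(np.array([0.0, 0.7, 1.4]), n_bits=3, percentile=100.0)
```

In practice the activation statistics would come from a forward pass of the pretrained model over a small batch of training data, as the text describes.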
Backpropagation through activation quantization: Backpropagation is not feasible through the activation quantization function analytically since the gradient is zero almost everywhere. For backpropagation through the quantization function, we adopt the straight-through estimator [29]. In particular, we pass the gradient through the quantization function when the input is within the clipping boundary. If the input is outside the clipping boundary, we pass zero.
Complexity: The additional computational complexity for the regularized gradients is not expensive. It only scales in the order of , where is the number of weights. Hence, the proposed algorithm is easily applicable to state-of-the-art DNNs with millions or tens of millions of weights.
Remark 5 (Comparison to soft weight sharing).
In soft weight sharing [47, 33], a Gaussian mixture prior is assumed, and the model is regularized to form groups of weights that have similar values around the Gaussian component centers (e.g., see [48, Section 5.5.7]). Our weight regularization method is different from the soft weight sharing since we consider linear quantization and optimize common scaling factors, instead of optimizing individual Gaussian component centers for nonlinear quantization. We furthermore employ the simple MSQE regularization term for quantization, so that it is applicable to large-size DNNs. Note that the soft weight sharing yields the regularization term of the logarithm of the summation of Gaussian probability density functions (i.e., exponential functions), which is sometimes too complex to evaluate in modern DNNs with millions or tens of millions of weights.
VI Experiments
We evaluate the proposed low-precision DNN quantization method for ImageNet classification and image super resolution. Image super resolution is included in our experiments as a regression problem since its accuracy is more sensitive to quantization than classification accuracy. To the best of our knowledge, we are the first to evaluate DNN quantization for regression problems. We compare our approach to current state-of-the-art techniques, XNOR-Net [7], DoReFa-Net [8], and HWGQ [9].
VI-A ImageNet classification
We first evaluate our quantization scheme on AlexNet [4] and ResNet-18 [5] for the ImageNet ILSVRC 2012 dataset [6]. For the AlexNet model, similar to the previous methods in [7, 8, 9], we add batch normalization in convolution and fully-connected layers before applying nonlinear activations.
In training, we use the Adam optimizer [49]. The learning rate is set to be and we train batches with the batch size of for AlexNet and for ResNet-18, respectively. Then, we decrease the learning rate to and train more batches. We let and in (9). For the learnable regularization coefficient , we let and learn instead in order to make always positive in training. The initial value of is set to be , and it is updated with the Adam optimizer using the learning rate of .
Table I. Top-1 / Top-5 accuracy (%)

| Model | Ours | XNOR-Net [7] | DoReFa-Net [8] | HWGQ [9] |
|---|---|---|---|---|
| AlexNet | 53.0 / 76.8 | 44.2 / 69.2 | 49.3 / 74.1* | 52.7 / 76.3 |
| ResNet-18 | 60.4 / 83.3 | 51.2 / 73.2 | N/A | 59.6 / 82.2 |

* from our experiments
We summarize the accuracy of our low-precision AlexNet and ResNet-18 models of binary weights and 2-bit activations in Table I and compare them to the existing low-precision models from XNOR-Net [7], DoReFa-Net [8] and HWGQ [9]. The results show that our method yields state-of-the-art low-precision models that achieve higher accuracy than the previously available ones. More experimental results for various bit-widths can be found below in Table II and Table III.
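Table II. AlexNet, Top-1 / Top-5 accuracy (%)

| Quantized layers | Weights | Activations | Ours | DoReFa-Net [8]* |
|---|---|---|---|---|
| - | 32-bit FLP | 32-bit FLP | 58.0 / 80.8 | 58.0 / 80.8 |
| (1) All layers | 8-bit FXP | 8-bit FXP | 57.0 / 79.9 | 57.6 / 80.8 |
| (1) All layers | 4-bit FXP | 4-bit FXP | 56.5 / 79.4 | 56.9 / 80.3 |
| (1) All layers | 2-bit FXP | 2-bit FXP | 53.5 / 77.3 | 43.0 / 68.1 |
| (1) All layers | 1-bit FXP | 8-bit FXP | 52.2 / 75.8 | 47.5 / 72.1 |
| (1) All layers | 1-bit FXP | 4-bit FXP | 52.0 / 75.7 | 45.1 / 69.7 |
| (1) All layers | 1-bit FXP | 2-bit FXP | 50.5 / 74.6 | 43.6 / 68.3 |
| (1) All layers | 1-bit FXP | 1-bit FXP | 41.1 / 66.6 | 19.3 / 38.2 |
| (2) Except the first and the last layers | 8-bit FXP | 8-bit FXP | 57.2 / 79.9 | 57.5 / 80.7 |
| (2) Except the first and the last layers | 4-bit FXP | 4-bit FXP | 56.6 / 79.8 | 56.9 / 80.1 |
| (2) Except the first and the last layers | 2-bit FXP | 2-bit FXP | 54.1 / 77.9 | 53.1 / 77.3 |
| (2) Except the first and the last layers | 1-bit FXP | 8-bit FXP | 54.8 / 78.1 | 51.2 / 75.5 |
| (2) Except the first and the last layers | 1-bit FXP | 4-bit FXP | 54.8 / 78.2 | 51.9 / 75.9 |
| (2) Except the first and the last layers | 1-bit FXP | 2-bit FXP | 53.0 / 76.8 | 49.3 / 74.1 |
| (2) Except the first and the last layers | 1-bit FXP | 1-bit FXP | 43.9 / 69.0 | 40.2 / 65.5 |

* from our experiments; the 32-bit FLP baseline is a single shared value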
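Table III. ResNet-18, Top-1 / Top-5 accuracy (%)

| Quantized layers | Weights | Activations | Top-1 / Top-5 accuracy (%) |
|---|---|---|---|
| - | 32-bit FLP | 32-bit FLP | 68.1 / 88.4 |
| (1) All layers | 8-bit FXP | 8-bit FXP | 68.1 / 88.3 |
| (1) All layers | 4-bit FXP | 4-bit FXP | 67.4 / 87.9 |
| (1) All layers | 2-bit FXP | 2-bit FXP | 60.6 / 83.7 |
| (1) All layers | 1-bit FXP | 8-bit FXP | 61.3 / 83.7 |
| (1) All layers | 1-bit FXP | 4-bit FXP | 60.2 / 83.2 |
| (1) All layers | 1-bit FXP | 2-bit FXP | 55.6 / 79.6 |
| (1) All layers | 1-bit FXP | 1-bit FXP | 38.9 / 65.4 |
| (2) Except the first and the last layers | 8-bit FXP | 8-bit FXP | 68.1 / 88.2 |
| (2) Except the first and the last layers | 4-bit FXP | 4-bit FXP | 67.3 / 87.9 |
| (2) Except the first and the last layers | 2-bit FXP | 2-bit FXP | 61.7 / 84.4 |
| (2) Except the first and the last layers | 1-bit FXP | 8-bit FXP | 64.3 / 86.1 |
| (2) Except the first and the last layers | 1-bit FXP | 4-bit FXP | 63.9 / 85.6 |
| (2) Except the first and the last layers | 1-bit FXP | 2-bit FXP | 60.4 / 83.3 |
| (2) Except the first and the last layers | 1-bit FXP | 1-bit FXP | 47.2 / 73.0 |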
In Table II, we compare the performance of our quantization method to DoReFa-Net [8] for the AlexNet model. (Footnote 3: The DoReFa-Net results in Table II are (re)produced by us from their code, https://github.com/ppwwyyxx/tensorpack/tree/master/examples/DoReFaNet.) Table III shows the accuracy of the low-precision ResNet-18 models obtained from our quantization method. We evaluate two cases where (1) all layers are quantized, and (2) all layers except the first and the last layers are quantized. We note that the previous work [7, 8, 9] evaluates case (2) only. The results in Table II and Table III show that 4-bit quantization is needed for accuracy loss less than %. For binary weights, we observe some accuracy loss of more or less than %. However, we can see that our quantization scheme performs better than DoReFa-Net [8], in particular for low-precision cases.
Figure 4 shows the convergence curves of low-precision AlexNet and ResNet-18 models of binary weights and 2-bit activations. The results show that the regularization term consistently decreases up to some level, while weights converge to quantized values. After some level, the regularization term does not decrease further since more regularization requires considerable accuracy loss. After finding this optimal regularization point, the network loss function is further optimized to find the best quantized model. Observe that the cross-entropy loss keeps improving and the test accuracy also increases while the regularization term saturates.
VI-B Image super resolution
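Table IV. Results on Set14

| Model | Method | Weights | Activations | Set14 PSNR (dB) | Set14 SSIM | PSNR (dB) loss | SSIM loss |
|---|---|---|---|---|---|---|---|
| SRCNN 3-layer | Original model | 32-bit FLP | 32-bit FLP | 29.05 | 0.8161 | - | - |
| SRCNN 3-layer | Ours | 8-bit FXP | 8-bit FXP | 29.03 | 0.8141 | 0.02 | 0.0020 |
| SRCNN 3-layer | Ours | 4-bit FXP | 8-bit FXP | 28.99 | 0.8133 | 0.06 | 0.0028 |
| SRCNN 3-layer | Ours | 2-bit FXP | 8-bit FXP | 28.72 | 0.8075 | 0.33 | 0.0086 |
| SRCNN 3-layer | Ours | 1-bit FXP | 8-bit FXP | 28.53 | 0.8000 | 0.52 | 0.0161 |
| CT-SRCNN 5-layer | Original model | 32-bit FLP | 32-bit FLP | 29.56 | 0.8273 | - | - |
| CT-SRCNN 5-layer | Ours | 8-bit FXP | 8-bit FXP | 29.54 | 0.8267 | 0.02 | 0.0006 |
| CT-SRCNN 5-layer | Ours | 4-bit FXP | 8-bit FXP | 29.48 | 0.8258 | 0.08 | 0.0015 |
| CT-SRCNN 5-layer | Ours | 2-bit FXP | 8-bit FXP | 29.28 | 0.8201 | 0.28 | 0.0072 |
| CT-SRCNN 5-layer | Ours | 1-bit FXP | 8-bit FXP | 29.09 | 0.8171 | 0.47 | 0.0102 |
| CT-SRCNN 9-layer | Original model | 32-bit FLP | 32-bit FLP | 29.71 | 0.8300 | - | - |
| CT-SRCNN 9-layer | Ours | 8-bit FXP | 8-bit FXP | 29.67 | 0.8288 | 0.04 | 0.0012 |
| CT-SRCNN 9-layer | Ours | 4-bit FXP | 8-bit FXP | 29.63 | 0.8285 | 0.08 | 0.0015 |
| CT-SRCNN 9-layer | Ours | 2-bit FXP | 8-bit FXP | 29.37 | 0.8236 | 0.34 | 0.0064 |
| CT-SRCNN 9-layer | Ours | 1-bit FXP | 8-bit FXP | 29.20 | 0.8193 | 0.51 | 0.0107 |
| Bicubic | - | - | - | 27.54 | 0.7742 | - | - |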
We next evaluate the proposed method on SRCNN [10] and cascade-trained SRCNN (CT-SRCNN) [11] for image super resolution. Initializing their weights with pretrained ones, we train the quantized models using the Adam optimizer for batches with the batch size of . We use the learning rate of . We set and . Similar to Section VI-A, we let and learn instead of learning directly in order to make always positive in training. The initial value for is set to be and it is updated by the Adam optimizer using the learning rate of .
The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [44] are compared for the Set14 image dataset [50] in Table IV. Our experimental results show that our method successfully yields low-precision models of 8-bit weights and activations at negligible loss. It is interesting to see that the PSNR loss of using binary weights and 8-bit activations is dB only.
Figure 5 provides the ablation study results for 9-layer CT-SRCNN when quantized from a pretrained model. We compare the PSNR with and without retraining. Observe that retraining yields considerable gain. The figure also shows little performance loss when scaling factors are restricted to power-of-two numbers by our power-of-two scaling regularization.
VII Further discussion on DNN compression
Although our focus in this paper is mainly on the fixed-point design and the complexity reduction in inference by low-precision DNNs, we can also achieve DNN compression from our weight quantization scheme. In DNN compression, weight pruning plays an important role since the size reduction is huge when the pruning ratio is large, e.g., see [39, 42]. Hence, weight pruning is further investigated in this section.
For weight pruning, we employ the partial L2 regularization term. In particular, given a target pruning ratio , we find the th percentile of weight magnitude values. Assuming that we prune the weights below this th percentile value in magnitude, we define an L2 regularizer partially for them as follows:
where is the th percentile of weight magnitude values, i.e., the threshold for weight pruning. Employing the learnable regularization coefficient as in (5), the cost function for weight pruning is given by
The partial L2 regularizer encourages the weights below the threshold to move together towards zero, while the other weights are not regularized but updated to minimize the performance loss due to pruning. Furthermore, the threshold is also updated every iteration in training based on the instant weight distribution. We note that the threshold decreases as training goes on since the regularized weights gradually converge to zero. After finishing the regularized training, we finally have a set of weights clustered very near zero. The loss due to pruning these small weights is negligible.
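A minimal sketch of the partial L2 regularizer: only the weights whose magnitudes fall below the percentile threshold are penalized, and the threshold is recomputed from the current weights at each call. Normalizing by the total number of weights and the function name are assumptions made for this illustration.

```python
import numpy as np

def partial_l2(w, prune_ratio):
    """Return (regularizer value, threshold, mask of regularized weights)."""
    magnitudes = np.abs(w)
    # Threshold at the given percentile of the current weight magnitudes.
    threshold = np.percentile(magnitudes, 100.0 * prune_ratio)
    mask = magnitudes < threshold
    # Sum of squares over the small weights only (assumed normalization by w.size).
    value = np.sum(w[mask] ** 2) / w.size
    return value, threshold, mask

# Hypothetical weights with a target pruning ratio of 40%.
w = np.array([0.01, -0.02, 0.5, -0.8, 1.0])
value, threshold, mask = partial_l2(w, prune_ratio=0.4)
```

Calling the function again after each weight update reproduces the moving threshold described above, since the percentile follows the instantaneous weight distribution.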
After weight pruning, the pruned model is quantized by retraining with the MSQE regularizer. In this stage, pruned weights are fixed to be zero while unpruned weights are updated for quantization. In Figure 6, we compare our DNN compression pipeline to deep compression [31]. In [31], weight pruning and quantization are each followed by a separate fine-tuning stage. In our compression pipeline, no separate fine-tuning stages are needed. We directly learn the pruned and quantized models by regularization, and one final low-precision conversion by linear quantization follows for fixed-point weights. Note that deep compression [31] employs nonlinear quantization for size compression only.
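The quantization stage after pruning can be illustrated with a small sketch. It assumes a symmetric signed uniform quantizer with a fixed scaling factor `delta`, a binary `mask` marking the unpruned weights, and a fixed regularization coefficient; the task loss and the learnable parameters are omitted, and all names are hypothetical.

```python
def quantize(w, delta, bits):
    """Uniform (linear) quantizer: round to the nearest multiple of delta, then clip."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    level = max(qmin, min(qmax, round(w / delta)))
    return level * delta

def msqe(weights, mask, delta, bits):
    """Mean squared quantization error over the unpruned weights only."""
    kept = [w for w, m in zip(weights, mask) if m]
    return sum((w - quantize(w, delta, bits)) ** 2 for w in kept) / len(kept)

def msqe_step(weights, mask, delta, bits, coeff, lr):
    """One gradient step on coeff * MSQE; pruned weights (mask 0) stay exactly zero."""
    n = sum(mask)
    updated = []
    for w, m in zip(weights, mask):
        if not m:
            updated.append(0.0)
            continue
        # The quantized value is treated as a constant when differentiating.
        grad = 2.0 * (w - quantize(w, delta, bits)) / n
        updated.append(w - lr * coeff * grad)
    return updated
```

Once the regularized training has pulled the unpruned weights close to the quantization grid, the final low-precision conversion amounts to applying `quantize` to every weight.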
Compression method  Unpruned weights (%)  Top-1 accuracy (%) of original / compressed models  Compression ratio
Han et al. [31]  8.0  99.2 / 99.3  39 
Choi et al. [32]  9.0  99.3 / 99.3  51 
Guo et al. [42]  0.9  99.1 / 99.1  108 
Ullrich et al. [33]  0.5  99.1 / 99.0  162 
Molchanov et al. [34]  0.7  99.1 / 99.0  365 
Louizos et al. [36]  0.6  99.1 / 99.0  771 
Ours  0.7  99.1 / 99.0  401 
LeNet-5 compression results: For LeNet-5 compression, we prune 99.0% of the weights of a pretrained model as described above. Then, we employ our weight quantization method for 3-bit weights. Note that we do not quantize activations in this experiment for a fair comparison to others [31, 32, 42, 33, 34, 36], which focus only on DNN weight pruning and compression. After quantization, the ratio of zero-valued weights, including the already pruned ones, increases from 99.0% to 99.3%, since some unpruned weights fall to zero after quantization. For compression, the 3-bit fixed-point weights of all layers are encoded together by Huffman coding. The unpruned weight indexes, also compressed by Huffman coding, are counted in the model size as well, as suggested in [31]. Compared to the previous DNN compression schemes, Table V shows that our method yields a good compression ratio for this exemplary DNN model. We emphasize that our quantization scheme is constrained to have low-precision fixed-point weights, while the other existing compression schemes result in quantized floating-point weights.
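To make the Huffman coding step concrete, the sketch below estimates how many bits a Huffman code would need for a list of quantized weight levels. It is an illustration only: the helper names are hypothetical, the input is an assumed flat list of integer levels, and the overhead for weight indexes and for storing the codebook is not modeled.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # Degenerate case: a single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_bits(symbols):
    """Total number of bits needed to encode `symbols` with their Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)
```

With a skewed distribution of levels, the most frequent level receives the shortest code, so the total size drops below the fixed 3 bits per weight.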
VIII Conclusion
We proposed a method to quantize deep neural networks (DNNs) by regularization to produce low-precision DNNs for efficient fixed-point inference. We also suggested the novel learnable regularization coefficient to find the optimal quantization while minimizing the performance loss. Although our training happens in high precision, particularly for its backward passes and gradient descent, its forward passes use quantized low-precision weights and activations, and thus the resulting networks can be operated on low-precision fixed-point hardware at inference time. We showed by experiments that the proposed quantization algorithm successfully produces low-precision DNNs of binary weights for classification problems, such as ImageNet classification, as well as for regression and image synthesis problems, such as image super resolution. In particular, for AlexNet and ResNet-18 models, our quantization method produces state-of-the-art low-precision models of binary weights and 2-bit activations achieving the top-1 accuracy of % and %, respectively. For image super resolution, we only lose dB PSNR when using binary weights and 8-bit activations, instead of 32-bit floating-point numbers. Finally, we also discussed how similar regularization techniques can be employed for weight pruning and network compression.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
 [3] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
 [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [7] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
 [8] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
 [9] Z. Cai, X. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave Gaussian quantization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5918–5926.
 [10] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.
 [11] H. Ren, M. El-Khamy, and J. Lee, “CT-SRCNN: Cascade trained and trimmed deep convolutional neural networks for image super resolution,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
 [12] M. Courbariaux, J.-P. David, and Y. Bengio, “Training deep neural networks with low precision multiplications,” in International Conference on Learning Representations (ICLR) Workshop, 2015.
 [13] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proceedings of the International Conference on Machine Learning, 2015, pp. 1737–1746.
 [14] D. Lin, S. Talathi, and S. Annapureddy, “Fixed point quantization of deep convolutional networks,” in International Conference on Machine Learning, 2016, pp. 2849–2858.
 [15] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented approximation of convolutional neural networks,” in International Conference on Learning Representations (ICLR) Workshop, 2016.
 [16] D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional neural networks using logarithmic data representation,” arXiv preprint arXiv:1603.01025, 2016.
 [17] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless CNNs with low-precision weights,” in International Conference on Learning Representations, 2017.
 [18] X. Chen, X. Hu, H. Zhou, and N. Xu, “FxpNet: Training a deep convolutional neural network in fixed-point representation,” in IEEE International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2494–2501.
 [19] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein, “Training quantized nets: A deeper understanding,” in Advances in Neural Information Processing Systems, 2017, pp. 5813–5823.
 [20] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
 [21] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” in International Conference on Learning Representations, 2016.
 [22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
 [23] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
 [24] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey, “Ternary neural networks with fine-grained quantization,” arXiv preprint arXiv:1705.01462, 2017.
 [25] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” in International Conference on Learning Representations, 2017.
 [26] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks,” in International Conference on Learning Representations, 2017.
 [27] L. Hou and J. T. Kwok, “Loss-aware weight quantization of deep networks,” in International Conference on Learning Representations, 2018.
 [28] M. Höhfeld and S. E. Fahlman, “Probabilistic rounding in neural network learning with limited precision,” Neurocomputing, vol. 4, no. 6, pp. 291–299, 1992.
 [29] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
 [30] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
 [31] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in International Conference on Learning Representations, 2016.
 [32] Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” in International Conference on Learning Representations, 2017.
 [33] K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” in International Conference on Learning Representations, 2017.
 [34] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning, 2017, pp. 2498–2507.
 [35] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
 [36] C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3290–3300.
 [37] E. Park, J. Ahn, and S. Yoo, “Weighted-entropy-based quantization for deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 7197–7205.
 [38] Y. Choi, M. El-Khamy, and J. Lee, “Universal deep neural network compression,” arXiv preprint arXiv:1802.02271, 2018.
 [39] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
 [40] V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
 [41] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
 [42] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Advances in Neural Information Processing Systems, 2016, pp. 1379–1387.
 [43] J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” in Advances in Neural Information Processing Systems, 2017, pp. 2178–2188.
 [44] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [45] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in IEEE International Solid-State Circuits Conference, 2014, pp. 10–14.
 [46] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Springer Science & Business Media, 2012, vol. 159.
 [47] S. J. Nowlan and G. E. Hinton, “Simplifying neural networks by soft weight-sharing,” Neural Computation, vol. 4, no. 4, pp. 473–493, 1992.
 [48] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
 [49] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [50] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.