Position-based Scaled Gradient for Model Quantization and Sparse Training

by   Jangho Kim, et al.
Seoul National University

We propose the position-based scaled gradient (PSG) that scales the gradient depending on the position of a weight vector to make it more compression-friendly. First, we theoretically show that applying PSG to the standard gradient descent (GD), which is called PSGD, is equivalent to the GD in the warped weight space, a space made by warping the original weight space via an appropriately designed invertible function. Second, we empirically show that PSG, acting as a regularizer of a weight vector, is very useful in model compression domains such as quantization and sparse training. PSG reduces the gap between the weight distributions of a full-precision model and its compressed counterpart. This enables the versatile deployment of a model in either an uncompressed mode or a compressed mode, depending on the availability of resources. The experimental results on the CIFAR-10/100 and ImageNet datasets show the effectiveness of the proposed PSG in both sparse training and quantization, even for extremely low bits.




1 Introduction

Many regularization strategies have been proposed to induce a prior to a neural network hoerl1970ridge ; tibshirani1996regression ; hinton2015distilling ; kim2018paraphrasing . Inspired by such regularization methods, which add a prior or constraint for a specific purpose, we propose in this paper a novel regularization method that non-uniformly scales the gradient for model compression problems. The scaled gradient, whose scale depends on the position of the weight, constrains the weight to a set of compression-friendly grid points. We replace the conventional gradient in stochastic gradient descent (SGD) with the proposed position-based scaled gradient (PSG) and call the resulting method PSGD. We show that PSGD in the original weight space is equivalent to optimizing the weights by standard SGD in a warped space, to which weights from the original space are warped by an invertible function. This function is designed such that the weights of the original space are forced to merge to the desired target positions by scaling the gradients.

(a) Classification and Quantization Error
(b) Weight Distribution of Full and 4-bit Precision
Figure 1: Results of ResNet-34 on CIFAR100 (a) Mean-squared quantization error (line) and classification error (bar) across different bits. Blue: SGD, Red: PSGD. (b) Example of weight distribution (Conv2_1 layer he2016deep ) trained with standard SGD and our PSGD. For PSGD, the distribution of the full precision weights closely resembles the low precision distribution, yet maintains its accuracy.

We are not the first to scale the gradient elements. The scaled gradient method, also known as the variable metric method davidon1991variable , multiplies the gradient vector by a positive definite matrix to scale the gradient. It includes a wide variety of methods such as the Newton method, quasi-Newton methods and the natural gradient method dennis1977quasi ; nocedal2006numerical ; bottou2010large . Generally, these methods rely on Hessian estimation or the Fisher information matrix for their scaling. Our method differs from them in that the scaling does not depend on the loss function but solely on the current position of the weight.

We apply the proposed PSG method to model compression problems such as quantization and pruning. In recent years, deploying a deep neural network (DNN) on restricted edge devices such as smartphones and IoT devices has become a very important issue. For this reason, reducing the bit-width of model weights (quantization) and removing unimportant model weights (pruning) have been studied and widely used in applications. The majority of the quantization literature starts with a pre-trained model and fine-tunes or re-trains the model using the entire training dataset. However, this scenario is restrictive in real-world applications because additional training is needed. The additional training phase requires a full-size dataset and high computational resources, which prohibits easy and fast deployment of DNNs on edge devices for customers in need.

To resolve this problem, many works have focused on post-training methods of quantization that do not require training data krishnamoorthi2018quantizing ; nagel2019data ; banner2019post ; zhao2019improving . For example, nagel2019data starts with a pre-trained model and makes only minor modifications to the weights by equalizing the scales across channels and correcting biases. However, the inherent discrepancy between the weight distribution of the pre-trained model and that of the quantized model is too large for the aforementioned methods to offset. As shown in Fig. 1, due to the difference between the two distributions, the quantization and classification errors increase as a lower bit-width is used. Accordingly, when it comes to layer-wise quantization, existing post-training methods suffer significant accuracy degradation when the model is quantized below 6 bits.

Meanwhile, another line of research in quantization approaches the task from the initial training phase. Gradient regularization alizadeh2020gradient trains the model with gradient regularization for quantization robustness across different bit widths at the initial training phase. While our method follows this scheme of training from the start, unlike alizadeh2020gradient , no explicit regularization term is introduced to the loss function. Instead, the gradients are scaled depending on the position of the weights. Our main goal is to train a robust model that can be easily switched to a compressed mode from an uncompressed mode when the resources are limited, without the need of re-training, fine-tuning and even accessing the data. To achieve this, we constrain the original weights to merge to a set of quantized grid points (Fig. 1(b)) by scaling their gradients proportional to the error between the original weight and its quantized version. For the sparse training case, the weights are regularized to merge to zero. More details will be described in Sec 3.

Our contributions can be summarized as follows:

  • We propose a novel regularization method for model compression by introducing the position-based scaled gradient (PSG), which can be considered a variant of the variable metric method.
  • We prove theoretically that PSG descent (PSGD) is equivalent to applying the standard gradient descent in the warped weight space. This leads the weight to converge to a well-performing local minimum in both compressed and uncompressed weight spaces (see Fig. 1(b)).
  • We apply PSG in quantization and sparse training and verify the effectiveness of PSG on CIFAR and ImageNet datasets. Also, we show PSGD is very effective for extremely low-bit quantization.

2 Related work

Quantization  Post-training quantization aims to quantize weights and activations to discrete grid points without additional training or use of the training data. The majority of works in the recent literature start from a pre-trained network trained by a standard training scheme zhao2019improving ; nagel2019data ; banner2019post . Many works on channel-wise quantization methods, which require storing quantization parameters per channel, focus on treating outliers in the activation maps. Alternatively, layer-wise quantization methods are more hardware-friendly and incur minimal overhead at inference time as they store quantization parameters per layer (as opposed to per channel) nagel2019data ; krishnamoorthi2018quantizing ; zhao2019improving . nagel2019data achieves near full-precision accuracy at 8 bits by bias correction and range equalization of channels, while zhao2019improving splits channels with outliers to reduce the clipping error. However, both suffer from severe accuracy degradation under 6 bits. Our method improves on, but is not limited to, uniform layer-wise quantization.

Meanwhile, another line of work in quantization has focused on quantization robustness by regularizing the weight distribution from the training phase. lin2018defensive focuses on minimizing the Lipschitz constant to regularize the gradients for robustness against adversarial attacks. Similarly, alizadeh2020gradient proposes a new regularization term on the norm of the gradients for quantization robustness across different bit widths. This enables "on-the-fly" quantization to various bit widths. Our method does not have an explicit regularization term but scales the gradients to implicitly regularize the weights in the full-precision domain to make them quantization-friendly. By doing so, we achieve state-of-the-art accuracies for layer-wise quantization as well as robustness across various bit widths. Additionally, we do not introduce significant training overhead because gradient norm regularization is not necessary, while alizadeh2020gradient necessitates double-backpropagation, which increases the training complexity.

Pruning  Another relevant line of research in model compression is pruning, in which unimportant units such as weights, filters, or entire blocks are pruned huang2018data ; li2016pruning ; huang2018data . Recent works have focused on pruning methods that include the pruning process in the training phase renda2020comparing ; zhu2017prune ; louizos2018learning ; lee2018snip . Among them, a substantial number of works utilize sparsity-inducing regularization. louizos2018learning proposes training with an L0 norm regularizer on individual weights to train a sparse network, using the expected L0 objective to relax the otherwise non-differentiable regularization term. Meanwhile, other works focus on using a saliency criterion. lee2018snip utilizes gradients of masks as a proxy for importance to prune networks in a single shot. Similar to lee2018snip and louizos2018learning , our method needs neither a heuristic pruning schedule during training nor additional fine-tuning after pruning. In our method, pruning is formulated as a subclass of quantization because PSG can be used for sparse training by setting the target value to zero instead of the quantized grid points.

3 Proposed method

In this section, we describe the proposed position-based scaled gradient descent (PSGD) method. In PSGD, a scaling function regularizes the original weight to merge to one of the desired target points, which perform well in both the uncompressed and compressed domains. This is equivalent to optimizing via SGD in the warped weight space. With a specially designed invertible function that warps the original weight space, optimization in this warped space converges to different local minima that are more compression-friendly than the solutions reached in the original weight space.

We first prove that optimizing in the original space with PSGD is equivalent to optimizing in the warped space with gradient descent. Then, we demonstrate how PSGD is used to constrain the weights to a set of desired target points. Lastly, we show how this method is able to yield comparable performance with that of vanilla SGD in the original uncompressed domain, despite being strongly regularized.

3.1 Optimization in warped space

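Theorem: Let $\mathcal{F}: \mathcal{X} \rightarrow \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subset \mathbb{R}^n$, be an arbitrary invertible multivariate function that warps the original weight space $\mathcal{X}$ into $\mathcal{Y}$, and consider the loss function $\mathcal{L}: \mathcal{X} \rightarrow \mathbb{R}$ and the equivalent loss function $\mathcal{L}' = \mathcal{L} \circ \mathcal{F}^{-1}: \mathcal{Y} \rightarrow \mathbb{R}$. Then, the gradient descent (GD) method in the warped space $\mathcal{Y}$ is equivalent to applying a scaled gradient descent in the original space $\mathcal{X}$ such that

$$x_{t+1} = x_t - \eta \, [J_x^{\mathcal{F}}(x_t)]^{-1} [J_x^{\mathcal{F}}(x_t)]^{-\top} \nabla_x \mathcal{L}(x_t), \qquad (1)$$

where $\nabla_a b$ and $J_a^b$ respectively denote the gradient and Jacobian of the function $b$ with respect to the variable $a$. For a symmetric Jacobian, such as the diagonal one used in Sec. 3.2, the scaling matrix reduces to $[J_x^{\mathcal{F}}(x_t)]^{-2}$.

Proof: Consider the point $x_t \in \mathcal{X}$ at time $t$ and its warped version $y_t = \mathcal{F}(x_t) \in \mathcal{Y}$. To find the local minimum of $\mathcal{L}'(y)$, the standard gradient descent method at time step $t$ in the warped space can be applied as follows:

$$y_{t+1} = y_t - \eta \nabla_y \mathcal{L}'(y_t). \qquad (2)$$

Here, $\nabla_y \mathcal{L}'(y_t)$ is the gradient and $\eta$ is the learning rate. Applying the inverse function $\mathcal{F}^{-1}$ to $y_{t+1}$, we obtain the updated point $x_{t+1}$:

$$x_{t+1} = \mathcal{F}^{-1}(y_{t+1}) = \mathcal{F}^{-1}\big(y_t - \eta \nabla_y \mathcal{L}'(y_t)\big) \approx x_t - \eta \, J_y^{\mathcal{F}^{-1}}(y_t) \, \nabla_y \mathcal{L}'(y_t), \qquad (3)$$

where the last equality is from the first-order Taylor approximation and $J_y^{\mathcal{F}^{-1}}$ is the Jacobian of $\mathcal{F}^{-1}$ with respect to $y$. By the chain rule, $\nabla_y \mathcal{L}'(y_t) = [J_y^{\mathcal{F}^{-1}}(y_t)]^{\top} \nabla_x \mathcal{L}(x_t)$. Because $J_y^{\mathcal{F}^{-1}}(y_t) = [J_x^{\mathcal{F}}(x_t)]^{-1}$, we can rewrite Eq.(3) as

$$x_{t+1} = x_t - \eta \, [J_x^{\mathcal{F}}(x_t)]^{-1} [J_x^{\mathcal{F}}(x_t)]^{-\top} \nabla_x \mathcal{L}(x_t). \qquad (4)$$

Now Eq.(2) and Eq.(4) are equivalent and Eq.(1) is proved. In other words, the scaled gradient descent in the original space $\mathcal{X}$ whose scaling is determined by the matrix $[J_x^{\mathcal{F}}(x_t)]^{-1} [J_x^{\mathcal{F}}(x_t)]^{-\top}$, which is the PSGD method, is equivalent to gradient descent in the warped space $\mathcal{Y}$.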
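The equivalence can be checked numerically in one dimension. The sketch below is only an illustration under stated assumptions: the warp F(x) = x^3 + x and the quadratic loss are hypothetical choices made because they are smooth and easy to invert, not the functions used in this paper, and all function names are invented for the example.

```python
def warp(x):
    """Illustrative invertible warp F(x) = x**3 + x (an assumption, not Eq. (5))."""
    return x ** 3 + x

def warp_grad(x):
    """Derivative F'(x) = 3x**2 + 1, always positive, so F is invertible."""
    return 3.0 * x ** 2 + 1.0

def warp_inverse(y, lo=-10.0, hi=10.0, iters=200):
    """Invert the monotone warp by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if warp(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def loss_grad(x):
    """Gradient of the toy loss L(x) = (x - 2)**2."""
    return 2.0 * (x - 2.0)

def warped_gd_step(x, lr):
    """One GD step taken in the warped space, mapped back to the original space."""
    y = warp(x)
    # Chain rule: dL'/dy = (dL/dx) * (dx/dy) = (dL/dx) / F'(x)
    grad_y = loss_grad(x) / warp_grad(x)
    return warp_inverse(y - lr * grad_y)

def scaled_gd_step(x, lr):
    """One scaled GD step in the original space, scaling by 1 / F'(x)**2."""
    return x - lr * loss_grad(x) / warp_grad(x) ** 2

x0, lr = 0.5, 1e-3
exact = warped_gd_step(x0, lr)
approx = scaled_gd_step(x0, lr)
# The two updates agree up to the second-order term dropped by the Taylor step.
print(abs(exact - approx))
```

The remaining difference comes from the first-order Taylor approximation in the proof, so it shrinks as the learning rate decreases.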

3.2 Position-based scaled gradient

In this part, we introduce one example of designing the invertible function for scaling the gradients. This invertible function should cause the original weight vector $x$ to merge to a set of desired target points $\{\bar{x}\}$. These desired target weights act as a prior in the optimization process, constraining the original weights to merge at specific positions. The details of how to set the target points are deferred to the next subsection.

The gist of weight-dependent gradient scaling is simple. For a given weight vector, if the specific weight element is far from the desired target point, a higher scaling value is applied so as to escape this position faster. On the other hand, if the distance is small, lower scaling value is applied to prevent the weight vector from deviating from the position. From now on, we focus on the design of the scaling function for the quantization problem. For pruning, the procedure is analogous and we omit the detail.

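Scaling function: We use the same warping function $f$ for each coordinate $x_i$ independently, i.e. $\mathcal{F}(x) = [f(x_1), f(x_2), \dots, f(x_n)]$. Thus the Jacobian matrix becomes diagonal ($J_x^{\mathcal{F}} = \mathrm{diag}(f'(x_1), \dots, f'(x_n))$) and our method belongs to the diagonally scaled gradient method.

Consider the following warping function

$$f(x) = 2\,\mathrm{sign}(x - \bar{x})\left(\sqrt{|x - \bar{x}| + \epsilon} + c(\bar{x})\right), \qquad (5)$$

where the target $\bar{x}$ is determined as the closest grid point from $x$, $\mathrm{sign}(\cdot)$ is a sign function and $c(\bar{x})$ is a constant dependent on the specific grid point, making the function continuous (footnote 1). $\epsilon$ is an arbitrarily small constant to avoid infinite gradient. Then, from Eq.(4), the elementwise scaling function becomes $s(x) = 1/[f'(x)]^2$ and consequently

$$s(x) = |x - \bar{x}| + \epsilon. \qquad (6)$$

Using the elementwise scaling function Eq.(6), the elementwise weight update rule for the PSG descent (PSGD) becomes

$$x_{i,t+1} = x_{i,t} - \eta \, s(x_{i,t}) \, \frac{\partial \mathcal{L}}{\partial x_i}\bigg|_{x_t}, \qquad (7)$$

where $\eta$ is the learning rate (footnote 2).

Footnote 1: Details on $c(\bar{x})$ can be found in the supplementary. Also, another example of warping function and its experimental results are included in the supplementary.
Footnote 2: We set $\eta = \lambda_s \eta_0$, where $\eta_0$ is the conventional learning rate and $\lambda_s$ is a hyper-parameter that can be set differently for various scaling functions depending on their range.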
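The elementwise update can be sketched in a few lines. This is a minimal illustration, assuming a scaling equal to the distance to the nearest target plus a small constant (the paper describes the scaling as proportional to that distance); the grid values, the names `psg_scale`, `psgd_step` and `lambda_s`, and the toy gradients are all hypothetical.

```python
def nearest_target(x, targets):
    """Return the grid point closest to the weight x."""
    return min(targets, key=lambda t: abs(x - t))

def psg_scale(x, targets, eps=1e-5):
    """Elementwise scaling: distance to the nearest target plus a small eps.

    Assumed form s(x) = |x - x_bar| + eps; eps keeps the scale non-zero.
    """
    return abs(x - nearest_target(x, targets)) + eps

def psgd_step(weights, grads, targets, lr=0.1, lambda_s=1.0):
    """One PSGD update: w <- w - (lambda_s * lr) * s(w) * grad, per element.

    lambda_s is a hypothetical name for the extra learning-rate multiplier.
    """
    eta = lambda_s * lr
    return [w - eta * psg_scale(w, targets) * g for w, g in zip(weights, grads)]

targets = [-0.5, 0.0, 0.5]      # example grid points (assumed values)
weights = [0.02, 0.24, -0.49]
grads = [1.0, 1.0, 1.0]         # identical gradients for every weight

new_weights = psgd_step(weights, grads, targets)
for w, w_new in zip(weights, new_weights):
    print(f"{w:+.2f} -> {w_new:+.6f} (moved {abs(w_new - w):.6f})")
```

With identical gradients, the weight farthest from its nearest grid point takes the largest step, while weights already sitting near a grid point barely move.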

Figure 2: Toy example of warping a loss function. Left: the original loss function. Right: the loss drawn by warping the original function by Eq. (5) around the target point.

3.3 Target points

Quantization: In this paper, we use the uniform symmetric quantization method krishnamoorthi2018quantizing and the per-tensor quantization scheme for hardware friendliness. Consider a floating point range [$\min_x$, $\max_x$] of model weights. The weight $x$ is quantized to an integer ranging [$-2^{n-1}+1$, $2^{n-1}-1$] for $n$-bit precision. Quantization-dequantization for the weights of a network is defined with step-size ($\Delta$) and clipping values. The overall quantization process is as follows:

$$x_I = \mathrm{round}\left(\frac{x}{\Delta}\right), \qquad x_Q = \mathrm{clip}\left(x_I, -2^{n-1}+1, 2^{n-1}-1\right), \qquad (8)$$

where $\mathrm{round}(\cdot)$ is the round to the closest integer operation and

$$\mathrm{clip}(x, a, b) = \begin{cases} a & x < a \\ x & a \le x \le b \\ b & x > b. \end{cases}$$

We can get the quantized weights with the de-quantization process as $\bar{x} = x_Q \cdot \Delta$ and use these quantized weights as the target positions of quantization.

Sparse Training: For magnitude-based pruning methods, weights near zero are removed. Therefore, we choose zero as the target value (i.e. $\bar{x} = 0$).
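The quantize-then-dequantize procedure that produces the target points can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes the step size is the largest absolute weight divided by the largest integer level, and the function name and sample weights are hypothetical.

```python
def quantize_dequantize(weights, n_bits):
    """Uniform symmetric per-tensor quantization followed by de-quantization.

    Assumption: step = max|w| / (2**(n_bits - 1) - 1), so the integer grid is
    [-(2**(n_bits - 1) - 1), 2**(n_bits - 1) - 1]. Requires a non-zero weight.
    """
    q_max = 2 ** (n_bits - 1) - 1
    step = max(abs(w) for w in weights) / q_max
    targets = []
    for w in weights:
        # Python's round() resolves exact .5 ties to the nearest even integer.
        q = round(w / step)
        q = max(-q_max, min(q_max, q))   # clip to the integer range
        targets.append(q * step)         # de-quantize: the target position
    return targets, step

weights = [-0.8, -0.3, 0.05, 0.35, 0.8]
targets, step = quantize_dequantize(weights, n_bits=2)
print(step, targets)
```

At 2 bits the grid has only three levels, so every weight is assigned one of three target positions. For sparse training, the target list would simply be all zeros.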

3.4 PSGD for deep networks

Many works on the optimization of DNNs with stochastic gradient descent (SGD) have reported that multiple experiments give consistently similar performance although DNNs have many local minima (e.g. see Sec. 2 of ChaudhariCSL17 ). choromanska2015loss analyzed the loss surface of DNNs and showed that large networks have many local minima with similar performance on the test set and that the lowest critical values of the random loss function are located in a specific band lower-bounded by the global minimum. From this perspective, we explain informally how PSGD for deep networks works. As illustrated in Fig. 2, we posit that there exist many local minima in the original weight space with similar performance, only some of which are close to one of the target points (0) and therefore exhibit high performance also in the compressed domain. As in Fig. 2 left, assume that the region of convergence of a minimum far from the target points is much wider than that of a minimum close to a target point, meaning that random initialization is more likely to lead to the former than to the latter. By the warping function specially designed as described above (Eq. 5), the original space is warped such that the areas near target points are expanded while those far from the targets are contracted. If we apply gradient descent in this warped space, the optimization will have a better chance of converging to the minimum near the target point. Correspondingly, PSGD in the original space will more likely output the minimum close to the target point rather than the one far from it, which is favorable for compression. Note that the warping function transforms the original weight space to the warped space, not to the compressed domain.

4 Experiments

In this section, we experimentally show the effectiveness of the PSGD. To verify our PSGD method, we first conduct experiments for sparse training by setting the target point as 0, then we further extend our method to quantization with CIFAR krizhevsky2009learning and ImageNet ILSVRC 2015 ILSVRC15 dataset. We first demonstrate the effectiveness in sparse training with magnitude-based pruning by comparing with L0-regularization louizos2018learning and SNIP lee2018snip . louizos2018learning penalizes the non-zero model parameters and shares the scheme of regularizing the model while training. Like ours, lee2018snip is a single-shot pruning method, which does not require pruning schedules nor additional fine-tuning.

For quantization, we compare our method with (1) methods that employ regularization at the initial training phase alizadeh2020gradient ; gulrajani2017improved ; lin2018defensive . We choose gradient L1 norm regularization alizadeh2020gradient method and Lipschitz regularization methods lin2018defensive ; gulrajani2017improved from the original paper alizadeh2020gradient as baselines, because they propose new regularization techniques used at the training phase similar to us. Note that gulrajani2017improved adds an L2 penalty term on the gradient of weights instead of the L1 penalty like alizadeh2020gradient . We also compare with (2) existing state-of-the-art layer-wise post-training quantization methods that start from pre-trained models nagel2019data ; zhao2019improving to show the improvement in lower bits (4 bits). Refer to Section 2 for the details on the compared methods. To validate the effectiveness of our method, we also train our model for extremely low-bit (2,3 bits) weights. Lastly, we show the experimental results on various network architectures with PSGD and applying PSG to the Adam optimizer kingma2014adam , which are detailed in the supplementary.

Implementation details  We used the PyTorch framework for all experiments. For the sparse training experiment of Table 1, we used ResNet-32 he2016deep on CIFAR-100, following the training hyperparameters of kim2019qkd . We used the released official implementation of louizos2018learning and re-implemented lee2018snip for the PyTorch framework. For the quantization experiments of Table 2 and 4, we used ResNet-18 and followed the settings of alizadeh2020gradient for CIFAR-10 and ImageNet. For zhao2019improving , the released official implementation was used. All other numbers are either from the original paper or re-implemented. For fair comparison, all quantization experiments followed the layer-wise uniform symmetric quantization krishnamoorthi2018quantizing , and when quantizing the activation, we clipped the activation range using batch normalization parameters as described in nagel2019data , same as alizadeh2020gradient . PSGD is applied from the last 15 epochs for ImageNet experiments and from the first learning rate decay epoch for CIFAR experiments. We use an additional 30 epochs for PSGD in the very low-bit experiments (Table 4). Also, we tuned the hyper-parameter for each bit-width and sparsity. Our search criterion is ensuring that the performance of the uncompressed model is not degraded, similar to alizadeh2020gradient . More details are in the supplementary.

Table 1: Test accuracy of ResNet-32 across different sparsity ratios (percentage of zeros) on CIFAR-100 after magnitude-based pruning han2015learning .

| Method \ Sparsity (%) | 20.0 | 50.0 | 70.0 | 80.0 | 90.0 |
|---|---|---|---|---|---|
| SGD | 69.43 | 60.59 | 15.95 | 4.70 | 1.00 |
| L0 Reg. louizos2018learning | 67.56 | 64.49 | 49.73 | 23.95 | 2.85 |
| SNIP lee2018snip | 69.68 | 68.73 | 66.76 | 65.67 | 60.14 |
| PSGD (Ours) | 69.63 | 69.25 | 68.62 | 67.27 | 64.33 |

Figure 3: The weight distribution of SGD and PSGD models.

4.1 Sparse Training

As a preliminary experiment, we first demonstrate that PSG-based optimization is possible with a single target point set at zero. Then, we apply magnitude-based pruning following han2015learning across different sparsity ratios. As the purpose of the experiment is to verify that the weights are centered on zero, weights are pruned once after training has completed and the model is evaluated without fine-tuning for louizos2018learning and ours. Results for lee2018snip , which prunes the weights by single-shot at initialization, are shown for comparison on single-shot pruning.

Table 1 indicates that our method outperforms the two methods across various high sparsity ratios. While all three methods are able to maintain accuracy at low sparsity (10%), louizos2018learning has some accuracy degradation at 20% and suffers severely at high sparsity. This is in line with the results shown in Gale2019TheSO that the method was unable to produce sparse residual models without significant damage in model quality. Comparing with lee2018snip , our method is able to maintain higher accuracy even at high sparsity, displaying the strength in single-shot pruning, in which no pruning schedules nor additional training are necessary. Fig. 3 shows the distribution of weights in SGD- and PSGD-trained models.
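The single-shot magnitude-based pruning step used in this comparison can be sketched as below. The weight lists are made-up toy values meant only to contrast a spread-out distribution with one concentrated around zero; the function names are hypothetical and this is not the evaluation code behind Table 1.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(round(len(weights) * sparsity))
    # Indices sorted by increasing |w|; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def pruning_error(weights, sparsity):
    """Mean-squared difference between the dense and the pruned weights."""
    pruned = magnitude_prune(weights, sparsity)
    return sum((w - p) ** 2 for w, p in zip(weights, pruned)) / len(weights)

# Hypothetical weights: a spread-out set vs. one clustered around zero.
spread = [-0.9, -0.6, -0.4, -0.2, 0.1, 0.3, 0.5, 0.7, 0.8, 1.0]
near_zero = [-0.9, -0.01, -0.02, 0.0, 0.01, 0.02, 0.01, -0.01, 0.8, 1.0]

for name, w in [("spread", spread), ("near-zero", near_zero)]:
    print(name, pruning_error(w, sparsity=0.7))
```

When most weights already sit near zero, removing them changes the weight vector very little, which is the intuition behind regularizing the weights toward a zero target before pruning.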

Table 2: Test accuracy of regularization methods that do not have a post-training process for ResNet-18 on the ImageNet and CIFAR-10 datasets. PSGD@W# indicates that the target number of bits for weights in PSGD is #. All numbers except ours are from alizadeh2020gradient . At #-bit, PSGD@W# performs the best in most cases.

| Method | ImageNet FP | ImageNet W8A8 | ImageNet W6A6 | ImageNet W4A4 | CIFAR-10 FP | CIFAR-10 W8A4 | CIFAR-10 W6A4 | CIFAR-10 W4A4 |
|---|---|---|---|---|---|---|---|---|
| SGD | 69.70 | 69.20 | 63.80 | 0.30 | 93.54 | 85.51 | 85.35 | 83.98 |
| DQ Regularization lin2018defensive | 68.28 | 67.76 | 62.31 | 0.24 | 92.46 | 83.31 | 83.34 | 82.47 |
| Gradient L2 gulrajani2017improved | 68.34 | 68.02 | 64.52 | 0.19 | 93.31 | 84.50 | 84.99 | 83.82 |
| Gradient L1 alizadeh2020gradient | 70.07 | 69.92 | 66.39 | 0.22 | 93.36 | 88.70 | 88.45 | 87.62 |
| Gradient L1 () alizadeh2020gradient | 64.02 | 63.76 | 61.19 | 55.32 | - | - | - | - |
| PSGD@W8 (Ours) | 70.22 | 70.13 | 66.02 | 0.60 | 93.67 | 93.10 | 93.03 | 90.65 |
| PSGD@W6 (Ours) | 70.07 | 69.83 | 69.51 | 0.29 | 93.54 | 92.76 | 92.88 | 90.55 |
| PSGD@W4 (Ours) | 68.18 | 67.63 | 62.73 | 63.45 | 93.63 | 93.04 | 93.12 | 91.03 |

Table 3: Comparison with post-training quantization methods for ResNet-18 on the ImageNet dataset. Results of DFQ are from nagel2019data .

| Method | W8A8 | W6A6 | W4A4 |
|---|---|---|---|
| DFQ nagel2019data | 69.7 | 66.3 | - |
| OCS + Best Clip zhao2019improving | 69.37 | 66.76 | 44.3 |
| PSGD (Ours) | 70.13 | 69.51 | 63.45 |

Table 4: Very low-bit accuracy of ResNet-18 on the ImageNet dataset. The first convolutional layer and the last linear layer are quantized at 8 bits. Activation is fixed to 8 bits. The first pair of columns corresponds to the 3-bit setting and the second pair to the 2-bit setting.

| Method | FP | W3A8 | FP | W2A8 |
|---|---|---|---|---|
| SGD | 69.76 | 0.10 | 69.76 | 0.10 |
| PSGD (Ours) | 66.75 | 66.36 | 64.60 | 62.65 |

4.2 Quantization

In the quantization domain, we first compare PSGD with regularization methods at the on-the-fly bit-widths problem, meaning that a single model is evaluated across various bit-widths. Then, we compare with existing state-of-the-art layer-wise symmetric post-training methods to verify handling the problem of accuracy drop at low-bits due to the differences in weight distributions (See Fig. 1).
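Evaluating one set of weights across several bit-widths amounts to repeating the same quantize-then-dequantize step with different grids and measuring the error. The sketch below illustrates this on evenly spread toy weights; the step-size rule (largest absolute weight divided by the largest integer level) and all names are assumptions for illustration, not the evaluation code used for the tables.

```python
def quantize_dequantize(weights, n_bits):
    """Uniform symmetric per-tensor quantization followed by de-quantization.

    Assumption: step = max|w| / (2**(n_bits - 1) - 1).
    """
    q_max = 2 ** (n_bits - 1) - 1
    step = max(abs(w) for w in weights) / q_max
    return [max(-q_max, min(q_max, round(w / step))) * step for w in weights]

def quantization_mse(weights, n_bits):
    """Mean-squared error between full-precision and quantized weights."""
    quantized = quantize_dequantize(weights, n_bits)
    return sum((w - q) ** 2 for w, q in zip(weights, quantized)) / len(weights)

# Toy full-precision weights, evenly spread in [-0.5, 0.5] (hypothetical).
weights = [i / 100.0 for i in range(-50, 51)]

errors = {bits: quantization_mse(weights, bits) for bits in (8, 6, 4, 3, 2)}
for bits, mse in errors.items():
    print(f"{bits}-bit MSE: {mse:.3e}")
```

For weights that are not concentrated on the grid, the error grows quickly as the bit-width shrinks, which is the gap that training toward the grid points is meant to close.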

Regularization methods  Table 2 shows the results of regularization methods on the CIFAR-10 and ImageNet datasets. In the CIFAR-10 experiments of Table 2, we fix the activation bit-width at 4 bits and then vary the weight bit-width from 8 to 4. For the ImageNet experiments of Table 2, we use equal bit-widths for both weights and activations, following alizadeh2020gradient . In the CIFAR-10 experiment, all methods seem to maintain the performance of the quantized model down to 4-bit quantization. Regardless of the target bit-width, PSGD outperforms all other regularization methods.

On the other hand, the ImageNet experiment generally shows reasonable results down to 6 bits, but the accuracy drops drastically at 4 bits. PSGD targeting 8 bits and 6 bits marginally improves on all bits, yet also experiences a drastic accuracy drop at 4 bits. In contrast, Gradient L1 () and PSGD@W4 maintain the performance of the quantized models even at 4 bits. Compared with the second best method, Gradient L1 () alizadeh2020gradient , PSGD outperforms it at all bit-widths. At full precision (FP), 8, 6 and 4 bits, the performance gap between alizadeh2020gradient and ours is about 4.2%, 3.8%, 1.8% and 8%, respectively. From Table 2, while the quantization noise may slightly degrade the accuracy in some cases, a general trend that using more bits leads to higher accuracy is demonstrated. Consequently, as our 4-bit targeted model (PSGD@W4) achieves comparable accuracy for all higher bits of the other methods, PSGD outperforms conventional methods in all bit-widths. Compared to other regularization methods, PSGD is able to maintain reasonable performance across all bits by constraining the distribution of the full-precision weights to resemble that of the quantized weights. This quantization-friendliness is achieved by the appropriately designed scaling function. Also, unlike alizadeh2020gradient , PSGD does not need the additional overhead of computing the double-backpropagation.

Post-training methods  Table 3 shows that OCS, a state-of-the-art post-training method, has a drastic accuracy drop at 4 bits. For OCS, following the original paper, we chose the best clipping method for both weights and activation. DFQ shows a similar tendency, with a drastic accuracy drop under 6 bits, as depicted in Fig. 1 of the original DFQ paper nagel2019data . This is due to the fundamental discrepancy between FP and quantized weight distributions as stated in Sec 1 and Fig. 1. On the other hand, models trained with PSGD have similar full-precision and quantized weight distributions and hence low quantization error due to the scaling function. Our method outperforms OCS at 4 bits by around 19% without any post-training.

Very low bit quantization  As shown in Fig. 1, SGD suffers drastic accuracy drops at very low bits such as 3 and 2 bits. To confirm that PSGD can handle very low bits, we conduct experiments with PSGD targeting 3 and 2 bits, except for the first and last layers, which are quantized at 8 bits. Table 4 shows the results of applying PSGD at very low bits. Although the FP accuracy of PSGD drops because of the strong constraints from very low-bit targets, PSGD works well at very low-bit quantization. This shows that PSGD can be a key solution for post-training quantization at very low bits.

(a) Weight distribution (1st layer)
(b) Histogram of eigenvalues
Figure 4: Weight distribution and histogram of eigenvalues larger than 0.1 for MNIST dataset. The two-layered fully connected network consists of 50 and 20 hidden nodes. Target bit of PSGD is 2.

5 Discussion

In this section, we focus on the local minima found by PSG with a toy example to better understand PSG. We train a fully-connected network consisting of two hidden layers (50, 20 neurons) on the MNIST dataset with SGD and with PSGD targeting 2 bits. In this toy example, we quantize only the weights and not the activation. We show the weight distributions of the two models trained with SGD and PSGD at the first layer. Then, we calculate the eigenvalues of the entire Hessian matrix to analyze the curvature of the local loss surface.

Quantized and sparse model  SGD generally yields a bell-shaped distribution of weights, which is not well suited to low-bit quantization zhao2019improving . On the other hand, PSGD always provides a multi-modal distribution peaked at the quantized values. For this example, the number of bins is three (2 bits), so the weights are merged into three clusters as depicted in Fig. 4a. A large proportion of the weights are near zero, similar to Fig. 3. This is because symmetric quantization also contains zero as a target bin. At 2 bits, PSGD has nearly the same accuracy as at FP (96%). In contrast, the accuracy of SGD at 2 bits is about 9%, although its FP accuracy is 97%. This tendency is also shown in Fig. 1b, which demonstrates that PSGD reduces the quantization error.

Curvature of PSGD solution  As explained in Sec 3.1, PSGD, which is equivalent to performing SGD in the warped space, can be used to find compression-friendly minima. In Sec 3.4 and Fig. 2, we claimed that PSG finds a minimum in a sharp valley that is more compression-friendly but has less chance of being found. As the curvature in the direction of a Hessian eigenvector is determined by the corresponding eigenvalue Goodfellow-et-al-2016 , we compare the curvature of the solutions yielded by SGD and PSGD by assessing the magnitude of the eigenvalues, similar to chaudhari2019entropy . SGD provides minima with relatively wide valleys because it has many near-zero eigenvalues, and a similar tendency is observed in chaudhari2019entropy . However, the weights trained by PSGD have many more large positive eigenvalues, which means the solution lies in a relatively sharp valley compared to SGD. Specifically, the number of large eigenvalues in PSGD is 9 times more than that of SGD. From this toy example, we confirm that PSG helps to find minima which are more compression-friendly (Fig. 4a) and lie in sharp valleys (Fig. 4b) that are hard to reach by normal SGD.
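The idea of judging sharpness by counting large Hessian eigenvalues can be illustrated on a two-parameter toy problem. The sketch below is not the analysis pipeline used for Fig. 4: the two quadratic losses, the finite-difference Hessian and the function names are hypothetical, and only the 0.1 threshold is borrowed from the Figure 4 caption.

```python
import math

def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference Hessian entries (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h ** 2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in closed form."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return mean - radius, mean + radius

def count_large(eigs, threshold=0.1):
    """Number of eigenvalues above the threshold."""
    return sum(1 for e in eigs if e > threshold)

def wide_loss(x, y):
    """Toy minimum in a wide valley: almost no curvature along y."""
    return 0.5 * (2.0 * x ** 2 + 0.01 * y ** 2)

def sharp_loss(x, y):
    """Toy minimum in a sharp valley: large curvature in both directions."""
    return 0.5 * (2.0 * x ** 2 + 3.0 * y ** 2)

wide_eigs = eigenvalues_sym_2x2(*hessian_2d(wide_loss, 0.0, 0.0))
sharp_eigs = eigenvalues_sym_2x2(*hessian_2d(sharp_loss, 0.0, 0.0))
print(count_large(wide_eigs), count_large(sharp_eigs))
```

A minimum with more eigenvalues above the threshold curves upward in more directions, which is the sense in which the PSGD solution is described as lying in a sharper valley.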

6 Conclusion

In this work, we introduce the position-based scaled gradient (PSG), which scales the gradient in proportion to the distance between the current weight and the corresponding nearest target point. We prove that the stochastic PSG descent (PSGD) is equivalent to applying SGD in the warped space. Based on the hypothesis that a DNN has many local minima with similar performance on the test set, PSGD is able to find a compression-friendly minimum that is hard to reach by other optimizers. PSGD can be a key solution to low-bit post-training quantization because PSGD reduces the quantization error, meaning that the distributions of the compressed and uncompressed weights are similar. Because the target points act as a prior that constrains the original weights to merge at specific positions, PSGD can also be used for sparse training by simply changing the target point to 0. In our experiments, we verify PSGD in the domains of sparse training and quantization by showing its effectiveness on various image classification datasets such as CIFAR-10/100 and ImageNet. Also, we empirically show that PSGD finds minima which are located in sharper valleys than those of SGD. We believe that PSGD will help further research in model quantization and sparse training.

References

  • (1) Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient regularization for quantization robustness. In International Conference on Learning Representations, 2020.
  • (2) Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.
  • (3) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • (4) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
  • (5) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
  • (6) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
  • (7) William C Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1–17, 1991.
  • (8) John E Dennis, Jr and Jorge J Moré. Quasi-newton methods, motivation and theory. SIAM review, 19(1):46–89, 1977.
  • (9) Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019.
  • (10) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • (11) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • (12) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • (13) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • (14) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • (15) Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • (16) Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
  • (17) Jangho Kim, Yash Bhalgat, Jinwon Lee, Chirag Patel, and Nojun Kwak. Qkd: Quantization-aware knowledge distillation. arXiv preprint arXiv:1911.12491, 2019.
  • (18) Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pages 2760–2769, 2018.
  • (19) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (20) Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper, 2018.
  • (21) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • (22) Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
  • (23) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • (24) Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019.
  • (25) Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations, 2018.
  • (26) Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1325–1334, 2019.
  • (27) Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • (28) Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
  • (29) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • (30) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • (31) Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. International Conference on Machine Learning (ICML), pages 7543–7552, June 2019.
  • (32) Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.