1 Introduction
Many regularization strategies have been proposed to induce a prior on a neural network hoerl1970ridge ; tibshirani1996regression ; hinton2015distilling ; kim2018paraphrasing . Inspired by such regularization methods, which add a prior or constraint for a specific purpose, in this paper we propose a novel regularization method that non-uniformly scales the gradient for model compression problems. The scaled gradient, whose scale depends on the position of the weight, constrains the weight to a set of compression-friendly grid points. We replace the conventional gradient in stochastic gradient descent (SGD) with the proposed position-based scaled gradient (PSG) and call the resulting method PSGD. We show that PSGD in the original weight space is equivalent to optimizing the weights by standard SGD in a warped space, to which weights from the original space are mapped by an invertible function. This function is designed such that, by scaling the gradients, the weights of the original space are forced to merge to the desired target positions.
We are not the first to scale the gradient elements. The scaled gradient method, also known as the variable metric method davidon1991variable , multiplies the gradient vector by a positive definite matrix to scale the gradient. It includes a wide variety of methods such as Newton's method, quasi-Newton methods and the natural gradient method dennis1977quasi ; nocedal2006numerical ; bottou2010large . Generally, these rely on a Hessian estimate or the Fisher information matrix for their scaling. Our method differs in that the scaling does not depend on the loss function; it depends solely on the current position of the weight.
We apply the proposed PSG method to model compression problems such as quantization and pruning. In recent years, deploying deep neural networks (DNNs) on restricted edge devices such as smartphones and IoT devices has become a very important issue. For this reason, reducing the bit-width of model weights (quantization) and removing unimportant model weights (pruning) have been studied and widely used in applications. The majority of the quantization literature starts with a pretrained model and fine-tunes or retrains it using the entire training dataset. However, this scenario is restrictive in real-world applications because additional training is needed. In the additional training phase, the full-size dataset and high computational resources are required, which prohibits easy and fast deployment of DNNs on edge devices for customers in need.
To resolve this problem, many works have focused on post-training quantization methods that do not require training data krishnamoorthi2018quantizing ; nagel2019data ; banner2019post ; zhao2019improving . For example, nagel2019data starts with a pretrained model and makes only minor modifications to the weights by equalizing the scales across channels and correcting biases. However, the inherent discrepancy between the distribution of the pretrained model and that of the quantized model is too large for the aforementioned methods to offset. As shown in Fig. 1, due to the difference between the two distributions, the quantization and classification errors increase as lower bit-widths are used. Accordingly, in layer-wise quantization, existing post-training methods suffer significant accuracy degradation when quantizing below 6 bits.
Meanwhile, another line of research in quantization approaches the task from the initial training phase. Gradient regularization alizadeh2020gradient trains the model with a gradient regularizer at the initial training phase for quantization robustness across different bit-widths. While our method follows this scheme of training from scratch, unlike alizadeh2020gradient , no explicit regularization term is added to the loss function. Instead, the gradients are scaled depending on the position of the weights. Our main goal is to train a robust model that can be easily switched from an uncompressed mode to a compressed mode when resources are limited, without retraining, fine-tuning or even accessing the data. To achieve this, we constrain the original weights to merge to a set of quantized grid points (Fig. 1(b)) by scaling their gradients in proportion to the error between each original weight and its quantized version. For the sparse training case, the weights are regularized to merge to zero. More details are described in Sec. 3.
Our contributions can be summarized as follows:
We propose a novel regularization method for model compression by introducing the position-based scaled gradient (PSG), which can be considered a variant of the variable metric method.
We prove theoretically that PSG descent (PSGD) is equivalent to applying standard gradient descent in the warped weight space. This leads the weight to converge to a well-performing local minimum in both the compressed and uncompressed weight spaces (see Fig. 1(b)).
We apply PSG to quantization and sparse training and verify its effectiveness on the CIFAR and ImageNet datasets. We also show that PSGD is very effective for extremely low-bit quantization.
2 Related work
Quantization Post-training quantization aims to quantize weights and activations to discrete grid points without additional training or use of the training data. The majority of recent works start from a pretrained network trained by a standard training scheme zhao2019improving ; nagel2019data ; banner2019post . Many channel-wise quantization methods, which require storing quantization parameters per channel, focus on treating outliers in the activation maps. Alternatively, layer-wise quantization methods are more hardware-friendly and incur minimal overhead at inference time, as they store quantization parameters per layer (as opposed to per channel) nagel2019data ; krishnamoorthi2018quantizing ; zhao2019improving . nagel2019data achieves near full-precision accuracy at 8 bits by bias correction and range equalization of channels, while zhao2019improving splits channels with outliers to reduce the clipping error. However, both suffer from severe accuracy degradation under 6 bits. Our method improves on, but is not limited to, uniform layer-wise quantization. Meanwhile, another line of work in quantization has focused on robustness by regularizing the weight distribution from the training phase. lin2018defensive minimizes the Lipschitz constant to regularize the gradients for robustness against adversarial attacks. Similarly, alizadeh2020gradient proposes a new regularization term on the norm of the gradients for quantization robustness across different bit-widths. This enables "on-the-fly" quantization to various bit-widths. Our method has no explicit regularization term but scales the gradients to implicitly regularize the weights in the full-precision domain, making them quantization-friendly. By doing so, we achieve state-of-the-art accuracies for layer-wise quantization as well as robustness across various bit-widths. Additionally, we do not introduce significant training overhead because gradient-norm regularization is not necessary, whereas alizadeh2020gradient necessitates double backpropagation, which increases the training complexity.
Pruning Another relevant line of research in model compression is pruning, in which unimportant units such as weights, filters, or entire blocks are removed huang2018data ; li2016pruning . Recent works have focused on pruning methods that include the pruning process in the training phase renda2020comparing ; zhu2017prune ; louizos2018learning ; lee2018snip . Among them, a substantial number of works utilize sparsity-inducing regularization. louizos2018learning trains with an L0-norm regularizer on individual weights to obtain a sparse network, using the expected L0 objective to relax the otherwise non-differentiable regularization term. Meanwhile, other works use saliency criteria. lee2018snip utilizes gradients of masks as a proxy for importance to prune networks in a single shot. Similar to lee2018snip and louizos2018learning , our method needs neither a heuristic pruning schedule during training nor additional fine-tuning after pruning. In our method, pruning is formulated as a subclass of quantization, because PSG can be used for sparse training by setting the target value to zero instead of the quantized grid points.
3 Proposed method
In this section, we describe the proposed position-based scaled gradient descent (PSGD) method. In PSGD, a scaling function regularizes each original weight to merge to one of the desired target points that performs well in both the uncompressed and compressed domains. This is equivalent to optimizing via SGD in a warped weight space. With a specially designed invertible function that warps the original weight space, the loss function in this warped space converges to different local minima that are more compression-friendly than the solutions obtained in the original weight space.
We first prove that optimizing in the original space with PSGD is equivalent to optimizing in the warped space with gradient descent. Then, we demonstrate how PSGD is used to constrain the weights to a set of desired target points. Lastly, we show how this method yields performance comparable to that of vanilla SGD in the original uncompressed domain, despite being strongly regularized.
3.1 Optimization in warped space
Theorem: Let $X = f(W)$, $f: \mathbb{R}^n \rightarrow \mathbb{R}^n$, be an arbitrary invertible multivariate function that warps the original weight space $\mathcal{W}$ into $\mathcal{X}$ and consider the loss function $L(W)$ and the equivalent loss function $\tilde{L}(X) \triangleq L(f^{-1}(X)) = L(W)$. Then, the gradient descent (GD) method in the warped space is equivalent to applying a scaled gradient descent in the original space such that

$X_{t+1} = X_t - \eta \nabla_X \tilde{L}(X_t) \iff W_{t+1} = W_t - \eta \, [J_W^T J_W]^{-1} \nabla_W L(W_t)$, (1)

where $\nabla_X \tilde{L}$ denotes the gradient of $\tilde{L}$ with respect to $X$ and $J_W$ denotes the Jacobian of $f$ with respect to $W$.
Proof: Consider the point $W_t$ at time $t$ and its warped version $X_t = f(W_t)$. To find a local minimum of $\tilde{L}$, the standard gradient descent method at time step $t$ in the warped space can be applied as follows:

$X_{t+1} = X_t - \eta \nabla_X \tilde{L}(X_t)$. (2)

Here, $\nabla_X \tilde{L}(X_t)$ is the gradient and $\eta$ is the learning rate. Applying the inverse function $f^{-1}$ to $X_{t+1}$, we obtain the updated point $W_{t+1}$:

$W_{t+1} = f^{-1}(X_{t+1}) = f^{-1}(X_t - \eta \nabla_X \tilde{L}(X_t)) \approx W_t - \eta \, J_X \nabla_X \tilde{L}(X_t)$, (3)

where the last equality is from the first-order Taylor approximation and $J_X = \partial f^{-1}(X) / \partial X$ is the Jacobian of $f^{-1}$ with respect to $X$. By the chain rule, $\nabla_X \tilde{L}(X) = J_X^T \nabla_W L(W)$. Because $J_X = J_W^{-1}$, we can rewrite Eq.(3) as

$W_{t+1} = W_t - \eta \, J_W^{-1} J_W^{-T} \nabla_W L(W_t) = W_t - \eta \, [J_W^T J_W]^{-1} \nabla_W L(W_t)$. (4)

Now Eq.(2) and Eq.(4) are equivalent and Eq.(1) is proved. In other words, the scaled gradient descent in the original space whose scaling is determined by the matrix $P = [J_W^T J_W]^{-1}$, which is the PSGD method, is equivalent to gradient descent in the warped space $\mathcal{X}$.
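As a sanity check, the equivalence stated in Eq. (1) can be verified numerically in one dimension. The sketch below (our own construction, using a square-root warp toward a target at zero and a toy quadratic loss) compares one GD step taken in the warped space and mapped back with one scaled-gradient step taken directly in the original space:

```python
import math

# Toy loss in the original space and its gradient (minimum at w = 0.7).
L = lambda w: (w - 0.7) ** 2
dL = lambda w: 2.0 * (w - 0.7)

# Invertible warp x = f(w) toward target t, with eps for numerical stability.
t, eps = 0.0, 1e-8
f = lambda w: math.copysign(math.sqrt(abs(w - t) + eps), w - t)
f_prime = lambda w: 1.0 / (2.0 * math.sqrt(abs(w - t) + eps))

w, lr = 0.3, 1e-3

# (a) One GD step in the warped space, then map back with f^{-1}.
# By the chain rule, dL~/dx = dL/dw * dw/dx = dL/dw / f'(w).
x = f(w)
x_new = x - lr * dL(w) / f_prime(w)
w_warped = math.copysign(x_new * x_new - eps, x_new)  # f^{-1}(x_new)

# (b) One scaled-gradient (PSGD) step in the original space,
# with scale s(w) = 1 / f'(w)^2 (the 1D case of [J^T J]^{-1}).
w_psgd = w - lr * dL(w) / f_prime(w) ** 2

print(abs(w_warped - w_psgd))  # small: equal up to the first-order Taylor error
```

The two updates agree up to the first-order Taylor remainder, matching the derivation of Eq. (3) and Eq. (4).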
3.2 Position-based scaled gradient
In this part, we introduce one example of designing the invertible function $f$ for scaling the gradients. This invertible function should cause each element of the original weight vector to merge to one of a set of desired target points. These desired target weights act as a prior in the optimization process, constraining the original weights to merge at specific positions. The details of how to set the target points are deferred to the next subsection.
The gist of weight-dependent gradient scaling is simple. For a given weight vector, if a specific weight element is far from its desired target point, a higher scaling value is applied so as to escape this position faster. On the other hand, if the distance is small, a lower scaling value is applied to prevent the weight from deviating from the position. From now on, we focus on the design of the scaling function for the quantization problem. For pruning, the procedure is analogous and we omit the details.
Scaling function: We use the same warping function for each coordinate independently, i.e. $f(W) = [f(w_1), f(w_2), \ldots, f(w_n)]^T$. Thus the Jacobian matrix becomes diagonal ($J_W = \mathrm{diag}(f'(w_1), \ldots, f'(w_n))$) and our method belongs to the class of diagonally scaled gradient methods.

Consider the following warping function

$f(w) = \mathrm{sign}(w - t_w)\,\sqrt{|w - t_w| + \epsilon} + c_{t_w}$, (5)

where the target $t_w$ is determined as the grid point closest to $w$, $\mathrm{sign}(\cdot)$ is the sign function and $c_{t_w}$ is a constant dependent on the specific grid point, chosen to make $f$ continuous (details on $c_{t_w}$, as well as another example of a warping function and its experimental results, are included in the supplementary material). $\epsilon$ is an arbitrarily small constant that avoids an infinite gradient at $w = t_w$. Then, from Eq.(4), the element-wise scaling function becomes $s(w) = 1/|f'(w)|^2$ and consequently

$s(w) = 4\,(|w - t_w| + \epsilon)$. (6)

Using the element-wise scaling function of Eq.(6), the element-wise weight update rule for PSG descent (PSGD) becomes

$w_i^{(t+1)} = w_i^{(t)} - \tilde{\eta}\, s(w_i^{(t)})\, \nabla_{w_i} L(W^{(t)})$, (7)

where $\tilde{\eta}$ is the learning rate (we set $\tilde{\eta} = \gamma \eta$, where $\eta$ is the conventional learning rate and $\gamma$ is a hyperparameter that can be set differently for various scaling functions depending on their range).
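For intuition, one PSGD step with the scaling of Eq. (6) can be sketched in NumPy as follows. This is a minimal illustration of ours, not the paper's implementation: the constant factor of the scale is absorbed into the learning rate, and the grid, learning rate and epsilon values are arbitrary:

```python
import numpy as np

def psgd_step(w, grad, grid, lr, eps=1e-6):
    """One position-based scaled gradient step.

    Each weight's gradient is scaled by s(w) = |w - t_w| + eps,
    where t_w is the nearest point of `grid` (the target positions).
    """
    # For every weight, pick the closest grid point as its target t_w.
    t_w = grid[np.abs(w[:, None] - grid[None, :]).argmin(axis=1)]
    scale = np.abs(w - t_w) + eps
    return w - lr * scale * grad

# Toy example: three 2-bit-style grid points.
grid = np.array([-0.5, 0.0, 0.5])
w = np.array([0.49, 0.02, -0.30])
grad = np.array([1.0, 1.0, 1.0])  # identical raw gradients for comparison

w_new = psgd_step(w, grad, grid, lr=0.1)
# Weights already close to a grid point (0.49, 0.02) barely move,
# while the weight far from any grid point (-0.30) moves the most.
```

Even with identical raw gradients, the update magnitude differs per weight according to its distance from the nearest target, which is what drives the weights to merge at the grid points.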
3.3 Target points
Quantization: In this paper, we use the uniform symmetric quantization method krishnamoorthi2018quantizing and the per-tensor quantization scheme for hardware friendliness. Consider a floating-point range $[w_{\min}, w_{\max}]$ of model weights. A weight $w$ is quantized to an integer in the range $[-2^{b-1}, 2^{b-1}-1]$ for $b$-bit precision. Quantization-dequantization for the weights of a network is defined with a step size $\Delta$ and clipping values. The overall quantization process is as follows:

$w_q = \mathrm{clip}\big(\lfloor w / \Delta \rceil,\ -2^{b-1},\ 2^{b-1}-1\big), \quad \Delta = \frac{\max(|w_{\min}|, |w_{\max}|)}{2^{b-1}-1}$, (8)

where $\lfloor \cdot \rceil$ is the round-to-the-closest-integer operation. We obtain the quantized weights via the dequantization process as $\hat{w} = \Delta \cdot w_q$ and use these quantized weights as the target positions for quantization.
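The quantize-dequantize procedure of Eq. (8) that produces the target positions can be sketched as follows (a NumPy illustration under our own naming; per-tensor uniform symmetric quantization with the step size taken from the weight range):

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Uniform symmetric per-tensor quantization followed by dequantization.

    Returns the grid values w_hat used as PSG target positions.
    """
    q_max = 2 ** (bits - 1) - 1
    # Step size from the largest absolute clipping value (Eq. 8).
    delta = max(abs(w.min()), abs(w.max())) / q_max
    # Round to nearest integer code and clip to the b-bit signed range.
    w_q = np.clip(np.round(w / delta), -q_max - 1, q_max)
    return w_q * delta  # dequantized targets on the uniform grid

w = np.array([-0.8, -0.1, 0.05, 0.4, 0.8])
w_hat = quantize_dequantize(w, bits=4)
# Every target lies on a uniform grid of step `delta`, and the
# quantization error of unclipped weights is bounded by delta / 2.
```

The returned `w_hat` values are exactly the grid points toward which PSG scales the gradients.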
Sparse Training: For magnitude-based pruning methods, weights near zero are removed. Therefore, we choose zero as the target value (i.e., $t_w = 0$).
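With the target fixed at zero, the PSG scale becomes $|w| + \epsilon$, so most weights settle near zero during training and magnitude-based pruning afterwards removes little information. A minimal sketch of the subsequent magnitude pruning step (our own illustration; the threshold rule is a generic one, not a detail from the paper):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest magnitude serves as the pruning threshold.
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

# After PSG training toward t_w = 0, small-magnitude weights cluster near zero.
w = np.array([0.001, -0.002, 0.5, -0.4, 0.003, 0.25])
w_pruned = magnitude_prune(w, sparsity=0.5)
```

Because pruning happens once after training, no pruning schedule or fine-tuning is needed, matching the single-shot setting used in the experiments.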
3.4 PSGD for deep networks
Much of the literature on the optimization of DNNs with stochastic gradient descent (SGD) has reported that multiple experiments give consistently similar performance although DNNs have many local minima (e.g. see Sec. 2 of ChaudhariCSL17 ). choromanska2015loss analyzed the loss surface of DNNs and showed that large networks have many local minima with similar performance on the test set, and that the lowest critical values of the random loss function are located in a specific band lower-bounded by the global minimum. From this perspective, we explain informally how PSGD for deep networks works. As illustrated in Fig. 2, we posit that there exist many local minima (e.g. $\hat{w}_1$, $\hat{w}_2$) in the original weight space with similar performance, only some of which ($\hat{w}_2$) are close to one of the target points, exhibiting high performance also in the compressed domain. As in Fig. 2 (left), assume that the region of convergence for $\hat{w}_1$ is much wider than that of $\hat{w}_2$, meaning that from random initialization there is more chance of reaching the solution $\hat{w}_1$ than $\hat{w}_2$. By the warping function specially designed as described above (Eq. 5), the original space is warped such that the areas near target points are expanded while those far from the targets are contracted. If we apply gradient descent in this warped space, the loss function will have a better chance of converging to $f(\hat{w}_2)$. Correspondingly, PSGD in the original space will more likely output $\hat{w}_2$ rather than $\hat{w}_1$, which is favorable for compression. Note that $f$ transforms the original weight space to the warped space, not to the compressed domain.
4 Experiments
In this section, we experimentally show the effectiveness of PSGD. To verify our method, we first conduct experiments on sparse training by setting the target point to 0, and then extend our method to quantization on the CIFAR krizhevsky2009learning and ImageNet ILSVRC 2015 ILSVRC15 datasets. We first demonstrate the effectiveness in sparse training with magnitude-based pruning, comparing with L0 regularization louizos2018learning and SNIP lee2018snip . louizos2018learning penalizes non-zero model parameters and shares our scheme of regularizing the model while training. Like ours, lee2018snip is a single-shot pruning method that requires neither pruning schedules nor additional fine-tuning.
For quantization, we compare our method with (1) methods that employ regularization at the initial training phase alizadeh2020gradient ; gulrajani2017improved ; lin2018defensive . We choose the gradient L1-norm regularization method alizadeh2020gradient and the Lipschitz regularization methods lin2018defensive ; gulrajani2017improved from the original paper alizadeh2020gradient as baselines, because they propose new regularization techniques used at the training phase, similar to us. Note that gulrajani2017improved adds an L2 penalty term on the gradient of the weights instead of the L1 penalty of alizadeh2020gradient . We also compare with (2) existing state-of-the-art layer-wise post-training quantization methods that start from pretrained models nagel2019data ; zhao2019improving , to show the improvement at lower bits (4 bits). Refer to Section 2 for details on the compared methods. To validate the effectiveness of our method, we also train our model for extremely low-bit (2, 3 bits) weights. Lastly, we show experimental results on various network architectures with PSGD and on applying PSG to the Adam optimizer kingma2014adam , which are detailed in the supplementary material.
Implementation details
We used the PyTorch framework for all experiments. For the sparse training experiment of Table 3, we used ResNet-32 he2016deep on CIFAR-100, following the training hyperparameters of kim2019qkd . We used the released official implementation of louizos2018learning and re-implemented lee2018snip for the PyTorch framework. For the quantization experiments of Tables 2 and 4, we used ResNet-18 and followed the settings of alizadeh2020gradient for CIFAR-10 and ImageNet. For zhao2019improving , the released official implementation was used. All other numbers are either from the original papers or re-implemented. For fair comparison, all quantization experiments followed layer-wise uniform symmetric quantization krishnamoorthi2018quantizing , and when quantizing the activations, we clipped the activation range using batch normalization parameters as described in nagel2019data , same as alizadeh2020gradient . PSGD is applied during the last 15 epochs for ImageNet experiments and from the first learning-rate decay epoch for CIFAR experiments. We use an additional 30 epochs for PSGD in the very low-bit experiments (Table 4). Also, we tuned the hyperparameter for each bit-width and sparsity ratio. Our search criterion is ensuring that the performance of the uncompressed model is not degraded, similar to alizadeh2020gradient . More details are in the supplementary material.

4.1 Sparse Training
As a preliminary experiment, we first demonstrate that PSG-based optimization is possible with a single target point set at zero. Then, we apply magnitude-based pruning following han2015learning across different sparsity ratios. As the purpose of the experiment is to verify that the weights are centered on zero, weights are pruned once after training has completed and the model is evaluated without fine-tuning for louizos2018learning and ours. Results for lee2018snip , which prunes the weights in a single shot at initialization, are shown for comparison on single-shot pruning.
Table 1 indicates that our method outperforms the two methods across various high sparsity ratios. While all three methods are able to maintain accuracy at low sparsity (10%), louizos2018learning shows some accuracy degradation at 20% and suffers severely at high sparsity. This is in line with the results in Gale2019TheSO , which found the method unable to produce sparse residual models without significant damage to model quality. Compared with lee2018snip , our method maintains higher accuracy even at high sparsity, displaying its strength in single-shot pruning, where no pruning schedules or additional training are necessary. Fig. 3 shows the distribution of weights in the SGD- and PSGD-trained models.
4.2 Quantization
In the quantization domain, we first compare PSGD with regularization methods on the on-the-fly bit-width problem, in which a single model is evaluated across various bit-widths. Then, we compare with existing state-of-the-art layer-wise symmetric post-training methods to verify that our method handles the accuracy drop at low bits caused by the differences in weight distributions (see Fig. 1).
Regularization methods Table 2 shows the results of regularization methods on the CIFAR-10 and ImageNet datasets. In the CIFAR-10 experiments of Table 2, we fix the activation bit-width to 4 bits and vary the weight bit-width from 8 to 4. For the ImageNet experiments of Table 2, we use equal bit-widths for both weights and activations, following alizadeh2020gradient . In the CIFAR-10 experiment, all methods seem to maintain the performance of the quantized model down to 4-bit quantization. Regardless of the target bit-width, PSGD outperforms all other regularization methods.
On the other hand, the ImageNet experiment generally shows reasonable results down to 6 bits, but accuracy drops drastically at 4 bits. PSGD targeting 8 bits and 6 bits marginally improves on all bits, yet also experiences a drastic accuracy drop at 4 bits. In contrast, Gradient L1 and PSGD @ W4 maintain the performance of the quantized models even at 4 bits. Compared with the second-best method, Gradient L1 alizadeh2020gradient , PSGD outperforms it at all bit-widths. At full precision (FP), 8, 6 and 4 bits, the performance gaps between alizadeh2020gradient and ours are about 4.2%, 3.8%, 1.8% and 8%, respectively. From Table 2, while quantization noise may slightly degrade accuracy in some cases, the general trend that using more bits leads to higher accuracy holds. Consequently, as our 4-bit-targeted model (PSGD @ W4) achieves accuracy comparable to all higher-bit models of the other methods, PSGD outperforms the conventional methods at all bit-widths. Compared to other regularization methods, PSGD maintains reasonable performance across all bits by constraining the distribution of the full-precision weights to resemble that of the quantized weights. This quantization-friendliness is achieved by the appropriately designed scaling function. Also, unlike alizadeh2020gradient , PSGD does not need the additional overhead of double backpropagation.
Post-training methods Table 4 shows that OCS, a state-of-the-art post-training method, has a drastic accuracy drop at 4 bits. For OCS, following the original paper, we chose the best clipping method for both weights and activations. DFQ has a similar tendency, showing a drastic accuracy drop below 6 bits, as depicted in Fig. 1 of the original DFQ paper nagel2019data . This is due to the fundamental discrepancy between the FP and quantized weight distributions, as stated in Sec. 1 and Fig. 1. On the other hand, models trained with PSGD have similar full-precision and quantized weight distributions and hence low quantization error due to the scaling function. Our method outperforms OCS at 4 bits by around 19% without any post-training.
Very low-bit quantization As shown in Fig. 1, SGD suffers drastic accuracy drops at very low bits such as 3 and 2 bits. To confirm that PSGD can handle very low bits, we conduct experiments with PSGD targeting 3 and 2 bits, except the first and last layers, which are quantized at 8 bits. Table 4 shows the results. Although the FP accuracy of PSGD drops because of the strong constraints imposed by very low-bit targets, PSGD works well at very low-bit quantization. This shows that PSGD can be a key solution for post-training quantization at very low bits.
5 Discussion
In this section, we focus on the local minima found by PSG, using a toy example to better understand it. We train with SGD and with PSGD targeting 2 bits on the MNIST dataset, using a fully-connected network with two hidden layers (50 and 20 neurons). In this toy example, we quantize only the weights, not the activations. We show the weight distributions of the two models at the first layer. Then, we calculate the eigenvalues of the entire Hessian matrix to analyze the curvature of the local loss surface.
Quantized and sparse model SGD generally yields a bell-shaped distribution of weights, which is not amenable to low-bit quantization zhao2019improving . On the other hand, PSGD always provides a multi-modal distribution peaked at the quantized values. In this example, the number of bins is three (2 bits), so the weights are merged into three clusters, as depicted in Fig. 4a. A large proportion of the weights are near zero, similar to Fig. 3. This is because symmetric quantization also contains zero as a target bin. PSGD has nearly the same accuracy at 2 bits as at FP (96%). However, the accuracy of SGD at 2 bits is about 9%, although its FP accuracy is 97%. This tendency is also shown in Fig. 1b, which demonstrates that PSGD reduces the quantization error.
Curvature of the PSGD solution As explained in Sec. 3.1, PSGD, which is equivalent to running SGD in the warped space, can be used to find compression-friendly minima. In Sec. 3.4 and Fig. 2, we claimed that PSG finds a minimum in a sharp valley that is more compression-friendly but has a lower chance of being found. As the curvature in the direction of a Hessian eigenvector is determined by the corresponding eigenvalue Goodfellowetal2016 , we compare the curvature of the solutions yielded by SGD and PSGD by assessing the magnitudes of the eigenvalues, similar to chaudhari2019entropy . SGD provides minima with relatively wide valleys because it has many near-zero eigenvalues; a similar tendency is observed in chaudhari2019entropy . However, the weights trained by PSGD have many more large positive eigenvalues, which means the solution lies in a relatively sharp valley compared to SGD. Specifically, the number of large eigenvalues in PSGD is 9 times that of SGD. From this toy example, we confirm that PSG helps to find minima that are more compression-friendly (Fig. 4a) and lie in sharp valleys (Fig. 4b) that are hard to reach by normal SGD.

6 Conclusion
In this work, we introduce the position-based scaled gradient (PSG), which scales the gradient in proportion to the distance between the current weight and its nearest target point. We prove that stochastic PSG descent (PSGD) is equivalent to applying SGD in a warped space. Based on the hypothesis that DNNs have many local minima with similar performance on the test set, PSGD is able to find a compression-friendly minimum that is hard to reach with other optimizers. PSGD can be a key solution for low-bit post-training quantization because it reduces the quantization error, meaning that the distributions of the compressed and uncompressed weights are similar. Because the target points act as a prior that constrains the original weights to merge at specific positions, PSGD can also be used for sparse training by simply setting the target point to 0. In our experiments, we verify PSGD in the domains of sparse training and quantization, showing its effectiveness on various image classification datasets such as CIFAR-10/100 and ImageNet. We also empirically show that PSGD finds minima located in sharper valleys than those of SGD. We believe that PSGD will help further research in model quantization and sparse training.
References
 (1) Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient regularization for quantization robustness. In International Conference on Learning Representations, 2020.
 (2) Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4bit quantization of convolutional networks for rapiddeployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.

 (3) Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
 (4) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
 (5) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
 (6) Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
 (7) William C Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1–17, 1991.
 (8) John E Dennis, Jr and Jorge J Moré. Quasi-Newton methods, motivation and theory. SIAM review, 19(1):46–89, 1977.
 (9) Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019.
 (10) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 (11) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
 (12) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.

 (13) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 (14) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 (15) Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
 (16) Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
 (17) Jangho Kim, Yash Bhalgat, Jinwon Lee, Chirag Patel, and Nojun Kwak. QKD: Quantization-aware knowledge distillation. arXiv preprint arXiv:1911.12491, 2019.
 (18) Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pages 2760–2769, 2018.
 (19) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (20) Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
 (21) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 (22) Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
 (23) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 (24) Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019.
 (25) Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations, 2018.
 (26) Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Datafree quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1325–1334, 2019.
 (27) Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
 (28) Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
 (29) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 (30) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
 (31) Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. International Conference on Machine Learning (ICML), pages 7543–7552, June 2019.
 (32) Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.