1 Introduction
The booming development of deep learning models and applications has enabled beyond-human performance in tasks like large-scale image classification
[18, 9, 12, 13], object detection [27, 22, 8], and semantic segmentation [24, 3]. Such high performance, however, comes at the high price of large memory consumption and computation load. For example, a ResNet50 model needs billions of floating-point operations (FLOPs) to classify a single color image. The computation load can easily expand to tens or even hundreds of GFLOPs for detection or segmentation models using state-of-the-art DNNs as backbones [2]. This is a major challenge that prevents the deployment of modern DNN models on resource-constrained platforms, such as phones, smart sensors, and drones.

Model compression techniques for DNN models, including element-wise pruning [7, 21, 38], structural pruning [33, 25, 20], quantization [23, 31], and factorization [15, 39, 36, 35], have been extensively studied. Among these methods, quantization and element-wise pruning can effectively reduce a model's memory consumption, but require specific hardware to realize efficient computation. Structural pruning reduces the computation load by removing redundant filters or channels. However, the complicated structures adopted in some modern DNNs (e.g., ResNet or DenseNet) enforce strict constraints on the input/output dimensions of certain layers. This requires additional filter grouping during the pruning and filter rearranging after the pruning to make the pruned structure valid [32, 5]. Factorization methods approximate the weight matrix of a layer with a multiplication of two or more low-rank matrices. They by nature keep the input/output dimensions of a layer unchanged, and therefore the resulting decomposed network can be supported by any common DNN computation architecture, without additional grouping and post-processing.
Previous investigations show that it is feasible to approximate the weight matrices of a pretrained DNN model with the multiplication of low-rank matrices [15, 39, 26, 4, 19]. But these methods may greatly degrade the performance, even after post-hoc fine-tuning. Some other methods attempt to manipulate the "directions" of filters to implicitly reduce the rank of weight matrices [34, 20]. However, the difficulties in training and the implicitness of the rank representation prevent these methods from reaching a high compression rate. The nuclear norm regularizer has been used to directly reduce the rank of weight matrices [1, 35]. Optimizing the nuclear norm requires conducting singular value decomposition (SVD) on every training step, which is inefficient, especially when dealing with larger models.
Our work aims to explicitly achieve a low-rank DNN during training without applying SVD on every step. In particular, we propose SVD training, which trains the weight matrix of each layer in the form of its full-rank SVD. The weight matrix is decomposed into the matrices of left-singular vectors, singular values and right-singular vectors, and the training is done on the decomposed variables. Furthermore, two techniques are proposed to induce low rank while maintaining high performance during the SVD training: (1) singular vector orthogonality regularization, which keeps the singular vector matrices close to unitary throughout the training. It mitigates gradient vanishing/exploding during the training, and provides a valid form of SVD to guarantee effective rank reduction. (2) Singular value sparsification, which applies sparsity-inducing regularizers to the singular values during the training to induce low rank. The low-rank model is finally achieved through singular value pruning. We evaluate the individual contribution of each technique as well as the overall performance when putting them together via ablation studies. Results show that the proposed method consistently beats state-of-the-art factorization and structural pruning methods on various tasks and model structures. To the best of our knowledge, this is the first algorithm to explicitly search for the optimal rank of each DNN layer during training without performing the decomposition operation at each training step.
2 Related Works on Low-rank DNNs
Approximating a weight matrix with the multiplication of low-rank matrices is a straightforward idea for compressing DNNs. Early works in this field focus on designing the matrix and tensor decomposition scheme, especially for the 4D tensor of a convolution kernel, so that the operation of a pretrained network layer can be closely approximated with cascaded low-rank layers
[15, 39, 26, 4, 19]. Tensor decomposition techniques, notably CP-decomposition, are applied in early works to directly decompose the 4D convolution kernel into 4 consecutive low-rank convolutions [19]. However, such decomposition techniques significantly increase the number of layers in the achieved network, making it harder to fine-tune towards good performance, especially when decomposing larger and deeper models [29]. Later works therefore propose to reshape the 4D tensor into a 2D matrix, apply a matrix decomposition technique like SVD to decompose the matrix, and finally reshape the factors back to 4D tensors to get two consecutive layers. Notably, Zhang et al. [39] propose channel-wise decomposition, which uses SVD to decompose a convolution layer with a w × h kernel into two consecutive layers with kernel sizes w × h and 1 × 1, respectively. Computation reduction can be achieved by exploiting the channel-wise redundancy, e.g., channels with smaller singular values in both decomposed layers are removed. Similarly, Jaderberg et al. [15] propose to decompose a convolution layer into two consecutive layers with fewer channels in between. They further utilize the spatial-wise redundancy to reduce the convolution kernels in the decomposed layers to w × 1 and 1 × h, respectively. These methods provide a closed-form decomposition for each layer. If the SVD is done in full rank, these methods guarantee that the decomposed layers perform the same operation as the original layer. However, the weights of the pretrained model may not be low-rank by nature, so the low rank manually imposed after decomposition by removing small singular values inevitably leads to high accuracy loss as the compression ratio increases [35].

Methods have been proposed to reduce the ranks of weight matrices during the training process in order to achieve low-rank decomposition with low accuracy loss. Wen et al.
[34] induce low rank by applying an "attractive force" regularizer to increase the correlation of different filters in a certain layer. Ding et al. [5] achieve a similar goal by optimizing with "centripetal SGD," which moves multiple filters towards a set of clustering centers. Both methods can reduce the rank of the weight matrices without performing an actual low-rank decomposition during training. However, the rank representations in these methods are implicit, so the regularization effects are weak and may lead to a sharp performance decrease when seeking a high speedup. On the other hand, Alvarez et al. [1] and Xu et al. [35] explicitly estimate and reduce the rank throughout the training by adding a nuclear norm (defined as the sum of all singular values) regularizer to the training objective. These methods require performing SVD to compute and optimize the nuclear norm of each layer on every optimization step. Since the complexity of the SVD operation on an m × n matrix is O(mn · min(m, n)) and the gradient computation through SVD is not straightforward [6], performing SVD on every step is time-consuming.

To explicitly achieve a low-rank network without performing a costly decomposition on each training step, Tai et al. [29] propose to directly train the network from scratch in the low-rank decomposed form, and add batch normalization [14] between the decomposed layers to tackle the potential gradient vanishing or exploding problem caused by the doubling of layers after decomposition. However, the low-rank decomposed training scheme used in this line of works requires setting the rank of each layer before training [29]. The manually chosen low rank may not lead to the optimal compression. Also, training the low-rank model from scratch makes the optimization harder, as lower rank implies lower model capacity [35].

3 Proposed Method
Building upon previous works, we combine the ideas of decomposed training and trained low rank in this work. As shown in Figure 1, the model is first trained in a decomposed form through full-rank SVD training, then undergoes singular value pruning for rank reduction, and is finally fine-tuned for further accuracy recovery. As we will explain in Section 3.1, the model is trained in the form of the spatial-wise [15] or channel-wise decomposition [39] to avoid the time-consuming SVD. Unlike the training procedure proposed by [29], we train the decomposed model at its full rank to preserve the model capacity. During the SVD training, we apply orthogonality regularization to the singular vector matrices and sparsity-inducing regularizers to the singular values of each layer, the details of which are discussed in Sections 3.2 and 3.3, respectively. Section 3.4 elaborates the full objective of the SVD training and the overall model compression pipeline. This method is able to achieve an optimal compression rate by inducing low rank through training, without the need to perform a decomposition on every training step.
3.1 SVD training of deep neural networks
In this work, we propose to train the neural network in its singular value decomposition form, where each layer is decomposed into two consecutive layers with no additional operations in between. For a fully connected layer, the weight is a 2D matrix W with dimension m × n. Following the form of SVD, W can be directly decomposed into three variables as W = U diag(s) Vᵀ, with dimensions m × r for U, length r for the singular value vector s, and n × r for V. Both U and V shall be unitary (column-orthogonal) matrices. In the full-rank setting where r = min(m, n), W can be exactly reconstructed as U diag(s) Vᵀ. For a neural network, this is equivalent to decomposing a layer with weight W into two consecutive layers with weights Vᵀ and U diag(s), respectively.
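As a concrete check of this equivalence, the sketch below (NumPy, with hypothetical layer dimensions) builds the full-rank SVD form of a fully connected weight and verifies that the cascaded form produces the same output as the original layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fully connected layer: weight W of dimension m x n.
m, n = 64, 32
W = rng.standard_normal((m, n))

# Trainable variables of the SVD form: U (m x r), s (r,), V (n x r),
# initialized here from the full-rank SVD of W (r = min(m, n)).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T

# Full-rank reconstruction: W = U diag(s) V^T.
assert np.allclose((U * s) @ V.T, W)

# The single layer y = W x is equivalent to two consecutive layers:
# first V^T (n -> r), then U diag(s) (r -> m).
x = rng.standard_normal(n)
y_original = W @ x
y_decomposed = (U * s) @ (V.T @ x)
assert np.allclose(y_original, y_decomposed)
```

During SVD training, U, s and V themselves would be the trainable variables rather than being recomputed from W.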
For a convolution layer, the kernel can be represented as a 4D tensor K with dimension n × c × w × h. Here n, c, w and h represent the number of filters, the number of input channels, the width and the height of the filter, respectively. This work mainly focuses on the channel-wise decomposition method [39] and the spatial-wise decomposition method [15] to decompose the convolution layer, as these methods have shown their effectiveness in previous CNN decomposition research. For channel-wise decomposition, K is first reshaped to a 2D matrix of shape n × cwh, which is then decomposed with SVD into U (n × r), s (length r) and V (cwh × r), where U and V are unitary (column-orthogonal) matrices and r = min(n, cwh). The original convolution layer is therefore decomposed into two consecutive layers, with an r × c × w × h kernel reshaped from Vᵀ and an n × r × 1 × 1 kernel reshaped from U diag(s). Spatial-wise decomposition shares a similar process with the channel-wise decomposition. The major difference is that K is now reshaped to an nh × cw matrix and then decomposed into U (nh × r), s (length r), and V (cw × r) with r = min(nh, cw). The resulting decomposed layers have r × c × w × 1 and n × r × 1 × h kernels, respectively. [39] and [15] theoretically show that the decomposed layers can exactly replicate the function of the original convolution layer in the full-rank setting. Therefore training the decomposed model at full rank should achieve a similar accuracy as training the original model.
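The reshaping logic above can be sketched as follows for the channel-wise case. The dimensions are hypothetical, and the SVD here only plays the role of producing the decomposed variables; in the full-rank setting the cascade of the two reshaped kernels reproduces the original kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4D convolution kernel: n filters, c input channels,
# spatial size w x h.
n_f, c, w, h = 16, 8, 3, 3
K = rng.standard_normal((n_f, c, w, h))

# Channel-wise decomposition [39]: reshape the kernel to a 2D matrix
# of shape (n, c*w*h) and apply SVD.
K2d = K.reshape(n_f, c * w * h)
U, s, Vt = np.linalg.svd(K2d, full_matrices=False)
r = s.size  # full rank = min(n, c*w*h)

# First decomposed layer: a w x h convolution with r filters,
# reshaped from V^T. Second layer: a 1 x 1 convolution with n filters,
# reshaped from U diag(s).
K_first = Vt.reshape(r, c, w, h)          # (r, c, w, h)
K_second = (U * s).reshape(n_f, r, 1, 1)  # (n, r, 1, 1)

# In the full-rank setting the cascade reproduces the original kernel.
K_rebuilt = ((U * s) @ Vt).reshape(n_f, c, w, h)
assert np.allclose(K_rebuilt, K)
```

The spatial-wise case follows the same pattern with the nh × cw reshape instead.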
During the SVD training, for each layer we use the variables from the decomposition, i.e., U, s and V, instead of the original kernel or weight as the trainable variables in the network. The forward pass is executed by converting U, s and V into the form of the two consecutive layers as demonstrated above, and the back propagation and optimization are done directly with respect to the U, s and V of each layer. In this way, we can access the singular values directly without performing the time-consuming SVD on each step.
Note that U and V need to be orthogonal to guarantee that the low-rank approximation can be done by removing small singular values, but this is not naturally induced by the decomposed training process. Therefore we add orthogonality regularization to U and V to tackle this problem, as discussed in Section 3.2. Rank reduction is induced by adding sparsity-inducing regularizers to the s of each layer, which is discussed in Section 3.3.
3.2 Singular vector orthogonality regularizer
In a standard SVD procedure, the resulting U and V are orthogonal by construction, which provides the theoretical guarantee for the low-rank approximation. However, U and V in each layer are treated as free trainable variables in the decomposed training process, so the orthogonality may not hold. Without the orthogonality property, it is unsafe to prune a singular value in s even if it reaches a small value, because the corresponding singular vectors in U and V may have high energy and lead to a large difference in the result.
To make the form of SVD valid and enable effective rank reduction via singular value pruning, we introduce an orthogonality regularization loss on U and V as:

L_o(U, V) = (1/r) (‖UᵀU − I‖_F² + ‖VᵀV − I‖_F²),    (1)

where ‖·‖_F is the Frobenius norm of a matrix and r is the rank of U and V. Note that the ranks of U and V are the same given their definition in the decomposed training procedure. Adding the orthogonality loss in Equation (1) to the total loss function forces the Us and Vs of all the layers to stay close to orthogonal matrices.

Beyond maintaining a valid SVD form, the orthogonality regularization also brings an additional benefit to the performance of the decomposed training process. The decomposed training converts one layer in the original network into two consecutive layers, and therefore doubles the number of layers. As mentioned in [29], this may worsen the problem of exploding or vanishing gradients during the optimization, degrading the performance of the achieved model. Since the proposed orthogonality loss keeps all the columns of U and V close to unit norm, it can effectively prevent the gradient from exploding or vanishing when passing through the variables U and V, therefore helping the training process to achieve a higher accuracy. The accuracy gain brought by training with the orthogonality loss is discussed in our ablation study in Section 4.1.
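A minimal sketch of such an orthogonality loss is given below. The normalization by the rank r is our assumption about the exact form of Equation (1); the qualitative behavior (zero loss for orthogonal matrices, large loss otherwise) does not depend on that choice:

```python
import numpy as np

def orthogonality_loss(U, V):
    """Penalize deviation of U and V from column-orthogonality.
    Normalization by the rank r is an assumed scaling."""
    r = U.shape[1]
    I = np.eye(r)
    return (np.linalg.norm(U.T @ U - I, "fro") ** 2
            + np.linalg.norm(V.T @ V - I, "fro") ** 2) / r

rng = np.random.default_rng(0)
# Exactly column-orthogonal singular vectors give (numerically) zero loss...
Q, _ = np.linalg.qr(rng.standard_normal((64, 16)))
P, _ = np.linalg.qr(rng.standard_normal((32, 16)))
assert orthogonality_loss(Q, P) < 1e-20
# ...while unconstrained random matrices are heavily penalized.
assert orthogonality_loss(rng.standard_normal((64, 16)),
                          rng.standard_normal((32, 16))) > 1.0
```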
3.3 Singular value sparsity-inducing regularizer
With orthogonal singular vector matrices, reducing the rank of the decomposed network is equivalent to making the singular value vector s of each layer sparse. Although the sparsity of a vector is directly represented by its ℓ0 norm, it is hard to optimize the ℓ0 norm through gradient-based methods. Inspired by recent works in DNN pruning [21, 33], we use differentiable sparsity-inducing regularizers to push more elements of s towards zero, and apply post-train pruning to make the singular value vector sparse.
For the choice of the sparsity-inducing regularizer, the ℓ1 norm has been commonly applied in feature selection [30] and DNN pruning [33]. The ℓ1 regularizer takes the form ‖s‖_1 = Σ_i |s_i|, which is both almost-everywhere differentiable and convex, making it friendly for optimization. Moreover, applying the ℓ1 regularizer to the singular values is equivalent to regularizing with the nuclear norm of the original weight matrix, which is a popular approximation to the rank of a matrix [35].

However, the ℓ1 norm is proportional to the scaling of the parameters, i.e., ‖αs‖_1 = α‖s‖_1 for a nonnegative constant α. Therefore, minimizing the ℓ1 norm of s shrinks all the singular values simultaneously. In such a situation, some singular values that are close to zero after training may still contain a large portion of the matrix's energy. Pruning such singular values may undermine the performance of the neural network.
To mitigate the proportional scaling problem of the ℓ1 regularizer, previous works in compressed sensing have used the Hoyer regularizer to induce sparsity in solving nonnegative matrix factorization [11] and blind deconvolution [16], where the Hoyer regularizer shows superior performance compared to other methods. The Hoyer regularizer is formulated as

L_H(s) = ‖s‖_1 / ‖s‖_2,    (2)

which is the ratio of the ℓ1 norm and the ℓ2 norm of a vector [16]. It can be easily seen that the Hoyer regularizer is almost-everywhere differentiable and scale-invariant. The differentiable property implies that the Hoyer regularizer can be easily optimized as part of the objective function. The scale-invariant property means that if we apply the Hoyer regularizer to s, the total energy is retained as the singular values get sparser. Therefore most of the energy is kept within the top singular values while the rest get close to zero. This makes the Hoyer regularizer attractive for our training process. The effectiveness of the ℓ1 regularizer and the Hoyer regularizer is explored and compared in Section 4.3.
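The contrast between the two regularizers can be checked numerically; the example vectors below are illustrative only:

```python
import numpy as np

def l1(s):
    # l1 norm: sum of absolute singular values.
    return np.abs(s).sum()

def hoyer(s):
    # Hoyer regularizer: ratio of the l1 norm to the l2 norm.
    return np.abs(s).sum() / np.sqrt((s ** 2).sum())

s = np.array([3.0, 1.0, 0.5, 0.1])

# The l1 norm scales with the parameters: shrinking everything
# reduces it without making the vector any sparser.
assert np.isclose(l1(0.5 * s), 0.5 * l1(s))

# The Hoyer regularizer is scale-invariant...
assert np.isclose(hoyer(0.5 * s), hoyer(s))

# ...and decreases only as the vector becomes genuinely sparser.
sparser = np.array([3.0, 1.0, 0.0, 0.0])
assert hoyer(sparser) < hoyer(s)
```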
3.4 Overall objective and training procedure
With the analysis above, we propose the overall objective function of the decomposed training as:

L = L_T + λ_o Σ_l L_o(U_l, V_l) + λ_s Σ_l L_s(s_l).    (3)

Here L_T is the training loss computed on the model with decomposed layers. L_o denotes the orthogonality loss provided in Equation (1), which is calculated on the singular vector matrices U_l and V_l of layer l and summed over all layers. L_s is the sparsity-inducing regularization loss, applied to the vector of singular values s_l of each layer. We explore the use of both the ℓ1 regularizer and the Hoyer regularizer (Equation (2)) as L_s in this work. λ_s and λ_o are the decay parameters for the sparsity-inducing regularization loss and the orthogonality loss respectively, which are hyperparameters of the proposed training process. λ_o can be chosen as a large positive number to enforce the orthogonality of the singular vectors, and λ_s can be modified to explore the tradeoff between accuracy and FLOPs of the achieved low-rank model.

As shown in Figure 1, the low-rank decomposed network is achieved through a three-stage process of full-rank SVD training, singular value pruning and low-rank fine-tuning. First we train a full-rank decomposed network using the objective function in Equation (3). Training at full rank enables the decomposed model to easily reach the performance of the original model, as there is no capacity loss during the full-rank decomposition. With the help of the sparsity-inducing regularizer, most of the singular values will be close to zero after the full-rank training process. Inspired by [35]'s work, we prune the singular values using an energy-based threshold. For each layer we find the set P with the largest number of singular values subject to:

(Σ_{i∈P} s_i²) / (Σ_{i=1}^{r} s_i²) ≤ e,    (4)

where e is a predefined energy threshold. We use the same threshold e for all the layers in our experiments. When e is small enough, the singular values in set P and the corresponding singular vectors can be removed safely with negligible performance loss. The pruning step dramatically reduces the rank of the decomposed layers. For a convolution layer with an n × c × w × h kernel, if we can reduce the rank of the decomposed layers to r', the number of FLOPs for the convolution is reduced by a factor of ncwh / (r'(cwh + n)) or ncwh / (r'(cw + nh)) when channel-wise or spatial-wise decomposition is applied, respectively. The resulting low-rank model is then fine-tuned with λ_s set to zero for further performance recovery.
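The energy-based pruning rule of Equation (4) can be sketched as follows; the helper name and the example singular values are hypothetical:

```python
import numpy as np

def prune_singular_values(s, e=0.01):
    """Remove the largest set of singular values whose summed energy
    fraction stays below the threshold e; return the kept indices."""
    order = np.argsort(s ** 2)  # smallest-energy values first
    energy = np.cumsum(s[order] ** 2) / (s ** 2).sum()
    # Number of values that can be pruned while the pruned
    # energy fraction remains <= e.
    num_pruned = np.searchsorted(energy, e, side="right")
    return np.sort(order[num_pruned:])

# After SVD training with a sparsity-inducing regularizer,
# most singular values are near zero and get pruned away.
s = np.array([5.0, 3.0, 0.05, 0.02, 0.01])
kept = prune_singular_values(s, e=0.01)
assert kept.tolist() == [0, 1]
```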
4 Experiment Results
In this section, we first perform ablation studies on the effectiveness of each part of our training procedure using ResNet models [9] on the CIFAR10 dataset [17]. We then apply the proposed decomposed training method to various DNN models on the CIFAR10 dataset and the ImageNet ILSVRC2012 dataset [28]. The training hyperparameters for these models can be found in Appendix A. Different hyperparameters are used to explore the accuracy vs. FLOPs tradeoff induced by the proposed method. Our results consistently stay above the Pareto frontier of previous works.

4.1 Importance of the orthogonal constraints
Here we demonstrate the importance of adding the singular vector orthogonality loss to the decomposed training process. We separately train two decomposed models with the same optimizer and hyperparameters, one with the orthogonality loss strength λ_o = 1.0 and the other with λ_o = 0.0. No sparsity-inducing regularizer is applied to the singular values in this set of experiments. The experiments are conducted on ResNet56 and ResNet110 models, both trained under channel-wise decomposition and spatial-wise decomposition. The CIFAR10 dataset is used for training and testing. As shown in Table 1, the orthogonality loss enables the decomposed model to achieve similar or even better accuracy compared to that of the original full model. On the contrary, training the decomposed model without the orthogonality loss causes around 2% accuracy loss.
Model  λ_o  Accuracy (%)
ResNet56  N/A  93.14
ResNet56-Ch  1.0  93.28
ResNet56-Ch  0.0  91.28
ResNet56-Sp  1.0  93.36
ResNet56-Sp  0.0  90.70
Model  λ_o  Accuracy (%)
ResNet110  N/A  93.62
ResNet110-Ch  1.0  93.58
ResNet110-Ch  0.0  91.83
ResNet110-Sp  1.0  93.93
ResNet110-Sp  0.0  91.86
4.2 Comparison of decomposition methods
As mentioned in Section 3.1, we mainly consider the channel-wise and the spatial-wise decomposition methods in this work. In this section, we compare the accuracy vs. FLOPs tradeoff tendencies of the channel-wise decomposition and the spatial-wise decomposition. The tradeoff tendency of both decomposition methods is explored by training the decomposed model with Hoyer regularizers of different strengths (λ_s in Equation (3)) on the singular values. The results are shown in Figure 2. For shallower networks like the ResNet20 or ResNet32 models, the spatial-wise decomposition shows a large advantage over the channel-wise decomposition in the experiments, achieving a significantly higher compression rate at similar accuracy. However, with a deeper network like ResNet56 or ResNet110, the two decomposition methods perform similarly. As discussed in Section 3.1, spatial-wise decomposition can utilize both spatial-wise and channel-wise redundancy, while the channel-wise decomposition utilizes channel-wise redundancy only. The observations in this set of experiments indicate that as DNN models get deeper, the channel-wise redundancy becomes the dominant factor compared to the spatial-wise redundancy. This corresponds to the fact that deeper layers in modern DNNs typically have significantly more channels than shallower layers, resulting in significant channel-wise redundancy.
4.3 Comparison of sparsity-inducing regularizers
Under the same model decomposition scheme, the main factor governing the final compression rate and the performance of the compressed model is the choice of the sparsity-inducing regularizer for the singular values. As mentioned in Section 3.3, we mainly consider the ℓ1 regularizer and the Hoyer regularizer in the proposed training scheme. In this section, we use the spatial-wise decomposition setting to compare the effects of the ℓ1 regularizer and the Hoyer regularizer. A control group is also trained with no sparsity-inducing regularizer applied during the SVD training. The accuracy vs. FLOPs tradeoff is explored by changing the regularization strength and the singular value pruning threshold. All other hyperparameters are kept the same during the SVD training and fine-tuning processes for all models. Results are shown in Figure 3. The tradeoff curve of the ℓ1 regularizer consistently demonstrates a larger slope than that of the Hoyer regularizer. Under low accuracy loss, the Hoyer regularizer achieves a higher compression rate than the ℓ1 regularizer. However, if we are aiming for an extremely high compression rate while allowing higher accuracy loss, the ℓ1 regularizer can perform better. One possible reason for this difference is that the ℓ1 regularizer makes all the singular values small throughout the training process, while the Hoyer regularizer maintains the total energy of the singular values during the training, concentrating more energy in the larger singular values. Therefore more singular values can be removed from the decomposed model trained with the Hoyer regularizer without significantly hurting the performance of the model, resulting in a higher compression rate at low accuracy loss. But it is harder to keep most of the energy in a tiny number of singular values than to simply make everything closer to zero, so the ℓ1 regularizer may perform better in the case of extremely high speedup.
Compared to the control group with no sparsity-inducing regularization, both the ℓ1 regularizer and the Hoyer regularizer achieve higher accuracy under a similar compression rate, especially at high compression rates where the accuracy gap between training with and without a sparsity-inducing regularizer is more significant. Therefore applying a sparsity-inducing regularizer to the singular values is important for reaching a high-performance low-rank model, as the weights will not naturally reach low rank during training.
4.4 Effectiveness of the overall training procedure
To show that the "full-rank SVD training, singular value pruning and low-rank fine-tuning" framework proposed in Section 3.4 is essential for reaching a high-performance low-rank model, we take the model architecture of the low-rank models achieved from the proposed training procedure, reinitialize all the weights with random values, and train the low-rank model from scratch. For a fair comparison, the reinitialized low-rank model is trained using the same training objective and hyperparameter choices as the low-rank fine-tuning step in our framework. As shown in Table 2, with the same architecture and training process, training the low-rank model from scratch leads to around 2% testing accuracy loss compared to the accuracy achieved by the proposed training procedure. This result corresponds to the fact that low-rank models are harder to train from scratch due to their low capacity [35]. On the other hand, the full-rank SVD training step in our proposed framework provides sufficient capacity for the model to reach a high performance. Such high performance can still be preserved after singular value pruning, as the singular values are already sparse after the SVD training process.
Base Model  Training method  Accuracy (%)
ResNet20 (Speed Up: 3.26×)  Our method  91.39
  From scratch  89.43
ResNet32 (Speed Up: 3.93×)  Our method  91.76
  From scratch  90.55
ResNet56 (Speed Up: 3.75×)  Our method  93.27
  From scratch  91.55
ResNet110 (Speed Up: 6.42×)  Our method  93.47
  From scratch  91.03
4.5 Comparing with previous works
We apply the proposed SVD training framework to the ResNet20, ResNet32, ResNet56 and ResNet110 models on the CIFAR10 dataset, as well as the ResNet18 and ResNet50 models on the ImageNet ILSVRC2012 dataset, to compare the accuracy vs. FLOPs tradeoff with previous methods. Here we mainly compare our method with state-of-the-art low-rank decomposition methods including Jaderberg et al. [15], Zhang et al. [39], TRP [35] and C-SGD [5], as well as recent filter pruning methods like NISP [37], SFP [10] and CNN-FCF [20]. The results for different models are shown in Figure 4. As analyzed in Section 4.3, the spatial-wise decomposition achieves a significantly higher compression rate than the channel-wise decomposition in shallower networks, while similar performance can be achieved when compressing a deeper model. Thus we compare the results of only the spatial-wise decomposition against previous works for ResNet20 and ResNet32. For the other, deeper networks, we report the results for both channel-wise and spatial-wise decomposition. As most of the previous works focus on compressing the model with a small accuracy loss, here we use the Hoyer regularizer for the singular value sparsity, as it achieves a better compression rate than the ℓ1 norm under low accuracy loss (see Section 4.3). We use multiple strengths for the Hoyer regularizer to explore the accuracy vs. FLOPs tradeoff, in order to compare against previous works at different accuracy levels. As shown in Figure 4, our proposed method consistently achieves a higher FLOPs reduction with less accuracy loss compared to previous methods on different models and datasets. These comparison results show that the proposed SVD training and singular value pruning scheme can effectively compress modern deep neural networks through low-rank decomposition.
5 Conclusion
In this work, we propose the SVD training framework, which incorporates full-rank decomposed training, singular value pruning and low-rank fine-tuning to reach low-rank DNNs with minor accuracy loss. We decompose each DNN layer into its full-rank SVD form before the training and directly train the decomposed singular vectors and singular values, so we can keep an explicit measure of each layer's rank without performing SVD on each step. Orthogonality regularizers are applied to the singular vectors during the training to keep the decomposed layers in a valid SVD form, and sparsity-inducing regularizers are applied to the singular values to explicitly induce low-rank layers.
Thorough experiments are done to analyze each proposed technique. We demonstrate that the orthogonality regularization on the singular vectors is crucial to the performance of the decomposed training process. For decomposition methods, we find that the spatial-wise method performs better than the channel-wise method in shallower networks, while their performances are similar for deeper models. For the sparsity-inducing regularizer, we show that a higher compression rate can be achieved by the Hoyer regularizer compared to the ℓ1 regularizer under low accuracy loss. Our training framework is further justified by the observation that training the low-rank model from scratch cannot reach the accuracy achieved by our method. We apply the proposed method to ResNet models of various depths on both the CIFAR10 and ImageNet datasets, where our accuracy vs. FLOPs tradeoff consistently stays above the Pareto frontier of previous methods, including both factorization and structural pruning methods. These results show that this work provides an effective way of learning low-rank deep neural networks.
Acknowledgments
This work was supported in part by NSF-1910299, NSF-1822085, DOE DE-SC0018064, and NSF IUCRC-1725456, as well as support from Ergomotion, Inc.
References
 [1] (2017) Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pp. 856–867. Cited by: §1, §2.
 [2] (2016) An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678. Cited by: §1.
 [3] (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.
 [4] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §1, §2.

 [5] (2019) Centripetal SGD for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4943–4953. Cited by: Appendix B, §1, §2, §4.5.
 [6] (2008) An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation. An extended version of a paper that appeared in the proceedings of AD2008, the 5th International Conference on Automatic Differentiation. Cited by: §2.
 [7] (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1.
 [8] (2017) Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
 [9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.

 [10] (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866. Cited by: Appendix B, §4.5.
 [11] (2004) Nonnegative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5 (Nov), pp. 1457–1469. Cited by: §3.3.
 [12] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1.
 [13] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
 [14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.
 [15] (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: Appendix B, §1, §1, §2, §3.1, §3, §4.5.
 [16] (2011) Blind deconvolution using a normalized sparsity measure. In CVPR 2011, pp. 233–240. Cited by: §3.3.
 [17] (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: Appendix A, §4.
 [18] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
 [19] (2014) Speedingup convolutional neural networks using finetuned cpdecomposition. arXiv preprint arXiv:1412.6553. Cited by: §1, §2.
 [20] (2019) Compressing convolutional neural networks via factorized convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3977–3986. Cited by: Appendix B, §1, §1, §4.5.
 [21] (2015) Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814. Cited by: §1, §3.3.
 [22] (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
 [23] (2018) Bireal net: enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737. Cited by: §1.
 [24] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
 [25] (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §1.
 [26] (2015) Acdc: a structured efficient linear layer. arXiv preprint arXiv:1511.05946. Cited by: §1, §2.
 [27] (2016) You only look once: unified, realtime object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
 [28] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: Appendix A, §4.
 [29] (2015) Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067. Cited by: §2, §2, §3.2, §3.
 [30] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §3.3.
 [31] (2019) HAQ: hardwareaware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §1.

[32]
(2017)
Learning intrinsic sparse structures within long shortterm memory
. arXiv preprint arXiv:1709.05027. Cited by: §1.  [33] (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §1, §3.3, §3.3.
 [34] (2017) Coordinating filters for faster deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 658–666. Cited by: §1, §2.
 [35] (2018) Trained rank pruning for efficient deep neural networks. arXiv preprint arXiv:1812.02402. Cited by: Appendix B, §1, §1, §2, §2, §2, §3.3, §3.4, §4.4, §4.5.
 [36] (2015) Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483. Cited by: §1.

[37]
(2018)
Nisp: pruning networks using neuron importance score propagation
. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203. Cited by: Appendix B, §4.5.  [38] (2018) A systematic dnn weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199. Cited by: §1.
 [39] (2015) Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38 (10), pp. 1943–1955. Cited by: Appendix B, §1, §1, §2, §3.1, §3, §4.5.
Appendix A Experiment setups
Our experiments are conducted on the CIFAR-10 dataset [17] and the ImageNet ILSVRC2012 dataset [28]. We access both datasets via the API provided in the TorchVision Python package. As recommended in the PyTorch tutorial, we normalize the data and augment it with random crops and random horizontal flips before training. We use a batch size of 100 to train the CIFAR-10 models and 256 for the ImageNet models. For all models on CIFAR-10, both the full-rank SVD training and the low-rank fine-tuning run for 164 epochs; the learning rate is set to 0.001 initially and decayed by 0.1 at epochs 81 and 122. For models on ImageNet, the full-rank SVD training runs for 90 epochs, with an initial learning rate of 0.1 decayed by 0.1 every 30 epochs. The low-rank fine-tuning runs for 60 epochs, starting at learning rate 0.01 and decaying by 0.1 at epoch 30. We use a pretrained full-rank decomposed model (trained with the orthogonality regularizer but without the sparsity-inducing regularizer) to initialize the SVD training. The SGD optimizer with momentum 0.9 is used for all models, with weight decay 5e-4 for the CIFAR-10 models and 1e-4 for the ImageNet models. The accuracy reported in the experiments is the best validation accuracy achieved during the fine-tuning process.
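The learning-rate schedules described above reduce to a simple epoch-to-rate mapping. The sketch below illustrates them in plain Python (the function names are ours, purely for illustration, not from the paper's code):

```python
def cifar10_lr(epoch, base_lr=0.001):
    """CIFAR-10 schedule: initial rate 0.001, decayed by 0.1 at epochs 81 and 122."""
    lr = base_lr
    for milestone in (81, 122):
        if epoch >= milestone:
            lr *= 0.1
    return lr

def imagenet_svd_lr(epoch, base_lr=0.1):
    """ImageNet full-rank SVD training: initial rate 0.1, decayed by 0.1 every 30 epochs."""
    return base_lr * (0.1 ** (epoch // 30))
```

In a PyTorch training loop, the same schedules would typically be expressed with `torch.optim.lr_scheduler.MultiStepLR` and `StepLR`, respectively.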
During the SVD training, the decay parameter of the orthogonality regularizer is set to 1.0 for both channel-wise and spatial-wise decomposition on CIFAR-10. On ImageNet, it is set to 5.0 for both decomposition methods when training the ResNet-18 model; for the ResNet-50 model, it is set to 10.0 for channel-wise decomposition and 5.0 for spatial-wise decomposition. The decay parameter of the sparsity-inducing regularizer and the energy threshold used for singular value pruning are varied across the different sets of experiments to fully explore the accuracy vs. #FLOPs trade-off. In most cases, the energy threshold is selected through a line search, where we find the highest percentage of energy that can be pruned without causing a sudden accuracy drop. The decay parameters and the energy thresholds used in each set of experiments are reported alongside the experiment results in Appendix B.
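Singular value pruning under an energy threshold, as described above, amounts to discarding the smallest singular values while the removed energy (the sum of squared singular values) stays within the chosen fraction of the total. A minimal sketch of this rank selection, assuming singular values sorted in descending order (the function name is ours, for illustration only):

```python
def rank_after_energy_pruning(singular_values, energy_pruned):
    """Return how many singular values survive when at most `energy_pruned`
    (a fraction, e.g. 1.5e-5) of the total energy sum(s_i^2) is removed.
    Assumes `singular_values` is sorted in descending order."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    removed, rank = 0.0, len(energies)
    # Drop the smallest singular values while the removed energy stays under budget.
    for e in reversed(energies):
        if removed + e > energy_pruned * total:
            break
        removed += e
        rank -= 1
    return max(rank, 1)
```

A strong sparsity-inducing regularizer concentrates the energy in a few singular values, which is why a tiny pruned-energy fraction in the tables below can still correspond to a large speed-up.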
Appendix B Detailed experiment results
In this section we list the exact data used to plot the experiment result figures in Section 4. The results of our proposed method with various choices of decomposition method and sparsity-inducing regularizer on the CIFAR-10 dataset are listed in Table 3. All of these data points are visualized in Figures 2 and 3 to compare the trade-off tendencies under different conditions. As discussed in Section 4.5, the results of spatial-wise decomposition with the Hoyer regularizer for ResNet-20 and ResNet-32 are shown in Figure 4 to compare with previous methods. The results of both channel-wise and spatial-wise decomposition with the Hoyer regularizer are compared with previous methods in Figure 4 for ResNet-56 and ResNet-110. For the experiments on the ImageNet dataset, the results of our method for the ResNet-18 model are listed in Table 4, and those for the ResNet-50 model in Table 5.
The baseline results of previous works on compressing CIFAR-10 and ImageNet models, used for comparison in Figure 4, are listed in Tables 6–8. As there is a large number of previous works in this field, we only list the results of the most recent ones to show the state-of-the-art Pareto frontier. We therefore choose state-of-the-art low-rank compression methods such as Jaderberg et al. [15], Zhang et al. [39], TRP [35], and C-SGD [5], as well as recent filter pruning methods such as NISP [37], SFP [10], and CNN-FCF [20], as the baselines to compare our results against.
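For reference, the speed-up figures reported in the tables are FLOP ratios between the original and the compressed model. Under channel-wise decomposition, a k x k convolution to c_out channels is replaced by a k x k convolution to `rank` channels followed by a 1 x 1 convolution back to c_out; a per-layer version of the ratio can be sketched as follows (a simplified estimate under our own naming, counting only multiply–accumulates):

```python
def conv_flops(c_in, c_out, k, h, w):
    """Multiply-accumulate count of a k x k convolution producing an h x w output map."""
    return c_out * c_in * k * k * h * w

def channelwise_speedup(c_in, c_out, k, h, w, rank):
    """Per-layer FLOP ratio when a k x k conv is replaced by a k x k conv to
    `rank` channels followed by a 1 x 1 conv back to c_out (channel-wise
    low-rank decomposition)."""
    full = conv_flops(c_in, c_out, k, h, w)
    low_rank = conv_flops(c_in, rank, k, h, w) + conv_flops(rank, c_out, 1, h, w)
    return full / low_rank
```

For example, decomposing a 3x3, 64-to-64-channel layer on a 32x32 feature map down to rank 8 yields a 7.2x per-layer reduction; the whole-model speed-ups in the tables aggregate such ratios over all decomposed layers.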
Table 3: Results of the proposed method on CIFAR-10 under different decomposition methods and sparsity-inducing regularizers.

| Model (Base Acc.) | Reg. | Decay | Energy Pruned | Acc. Gain (%) | Speed-Up |
|---|---|---|---|---|---|
| ResNet-20, channel-wise (90.93%) | Hoyer | 0.03 | 1.5e-5 | -0.04 | 2.20 |
| | Hoyer | 0.07 | 6.0e-6 | -0.27 | 2.66 |
| | Hoyer | 0.1 | 3.0e-6 | -0.54 | 2.94 |
| | L1 | 0.01 | 7.0e-2 | -1.13 | 1.43 |
| | L1 | 0.001 | 2.7e-2 | -0.63 | 1.59 |
| | L1 | 0.1 | 1.0e-1 | -0.32 | 2.10 |
| | L1 | 0.3 | 1.0e-1 | -0.48 | 2.84 |
| | None | 0.0 | 1.9e-1 | -0.37 | 2.03 |
| | None | 0.0 | 2.8e-1 | -0.52 | 2.54 |
| | None | 0.0 | 3.3e-1 | -0.61 | 2.88 |
| ResNet-20, spatial-wise (90.99%) | Hoyer | 0.01 | 1.0e-3 | -0.40 | 3.26 |
| | Hoyer | 0.03 | 2.0e-5 | -0.10 | 3.87 |
| | Hoyer | 0.1 | 4.0e-6 | -0.86 | 4.77 |
| | Hoyer | 0.01 | 7.0e-3 | -1.03 | 5.16 |
| | L1 | 0.01 | 6.0e-2 | -0.58 | 2.26 |
| | L1 | 0.1 | 1.0e-1 | -0.52 | 3.55 |
| | L1 | 0.3 | 1.0e-1 | -0.83 | 4.79 |
| | None | 0.0 | 2.9e-1 | -0.59 | 2.44 |
| | None | 0.0 | 3.9e-1 | -0.78 | 3.15 |
| | None | 0.0 | 4.8e-1 | -1.30 | 4.05 |
| ResNet-32, channel-wise (92.12%) | Hoyer | 0.003 | 3.0e-3 | -0.04 | 2.22 |
| | Hoyer | 0.01 | 1.0e-4 | -0.10 | 2.44 |
| | Hoyer | 0.03 | 2.0e-6 | -0.86 | 2.56 |
| | L1 | 0.03 | 2.0e-2 | -0.22 | 1.58 |
| | L1 | 0.1 | 1.0e-1 | -0.21 | 2.84 |
| | L1 | 0.3 | 5.0e-2 | -0.96 | 3.08 |
| | None | 0.0 | 1.8e-1 | -0.30 | 2.23 |
| | None | 0.0 | 2.1e-1 | -0.11 | 2.41 |
| | None | 0.0 | 2.3e-1 | -0.27 | 2.51 |
| ResNet-32, spatial-wise (92.14%) | Hoyer | 0.001 | 5.0e-2 | -0.52 | 2.56 |
| | Hoyer | 0.005 | 5.0e-3 | -0.38 | 3.93 |
| | Hoyer | 0.01 | 8.0e-4 | -0.62 | 4.57 |
| | Hoyer | 0.03 | 8.0e-6 | -1.12 | 5.30 |
| | L1 | 0.03 | 7.0e-2 | -0.13 | 2.60 |
| | L1 | 0.1 | 2.5e-2 | -0.34 | 4.20 |
| | L1 | 0.1 | 1.5e-1 | -0.96 | 5.32 |
| | None | 0.0 | 3.8e-1 | -0.60 | 3.62 |
| | None | 0.0 | 4.8e-1 | -1.76 | 4.71 |
| | None | 0.0 | 5.3e-1 | -2.14 | 5.34 |
| ResNet-56, channel-wise (93.28%) | Hoyer | 0.001 | 2.0e-2 | -0.39 | 2.70 |
| | Hoyer | 0.003 | 1.0e-3 | -0.29 | 3.49 |
| | Hoyer | 0.01 | 7.0e-6 | -0.41 | 4.35 |
| | Hoyer | 0.01 | 2.0e-5 | -0.68 | 4.94 |
| | Hoyer | 0.03 | 3.0e-7 | -1.20 | 5.16 |
| | L1 | 0.1 | 3.0e-2 | -0.30 | 4.25 |
| | L1 | 0.1 | 1.5e-1 | -0.59 | 4.86 |
| | None | 0.0 | 2.8e-1 | -0.16 | 2.91 |
| | None | 0.0 | 3.8e-1 | -0.98 | 3.71 |
| | None | 0.0 | 4.7e-1 | -1.78 | 4.70 |
| ResNet-56, spatial-wise (93.36%) | Hoyer | 0.001 | 3.0e-2 | -0.17 | 3.07 |
| | Hoyer | 0.003 | 1.0e-3 | -0.09 | 3.75 |
| | Hoyer | 0.01 | 1.0e-4 | -0.70 | 5.43 |
| | Hoyer | 0.03 | 1.0e-6 | -1.37 | 6.90 |
| | L1 | 0.03 | 5.0e-3 | -0.24 | 3.19 |
| | L1 | 0.03 | 5.0e-2 | -0.90 | 5.61 |
| | L1 | 0.03 | 2.5e-1 | -1.38 | 6.76 |
| | None | 0.0 | 2.8e-1 | -0.18 | 2.96 |
| | None | 0.0 | 4.7e-1 | -0.47 | 4.76 |
| | None | 0.0 | 5.2e-1 | -2.22 | 5.43 |
| ResNet-110, channel-wise (93.58%) | Hoyer | 0.001 | 5.0e-3 | -0.38 | 3.85 |
| | Hoyer | 0.003 | 3.0e-4 | -0.34 | 5.00 |
| | Hoyer | 0.01 | 3.0e-7 | -0.60 | 6.66 |
| | Hoyer | 0.03 | 1.0e-6 | -1.27 | 8.76 |
| | L1 | 0.03 | 1.0e-1 | -0.28 | 5.02 |
| | L1 | 0.03 | 3.0e-1 | -1.27 | 7.44 |
| | None | 0.0 | 3.7e-1 | -0.32 | 4.26 |
| | None | 0.0 | 4.6e-1 | -1.86 | 5.44 |
| | None | 0.0 | 5.5e-1 | -2.59 | 7.03 |
| ResNet-110, spatial-wise (93.93%) | Hoyer | 0.001 | 1.3e-2 | -0.10 | 4.75 |
| | Hoyer | 0.003 | 7.0e-4 | -0.46 | 6.42 |
| | Hoyer | 0.01 | 2.0e-5 | -1.28 | 8.76 |
| | Hoyer | 0.03 | 2.0e-8 | -2.03 | 10.06 |
| | L1 | 0.03 | 3.0e-2 | -0.42 | 5.02 |
| | L1 | 0.03 | 1.0e-1 | -0.67 | 6.45 |
| | L1 | 0.03 | 1.5e-1 | -1.01 | 7.21 |
| | L1 | 0.03 | 2.5e-1 | -1.36 | 8.66 |
| | None | 0.0 | 4.7e-1 | -1.56 | 5.69 |
| | None | 0.0 | 5.6e-1 | -2.27 | 7.55 |
| | None | 0.0 | 6.1e-1 | -3.44 | 8.87 |
Table 4: Results of the proposed method for ResNet-18 on ImageNet.

| Decompose | Base Acc. | Decay | Energy Pruned | Acc. Gain (%) | Speed-Up |
|---|---|---|---|---|---|
| Channel | 88.54% | 0.002 | 5.0e-4 | -0.94 | 1.45 |
| | | 0.003 | 1.0e-4 | -1.28 | 2.03 |
| | | 0.005 | 1.0e-4 | -2.47 | 2.98 |
| | | 0.01 | 1.0e-5 | -4.20 | 4.21 |
| Spatial | 88.54% | 0.002 | 1.0e-4 | -0.67 | 1.61 |
| | | 0.005 | 1.0e-4 | -0.84 | 2.98 |
| | | 0.01 | 1.0e-4 | -3.13 | 6.36 |
Table 5: Results of the proposed method for ResNet-50 on ImageNet.

| Decompose | Base Acc. | Decay | Energy Pruned | Acc. Gain (%) | Speed-Up |
|---|---|---|---|---|---|
| Channel | 91.72% | 0.001 | 1.0e-4 | -0.02 | 1.37 |
| | | 0.002 | 1.0e-4 | -0.12 | 1.92 |
| | | 0.003 | 5.0e-5 | -0.54 | 2.51 |
| | | 0.005 | 5.0e-5 | -1.56 | 4.17 |
| Spatial | 91.91% | 0.0005 | 1.0e-3 | -0.06 | 1.44 |
| | | 0.001 | 1.0e-4 | -0.10 | 1.79 |
| | | 0.002 | 2.0e-4 | -1.09 | 3.05 |
Table 6: Baseline results of previous methods on CIFAR-10.

| Method | ResNet-20 Acc. | Sp. Up | ResNet-32 Acc. | Sp. Up | ResNet-56 Acc. | Sp. Up | ResNet-110 Acc. | Sp. Up |
|---|---|---|---|---|---|---|---|---|
| Zhang et al. [39] | -3.61% | 1.41 | -2.76% | 1.41 | | | | |
| Jaderberg et al. [15] | -2.25% | 1.66 | -2.29% | 1.68 | | | | |
| TRP-Ch [35] | -0.43% | 2.17 | -0.72% | 2.20 | | | | |
| TRP-Sp [35] | -0.37% | 2.84 | -0.75% | 3.40 | | | | |
| SFP [10] | -1.37% | 1.79 | -0.55% | 1.71 | -0.19% | 1.70 | -0.18% | 1.69 |
| CNN-FCF [20] | -1.07% | 1.71 | -0.25% | 1.73 | -0.24% | 1.75 | -0.09% | 1.76 |
| | -2.67% | 3.17 | -1.69% | 3.36 | -1.22% | 3.44 | -0.62% | 2.55 |
| C-SGD-5/8 [5] | | | | | -0.23% | 2.55 | -0.03% | 2.56 |
| NISP [37] | | | | | -0.03% | 1.77 | -0.18% | 1.78 |
Table 7: Baseline results of previous low-rank methods for ResNet-18 on ImageNet.

| Channel-wise method | Acc. | Sp. Up | Spatial-wise method | Acc. | Sp. Up |
|---|---|---|---|---|---|
| Zhang et al. [39] | -4.85% | | Jaderberg et al. [15] | -4.82% | |
| Zhang et al. [39] | -4.10% | | TRP-Sp [35] | -1.80% | |
| TRP-Ch [35] | -2.06% | | TRP-Sp [35] | -2.71% | |
| TRP-Ch [35] | -2.91% | | TRP-Sp [35] | -3.24% | |
| TRP-Ch [35] | -3.02% | | | | |
Table 8: Baseline results of previous pruning methods for ResNet-50 on ImageNet.

| Method | Acc. | Sp. Up | Method | Acc. | Sp. Up |
|---|---|---|---|---|---|
| SFP [10] | -0.81% | | NISP-50-A [37] | -0.21% | |
| CNN-FCF-A [20] | +0.26% | | NISP-50-B [37] | -0.89% | |
| CNN-FCF-B [20] | -0.19% | | C-SGD-70 [5] | -0.10% | |
| CNN-FCF-C [20] | -0.69% | | C-SGD-50 [5] | -0.29% | |
| CNN-FCF-D [20] | -1.37% | | C-SGD-30 [5] | -0.47% | |