Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality Regularization and Singular Value Sparsification

04/20/2020 · Huanrui Yang et al. · Duke University, Tsinghua University, University of Nevada, Reno

Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the multiplication of two or more low-rank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce low rank through implicit approximations or via a costly singular value decomposition (SVD) process on every training step. The former approach usually induces a high accuracy loss while the latter has low efficiency. In this work, we propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its full-rank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensures the valid form of SVD and avoids gradient vanishing/exploding. Low rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a low-rank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a higher reduction in computation load at the same accuracy, compared not only to previous factorization methods but also to state-of-the-art filter pruning methods.

1 Introduction

The booming development of deep learning models and applications has enabled beyond-human performance in tasks like large-scale image classification [18, 9, 12, 13], object detection [27, 22, 8], and semantic segmentation [24, 3]. Such high performance, however, comes at the price of large memory consumption and computation load. For example, a ResNet-50 model needs approximately 4 GFLOPs (floating-point operations) to classify a color image of 224×224 pixels. The computation load can easily expand to tens or even hundreds of GFLOPs for detection or segmentation models using state-of-the-art DNNs as backbones [2]. This is a major challenge that prevents the deployment of modern DNN models on resource-constrained platforms such as phones, smart sensors, and drones.

Model compression techniques for DNN models, including element-wise pruning [7, 21, 38], structural pruning [33, 25, 20], quantization [23, 31], and factorization [15, 39, 36, 35], have been extensively studied. Among these methods, quantization and element-wise pruning can effectively reduce a model's memory consumption, but require specific hardware to realize efficient computation. Structural pruning reduces the computation load by removing redundant filters or channels. However, the complicated structures adopted in some modern DNNs (e.g., ResNet or DenseNet) enforce strict constraints on the input/output dimensions of certain layers. This requires additional filter grouping during pruning and filter rearranging after pruning to make the pruned structure valid [32, 5]. Factorization methods approximate the weight matrix of a layer with a multiplication of two or more low-rank matrices. They by nature keep the input/output dimensions of a layer unchanged, and therefore the resulting decomposed network can be supported by any common DNN computation architecture, without additional grouping or post-processing.

Previous investigations show that it is feasible to approximate the weight matrices of a pretrained DNN model with the multiplication of low-rank matrices [15, 39, 26, 4, 19]. However, these methods may greatly degrade the performance, even after post-hoc finetuning. Some other methods attempt to manipulate the "directions" of filters to implicitly reduce the rank of weight matrices [34, 20]. However, the difficulties in training and the implicitness of the rank representation prevent these methods from reaching a high compression rate. Nuclear-norm regularizers have been used to directly reduce the rank of weight matrices [1, 35]. Optimizing the nuclear norm requires conducting singular value decomposition (SVD) on every training step, which is inefficient, especially when dealing with larger models.

Our work aims to explicitly achieve a low-rank DNN during training without applying SVD on every step. In particular, we propose SVD training, which trains the weight matrix of each layer in the form of its full-rank SVD. The weight matrix is decomposed into the matrices of left-singular vectors, singular values and right-singular vectors, and the training is done on the decomposed variables. Furthermore, two techniques are proposed to induce low rank while maintaining high performance during the SVD training: (1) Singular vector orthogonality regularization, which keeps the singular vector matrices close to unitary throughout the training. It mitigates gradient vanishing/exploding during the training and provides a valid form of SVD to guarantee effective rank reduction. (2) Singular value sparsification, which applies sparsity-inducing regularizers on the singular values during the training to induce low rank. The low-rank model is finally achieved through singular value pruning. We evaluate the individual contribution of each technique as well as the overall performance when putting them together via ablation studies. Results show that the proposed method consistently beats state-of-the-art factorization and structural pruning methods on various tasks and model structures. To the best of our knowledge, this is the first algorithm to explicitly search for the optimal rank of each DNN layer during training without performing the decomposition operation at each training step.

2 Related Works on low-rank DNNs

Approximating a weight matrix with the multiplication of low-rank matrices is a straightforward idea for compressing DNNs. Early works in this field focus on designing the matrix and tensor decomposition scheme, especially for the 4-D tensor of a convolution kernel, so that the operation of a pretrained network layer can be closely approximated with cascaded low-rank layers [15, 39, 26, 4, 19]. Tensor decomposition techniques, notably CP-decomposition, are applied in early works to directly decompose the 4-D convolution kernel into 4 consecutive low-rank convolutions [19]. However, such decomposition techniques significantly increase the number of layers in the achieved network, making it harder to finetune towards good performance, especially when decomposing larger and deeper models [29]. Later works therefore propose to reshape the 4-D tensor into a 2-D matrix, apply a matrix decomposition technique like SVD to decompose the matrix, and finally reshape the factors back into 4-D tensors to get two consecutive layers. Notably, Zhang et al. [39] propose channel-wise decomposition, which uses SVD to decompose a convolution layer with a w×h kernel into two consecutive layers with kernel sizes w×h and 1×1, respectively. The computation reduction is achieved by exploiting the channel-wise redundancy, e.g., channels with smaller singular values in both decomposed layers are removed. Similarly, Jaderberg et al. [15] propose to decompose a convolution layer into two consecutive layers with fewer channels in between. They further utilize the spatial-wise redundancy to reduce the sizes of the convolution kernels in the decomposed layers to w×1 and 1×h, respectively. These methods provide a closed-form decomposition for each layer. If the SVD is done in full rank, these methods guarantee that the decomposed layers perform the same operation as the original layer. However, the weights of the pretrained model may not be low-rank by nature, so the low rank manually imposed after decomposition by removing small singular values inevitably leads to high accuracy loss as the compression ratio increases [35].

Methods have been proposed to reduce the ranks of weight matrices during the training process in order to achieve low-rank decomposition with low accuracy loss. Wen et al. [34] induce low rank by applying an "attractive force" regularizer to increase the correlation of different filters in a certain layer. Ding et al. [5] achieve a similar goal by optimizing with "centripetal SGD," which moves multiple filters towards a set of clustering centers. Both methods can reduce the rank of the weight matrices without performing actual low-rank decomposition during the training. However, the rank representations in these methods are implicit, so the regularization effects are weak and may lead to a sharp performance decrease when seeking a high speedup. On the other hand, Alvarez et al. [1] and Xu et al. [35] explicitly estimate and reduce the rank throughout the training by adding a nuclear norm (defined as the sum of all singular values) regularizer to the training objective. These methods require performing SVD to compute and optimize the nuclear norm of each layer on every optimization step. Since the complexity of the SVD operation on an m×n weight matrix is O(mn·min(m, n)) and the gradient computation through SVD is not straightforward [6], performing SVD on every step is time consuming.

To explicitly achieve a low-rank network without performing costly decomposition on each training step, Tai et al. [29] propose to directly train the network from scratch in the low-rank decomposed form, and add batch normalization [14] between the decomposed layers to tackle the potential gradient vanishing or exploding problem caused by the doubling of layers after decomposition. However, the low-rank decomposed training scheme used in this line of work requires setting the rank of each layer before the training [29]. The manually chosen low rank may not lead to the optimal compression. Also, training the low-rank model from scratch makes the optimization harder, as lower rank implies lower model capacity [35].

3 Proposed Method

Figure 1: The training, compressing and finetuning pipeline of the proposed method.

Building upon previous works, we combine the ideas of decomposed training and trained low rank in this work. As shown in Figure 1, the model is first trained in a decomposed form through full-rank SVD training, then undergoes singular value pruning for rank reduction, and is finally finetuned for further accuracy recovery. As we will explain in Section 3.1, the model is trained in the form of the spatial-wise [15] or channel-wise decomposition [39] to avoid the time-consuming SVD. Unlike the training procedure proposed by [29], we train the decomposed model at its full rank to preserve the model capacity. During the SVD training, we apply orthogonality regularization to the singular vector matrices and sparsity-inducing regularizers to the singular values of each layer, the details of which will be discussed in Sections 3.2 and 3.3, respectively. Section 3.4 will elaborate on the full objective of the SVD training and the overall model compression pipeline. This method is able to achieve a high compression rate by inducing low rank through training, without the need for performing decomposition on every training step.

3.1 SVD training of deep neural networks

In this work, we propose to train the neural network in its singular value decomposition form, where each layer is decomposed into two consecutive layers with no additional operations in between. For a fully connected layer, the weight W is a 2-D matrix with dimension m×n. Following the form of SVD, W can be directly decomposed into three variables as W = U·diag(s)·V^T, with dimensions U ∈ R^(m×r), s ∈ R^r and V ∈ R^(n×r). Both U and V shall be orthogonal matrices. In the full-rank setting where r = min(m, n), W can be exactly reconstructed as U·diag(s)·V^T. For a neural network, this is equivalent to decomposing a layer with weight W into two consecutive layers with weights V^T and U·diag(s), respectively.
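As a concrete illustration of this decomposition, the following minimal PyTorch sketch (the framework choice and variable names are ours, purely for illustration) decomposes a random fully connected weight at full rank and checks that the two consecutive layers reproduce the original layer:

```python
import torch

# Minimal sketch: decompose a fully connected weight W (m x n) at full rank
# r = min(m, n) and verify the two-layer equivalence W = U diag(s) V^T.
m, n = 64, 128
W = torch.randn(m, n)

U, s, Vh = torch.linalg.svd(W, full_matrices=False)  # U: m x r, s: r, Vh: r x n
V = Vh.t()                                           # V: n x r
r = s.numel()                                        # full rank, r = min(m, n)

# Reconstruction check: W = U diag(s) V^T
W_rec = U @ torch.diag(s) @ V.t()
assert torch.allclose(W, W_rec, atol=1e-4)

# Equivalent two consecutive linear layers: x -> V^T x -> U diag(s) (V^T x)
x = torch.randn(5, n)
y_full = x @ W.t()
y_dec = ((x @ V) * s) @ U.t()   # first layer V^T, second layer U diag(s)
assert torch.allclose(y_full, y_dec, atol=1e-4)
```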

For a convolution layer, the kernel can be represented as a 4-D tensor K with dimension N×C×w×h. Here N, C, w and h represent the number of filters, the number of input channels, the width and the height of the filter, respectively. This work mainly focuses on the channel-wise decomposition method [39] and the spatial-wise decomposition method [15] to decompose the convolution layer, as these methods have shown their effectiveness in previous CNN decomposition research. For channel-wise decomposition, K is first reshaped to a 2-D matrix of dimension N×(Cwh). This matrix is then decomposed with SVD into U ∈ R^(N×r), s ∈ R^r and V ∈ R^((Cwh)×r), where U and V are orthogonal matrices and r = min(N, Cwh). The original convolution layer is therefore decomposed into two consecutive layers, with kernels reshaped from diag(s)·V^T (a w×h convolution with r filters) and from U (a 1×1 convolution with N filters), respectively. Spatial-wise decomposition shares a similar process with the channel-wise decomposition. The major difference is that K is now reshaped to (Cw)×(Nh) and then decomposed into U ∈ R^((Cw)×r), s ∈ R^r, and V ∈ R^((Nh)×r) with r = min(Cw, Nh). The resulting decomposed layers have w×1 and 1×h kernels, respectively. [39] and [15] theoretically show that the decomposed layers can exactly replicate the function of the original convolution layer in the full-rank setting. Therefore, training the decomposed model at full rank should achieve a similar accuracy as training the original model.
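The reshaping step of the channel-wise decomposition can be illustrated with a short sketch. The snippet below is a minimal PyTorch example under the assumptions stated in its comments (PyTorch's kernel layout and an arbitrary random kernel); it is a sketch, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

# Sketch of channel-wise decomposition: a conv kernel K of shape (N, C, kh, kw)
# is flattened to an N x (C*kh*kw) matrix, factored with a full-rank SVD, and
# mapped back to two consecutive convolutions (kh x kw with r filters, then 1 x 1).
N, C, kh, kw = 32, 16, 3, 3
K = torch.randn(N, C, kh, kw)

K2d = K.reshape(N, C * kh * kw)
U, s, Vh = torch.linalg.svd(K2d, full_matrices=False)   # r = min(N, C*kh*kw)
r = s.numel()

K1 = (torch.diag(s) @ Vh).reshape(r, C, kh, kw)  # first layer: r filters, kh x kw kernel
K2 = U.reshape(N, r, 1, 1)                       # second layer: N filters, 1 x 1 kernel

x = torch.randn(1, C, 8, 8)
y_full = F.conv2d(x, K, padding=1)
y_dec = F.conv2d(F.conv2d(x, K1, padding=1), K2)
assert torch.allclose(y_full, y_dec, atol=1e-3)  # full rank reproduces the original layer
```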

During the SVD training, for each layer we use the variables from the decomposition, i.e., U, s and V, instead of the original kernel or weight, as the trainable variables in the network. The forward pass is executed by converting U, s and V into the form of the two consecutive layers as demonstrated above, and the back propagation and optimization are done directly with respect to the U, s and V of each layer. In this way, we can access the singular values directly without performing the time-consuming SVD on each step.

Note that U and V need to be orthogonal to guarantee that the low-rank approximation can be done by removing small singular values, but this is not naturally induced by the decomposed training process. Therefore we add an orthogonality regularization to U and V to tackle this problem, as discussed in Section 3.2. Rank reduction is induced by adding sparsity-inducing regularizers to the singular values s of each layer, which will be discussed in Section 3.3.

3.2 Singular vector orthogonality regularizer

In a standard SVD procedure, the resulting U and V are orthogonal by construction, which provides a theoretical guarantee for the low-rank approximation. However, U and V in each layer are treated as free trainable variables in the decomposed training process, so the orthogonality may not hold. Without the orthogonality property, it is unsafe to prune a singular value in s even if it reaches a small value, because the corresponding singular vectors in U and V may have high energy and lead to a large difference in the result.

To make the form of SVD valid and enable effective rank reduction via singular value pruning, we introduce an orthogonality regularization loss on U and V as:

L_o(U, V) = \frac{1}{r^2} \left( \left\| U^\top U - I \right\|_F^2 + \left\| V^\top V - I \right\|_F^2 \right)    (1)

where \|\cdot\|_F is the Frobenius norm of a matrix and r is the rank of U and V. Note that the ranks of U and V are the same given their definition in the decomposed training procedure. Adding the orthogonality loss in Equation (1) to the total loss function forces the Us and Vs of all the layers to stay close to orthogonal matrices.

Beyond maintaining a valid SVD form, the orthogonality regularization also brings additional benefits to the performance of the decomposed training process. The decomposed training process converts one layer in the original network into two consecutive layers, and therefore doubles the number of layers. As noted in [29], this may worsen the problem of exploding or vanishing gradients during the optimization, degrading the performance of the achieved model. Since the proposed orthogonality loss keeps all the columns of U and V close to unit norm, it can effectively prevent the gradient from exploding or vanishing when passing through the variables U and V, therefore helping the training process achieve a higher accuracy. The accuracy gain brought by training with the orthogonality loss will be discussed in our ablation study in Section 4.1.
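As a reference for how a regularizer of the form in Equation (1) could be implemented, here is a minimal PyTorch sketch; the normalization by r^2 follows the equation above and the function name is illustrative:

```python
import torch

def orthogonality_loss(U: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Sketch of the orthogonality regularizer in Equation (1): penalize the
    deviation of U^T U and V^T V from the identity, normalized by r^2."""
    r = U.shape[1]                      # U: m x r and V: n x r share the same rank r
    eye = torch.eye(r, device=U.device, dtype=U.dtype)
    return (torch.norm(U.t() @ U - eye) ** 2 +
            torch.norm(V.t() @ V - eye) ** 2) / r ** 2
```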

3.3 Singular value sparsity-inducing regularizer

With orthogonal singular vector matrices, reducing the rank of the decomposed network is equivalent to making the singular value vector s of each layer sparse. Although the sparsity of a vector is directly represented by its ℓ0 norm, it is hard to optimize the ℓ0 norm through gradient-based methods. Inspired by recent works in DNN pruning [21, 33], we use differentiable sparsity-inducing regularizers to push more elements of s closer to zero, and apply post-training pruning to make the singular value vector sparse.

For the choice of the sparsity-inducing regularizer, the ℓ1 norm has been commonly applied in feature selection [30] and DNN pruning [33]. The ℓ1 regularizer takes the form of ||s||_1 = Σ_i |s_i|, which is both almost-everywhere differentiable and convex, making it friendly for optimization. Moreover, applying the ℓ1 regularizer to the singular values is equivalent to regularizing with the nuclear norm of the original weight matrix, which is a popular approximation to the rank of a matrix [35].

However, the ℓ1 norm is proportional to the scaling of the parameters, i.e., ||αs||_1 = α||s||_1 for a non-negative constant α. Therefore, minimizing the ℓ1 norm of s will shrink all the singular values simultaneously. In such a situation, some singular values that are close to zero after training may still contain a large portion of the matrix's energy. Pruning such singular values may undermine the performance of the neural network.

To mitigate the proportional scaling problem of the ℓ1 regularizer, previous works in compressed sensing have used the Hoyer regularizer to induce sparsity in non-negative matrix factorization [11] and blind deconvolution [16], where the Hoyer regularizer shows superior performance compared with other methods. The Hoyer regularizer is formulated as

L_H(s) = \frac{\|s\|_1}{\|s\|_2} = \frac{\sum_i |s_i|}{\sqrt{\sum_i s_i^2}}    (2)

which is the ratio of the ℓ1 norm and the ℓ2 norm of a vector [16]. It can be easily seen that the Hoyer regularizer is almost everywhere differentiable and scale-invariant. The differentiable property implies that the Hoyer regularizer can be easily optimized as part of the objective function. The scale-invariant property means that, when we apply the Hoyer regularizer to s, the total energy is retained as the singular values get sparser. Therefore most of the energy is kept within the top singular values while the rest get close to zero. This makes the Hoyer regularizer attractive for our training process. The effectiveness of the ℓ1 regularizer and the Hoyer regularizer is explored and compared in Section 4.3.
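A minimal PyTorch sketch of the Hoyer regularizer in Equation (2) is given below, together with a quick numerical check of its scale invariance versus the ℓ1 norm; the small epsilon guarding against division by zero and the function name are our additions:

```python
import torch

def hoyer_regularizer(s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of Equation (2): the ratio of the l1 norm to the l2 norm of the
    singular-value vector (eps only avoids division by zero)."""
    return s.abs().sum() / (s.norm(p=2) + eps)

s = torch.tensor([3.0, 1.0, 0.1])
print(hoyer_regularizer(s), hoyer_regularizer(10 * s))  # identical: scale-invariant
print(s.abs().sum(), (10 * s).abs().sum())              # l1 grows with the scale
```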

3.4 Overall objective and training procedure

With the analysis above, we propose the overall objective function of the decomposed training as:

\min_{\{U_l, s_l, V_l\}} \; L_T\left(\{U_l \, \mathrm{diag}(s_l) \, V_l^\top\}\right) + \sum_l \left( \lambda_s L_s(s_l) + \lambda_o L_o(U_l, V_l) \right)    (3)

Here L_T is the training loss computed on the model with decomposed layers. L_o denotes the orthogonality loss provided in Equation (1), which is calculated on the singular vector matrices U_l and V_l of layer l and summed over all layers. L_s is the sparsity-inducing regularization loss, applied to the vector of singular values s_l of each layer. We explore the use of both the ℓ1 regularizer and the Hoyer regularizer (Equation (2)) as L_s in this work. λ_s and λ_o are the decay parameters for the sparsity-inducing regularization loss and the orthogonality loss, respectively, which are hyperparameters of the proposed training process. λ_o can be chosen as a large positive number to enforce the orthogonality of the singular vectors, and λ_s can be modified to explore the tradeoff between accuracy and FLOPs of the achieved low-rank model.
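The per-step loss in Equation (3) can be assembled roughly as in the following sketch, which reuses the orthogonality and Hoyer helpers sketched above; the default λ values and the function signature are illustrative placeholders, not prescribed settings:

```python
import torch

def svd_training_loss(task_loss, layers, lambda_o=1.0, lambda_s=0.03):
    """Sketch of the overall objective in Equation (3): the task loss computed
    on the decomposed model plus orthogonality and sparsity penalties summed
    over layers. `layers` is an iterable of (U, s, V) parameter triples."""
    total = task_loss
    for U, s, V in layers:
        total = total + lambda_o * orthogonality_loss(U, V)
        total = total + lambda_s * hoyer_regularizer(s)
    return total
```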

As shown in Figure 1, the low-rank decomposed network is achieved through a three-stage process of full-rank SVD training, singular value pruning and low-rank finetuning. First we train a full-rank decomposed network using the objective function in Equation (3). Training at full rank enables the decomposed model to easily reach the performance of the original model, as there is no capacity loss during the full-rank decomposition. With the help of the sparsity-inducing regularizer, most of the singular values will be close to zero after the full-rank training process. Inspired by [35], we prune the singular values using an energy-based threshold. For each layer we find the set S with the largest number of singular values subject to:

\sum_{i \in S} s_i^2 \le e \sum_{j=1}^{r} s_j^2    (4)

where e is a predefined energy threshold. We use the same threshold for all the layers in our experiments. When e is small enough, the singular values in the set S and the corresponding singular vectors can be removed safely with negligible performance loss. The pruning step dramatically reduces the rank of the decomposed layers. For a convolution layer with an N×C×w×h kernel, if we can reduce the rank of the decomposed layers to r, the number of FLOPs for the convolution is reduced by a factor of NCwh / (r(Cwh + N)) when channel-wise decomposition is applied, or NCwh / (r(Cw + Nh)) when spatial-wise decomposition is applied. The resulting low-rank model is then finetuned with λ_s set to zero for further performance recovery.
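The energy-based pruning rule of Equation (4) and the resulting speedup estimates can be sketched as follows; the helper names and the keep-at-least-one guard are illustrative conveniences, not part of the formal method:

```python
import torch

def prune_singular_values(U, s, V, energy_threshold=1e-3):
    """Sketch of Equation (4): drop the largest set of smallest singular values
    whose summed squared energy stays below energy_threshold * sum(s^2)."""
    s_sorted, idx = torch.sort(s.abs(), descending=True)
    energy = s_sorted ** 2
    tail_energy = energy.flip(0).cumsum(0).flip(0)   # at position k: sum of energy[k:]
    keep = int((tail_energy > energy_threshold * energy.sum()).sum().item())
    keep = max(keep, 1)                              # guard: keep at least one value
    keep_idx = idx[:keep]
    return U[:, keep_idx], s[keep_idx], V[:, keep_idx]

def channelwise_speedup(N, C, w, h, r):
    # FLOPs per output position: N*C*w*h (original) vs. r*C*w*h + N*r (decomposed)
    return (N * C * w * h) / (r * (C * w * h + N))

def spatialwise_speedup(N, C, w, h, r):
    # FLOPs per output position: N*C*w*h (original) vs. r*C*w + N*r*h (decomposed)
    return (N * C * w * h) / (r * (C * w + N * h))
```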

4 Experiment Results

In this section, we first perform ablation studies on the effectiveness of each part of our training procedure using ResNet models [9] on the CIFAR-10 dataset [17]. We then apply the proposed decomposed training method to various DNN models on the CIFAR-10 dataset and the ImageNet ILSVRC-2012 dataset [28]. The training hyperparameters for these models can be found in Appendix A. Different hyperparameters are used to explore the accuracy-#FLOPs trade-off induced by the proposed method. Our results consistently stay above the Pareto frontier of previous works.

4.1 Importance of the orthogonal constraints

Here we demonstrate the importance of adding the singular vector orthogonality loss to the decomposed training process. We separately train two decomposed models with the same optimizer and hyperparameters, one with the orthogonality loss (λ_o = 1.0) and the other without it (λ_o = 0.0). No sparsity-inducing regularizer is applied to the singular values in this set of experiments. The experiments are conducted on ResNet-56 and ResNet-110 models, both trained under channel-wise decomposition and spatial-wise decomposition. The CIFAR-10 dataset is used for training and testing. As shown in Table 1, the orthogonality loss enables the decomposed model to achieve similar or even better accuracy compared with the original full model. On the contrary, training the decomposed model without the orthogonality loss causes around 2% accuracy loss.

Model λ_o Accuracy (%)
ResNet-56 N/A 93.14
ResNet-56-Ch 1.0 93.28
ResNet-56-Ch 0.0 91.28
ResNet-56-Sp 1.0 93.36
ResNet-56-Sp 0.0 90.70
Model λ_o Accuracy (%)
ResNet-110 N/A 93.62
ResNet-110-Ch 1.0 93.58
ResNet-110-Ch 0.0 91.83
ResNet-110-Sp 1.0 93.93
ResNet-110-Sp 0.0 91.86
Table 1: Comparison of top-1 accuracy on CIFAR-10 of ResNet models after full-rank SVD training, with or without the orthogonality loss. [-Ch] means channel-wise decomposition and [-Sp] means spatial-wise decomposition.

4.2 Comparison of decomposition methods

As mentioned in Section 3.1, we mainly consider the channel-wise and the spatial-wise decomposition methods in this work. In this section, we compare the accuracy-#FLOPs tradeoff tendencies of the channel-wise decomposition and the spatial-wise decomposition. The tradeoff tendency of both decomposition methods is explored by training the decomposed models with Hoyer regularizers of different strengths (λ_s in Equation (3)) on the singular values. The results are shown in Figure 2. For shallower networks like the ResNet-20 or ResNet-32 models, the spatial-wise decomposition shows a large advantage over the channel-wise decomposition in the experiments, achieving a significantly higher compression rate at similar accuracy. However, with a deeper network like ResNet-56 or ResNet-110, the two decomposition methods perform similarly. As discussed in Section 3.1, spatial-wise decomposition can utilize both spatial-wise redundancy and channel-wise redundancy, while the channel-wise decomposition utilizes channel-wise redundancy only. The observations in this set of experiments indicate that as DNN models get deeper, the channel-wise redundancy becomes the dominant factor compared with the spatial-wise redundancy. This corresponds to the fact that deeper layers in modern DNNs typically have significantly more channels than shallower layers, resulting in significant channel-wise redundancy.

Figure 2: Effect of different decomposition methods. All models are achieved with the Hoyer regularizer for singular value sparsity. Dashed lines show the approximate tendency of the accuracy-compression tradeoff. See Appendix B Table 3 for detailed data.

4.3 Comparison of sparsity-inducing regularizers

Under the same model decomposition scheme, the main factor determining the final compression rate and the performance of the compressed model is the choice of sparsity-inducing regularizer for the singular values. As mentioned in Section 3.3, we mainly consider the ℓ1 and the Hoyer regularizer in the proposed training scheme. In this section, we use the spatial-wise decomposition setting to compare the effects of the ℓ1 regularizer and the Hoyer regularizer. A control group is also trained with no sparsity-inducing regularizer applied during the SVD training. The accuracy-#FLOPs tradeoff is explored by changing the regularization strength and the singular value pruning threshold. All other hyperparameters are kept the same during the SVD training and finetuning process for all models. Results are shown in Figure 3. The tradeoff curve of the ℓ1 regularizer consistently shows a larger slope than that of the Hoyer regularizer. Under low accuracy loss, the Hoyer regularizer achieves a higher compression rate than the ℓ1 regularizer. However, if we aim for an extremely high compression rate while allowing higher accuracy loss, the ℓ1 regularizer can perform better. One possible reason for this difference is that the ℓ1 regularizer makes all the singular values small through the training process, while the Hoyer regularizer maintains the total energy of the singular values during training, concentrating more energy in the larger singular values. Therefore more singular values can be removed from the decomposed model trained with the Hoyer regularizer without significantly hurting the performance of the model, resulting in a higher compression rate at low accuracy loss. But it is harder to keep most of the energy in a tiny number of singular values than to simply push everything closer to zero, so the ℓ1 regularizer may perform better in the case of extremely high speedup. Compared with the control group with no sparsity-inducing regularization, both the ℓ1 regularizer and the Hoyer regularizer achieve higher accuracy under a similar compression rate, especially at high compression rates where the accuracy gap between training with and without a sparsity-inducing regularizer is more significant. Therefore applying a sparsity-inducing regularizer on the singular values is important for reaching a high-performance low-rank model, as the weights will not naturally become low-rank during training.

Figure 3: Effect of applying different sparsity-inducing regularizers during SVD training. All models are achieved with spatial-wise decomposition. See Appendix B Table 3 for detailed data.

4.4 Effectiveness of the overall training procedure

To show that the "full-rank SVD training, singular value pruning and low-rank finetuning" framework proposed in Section 3.4 is essential for reaching a high-performance low-rank model, we take the architectures of the low-rank models achieved by the proposed training procedure, reinitialize all the weights with random values, and train the low-rank models from scratch. For a fair comparison, each reinitialized low-rank model is trained using the same training objective and hyperparameter choices as the low-rank finetuning step in our framework. As shown in Table 2, with the same architecture and training process, training the low-rank model from scratch leads to around 2% test accuracy loss compared with the accuracy achieved by the proposed training procedure. This result corresponds to the fact that low-rank models are harder to train from scratch due to their low capacity [35]. On the other hand, the full-rank SVD training step in our proposed framework provides sufficient capacity for the model to reach a high performance. Such high performance can still be preserved after singular value pruning, as the singular values are already sparse after the SVD training process.

Base Model Training method Accuracy (%)
ResNet-20 Our method 91.39
Speed Up: 3.26 From scratch 89.43
ResNet-32 Our method 91.76
Speed Up: 3.93 From scratch 90.55
ResNet-56 Our method 93.27
Speed Up: 3.75 From scratch 91.55
ResNet-110 Our method 93.47
Speed Up: 6.42 From scratch 91.03
Table 2: Comparison of top 1 accuracy of the achieved low-rank ResNet models on CIFAR-10 vs. training the same model architecture from scratch. All low-rank models reported here are achieved with spatial-wise decomposition and Hoyer regularizer.

4.5 Comparing with previous works

We apply the proposed SVD training framework to the ResNet-20, ResNet-32, ResNet-56 and ResNet-110 models on the CIFAR-10 dataset as well as the ResNet-18 and ResNet-50 models on the ImageNet ILSVRC-2012 dataset to compare the accuracy-#FLOPs tradeoff with previous methods. Here we mainly compare our method with state-of-the-art low-rank decomposition methods including Jaderberg et al. [15], Zhang et al. [39], TRP [35] and C-SGD [5], as well as recent filter pruning methods like NISP [37], SFP [10] and CNN-FCF [20]. The results for different models are shown in Figure 4. As analyzed in Section 4.2, the spatial-wise decomposition achieves a significantly higher compression rate than the channel-wise decomposition in shallower networks, while the two perform similarly when compressing a deeper model. Thus we compare the results of only the spatial-wise decomposition against previous works for ResNet-20 and ResNet-32. For other, deeper networks, we report the results for both channel-wise and spatial-wise decomposition. As most of the previous works focus on compressing the model with a small accuracy loss, here we use the Hoyer regularizer for the singular value sparsity, as it achieves a better compression rate than the ℓ1 norm under low accuracy loss (see Section 4.3). We use multiple strengths for the Hoyer regularizer to explore the accuracy-#FLOPs tradeoff, in order to compare against previous works at different accuracy levels. As shown in Figure 4, our proposed method consistently achieves higher FLOPs reduction with less accuracy loss than previous methods across models and datasets. These comparison results show that the proposed SVD training and singular value pruning scheme can effectively compress modern deep neural networks through low-rank decomposition.

Figure 4: Comparison of the accuracy-#FLOPs tradeoff against previous methods. Being closer to the top-right corner indicates a better tradeoff point. See Appendix B Tables 3-5 for detailed results of our method, and Tables 6-8 for all previous methods we compare to.

5 Conclusion

In this work, we propose the SVD training framework, which incorporates full-rank decomposed training, singular value pruning and low-rank finetuning to reach low-rank DNNs with minor accuracy loss. We decompose each DNN layer into its full-rank SVD form before the training and directly train the decomposed singular vectors and singular values, so we can keep an explicit measure of each layer's rank without performing SVD on each step. Orthogonality regularizers are applied to the singular vectors during the training to keep the decomposed layers in a valid SVD form, and sparsity-inducing regularizers are applied to the singular values to explicitly induce low-rank layers.

Thorough experiments are done to analyze each proposed technique. We demonstrate that the orthogonality regularization on the singular vectors is crucial to the performance of the decomposed training process. For decomposition methods, we find that the spatial-wise method performs better than the channel-wise method in shallower networks, while their performances are similar for deeper models. For the sparsity-inducing regularizer, we show that a higher compression rate can be achieved by the Hoyer regularizer than by the ℓ1 regularizer under low accuracy loss. Our training framework is further justified by the observation that training the low-rank model from scratch cannot reach the accuracy achieved by our method. We apply the proposed method to ResNet models of various depths on both the CIFAR-10 and ImageNet datasets, where our accuracy-#FLOPs tradeoff consistently stays above the Pareto frontier of previous methods, including both factorization and structural pruning methods. These results show that this work provides an effective way of learning low-rank deep neural networks.

Acknowledgments

This work was supported in part by NSF-1910299, NSF-1822085, DOE DE-SC0018064, and NSF IUCRC-1725456, as well as supports from Ergomotion, Inc.

References

  • [1] J. M. Alvarez and M. Salzmann (2017) Compression-aware training of deep networks. In Advances in Neural Information Processing Systems, pp. 856–867.
  • [2] A. Canziani, A. Paszke, and E. Culurciello (2016) An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
  • [4] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
  • [5] X. Ding, G. Ding, Y. Guo, and J. Han (2019) Centripetal SGD for pruning very deep convolutional networks with complicated structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4943–4953.
  • [6] M. Giles (2008) An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation. In the 5th International Conference on Automatic Differentiation.
  • [7] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [10] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866.
  • [11] P. O. Hoyer (2004) Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5 (Nov), pp. 1457–1469.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • [15] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
  • [16] D. Krishnan, T. Tay, and R. Fergus (2011) Blind deconvolution using a normalized sparsity measure. In CVPR 2011, pp. 233–240.
  • [17] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [19] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553.
  • [20] T. Li, B. Wu, Y. Yang, Y. Fan, Y. Zhang, and W. Liu (2019) Compressing convolutional neural networks via factorized convolutional filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3977–3986.
  • [21] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • [23] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • [25] J. Luo, J. Wu, and W. Lin (2017) ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
  • [26] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas (2015) ACDC: a structured efficient linear layer. arXiv preprint arXiv:1511.05946.
  • [27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [29] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067.
  • [30] R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
  • [31] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
  • [32] W. Wen, Y. He, S. Rajbhandari, M. Zhang, W. Wang, F. Liu, B. Hu, Y. Chen, and H. Li (2017) Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027.
  • [33] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
  • [34] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li (2017) Coordinating filters for faster deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 658–666.
  • [35] Y. Xu, Y. Li, S. Zhang, W. Wen, B. Wang, Y. Qi, Y. Chen, W. Lin, and H. Xiong (2018) Trained rank pruning for efficient deep neural networks. arXiv preprint arXiv:1812.02402.
  • [36] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang (2015) Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483.
  • [37] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
  • [38] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199.
  • [39] X. Zhang, J. Zou, K. He, and J. Sun (2015) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955.

Appendix A Experiment setups

Our experiments are done on the CIFAR-10 dataset [17] and the ImageNet ILSVRC-2012 dataset [28]. We access both datasets via the API provided by the TorchVision Python package. As recommended in the PyTorch tutorial, we normalize the data and augment it with random crop and random horizontal flip before training. We use a batch size of 100 to train the CIFAR-10 models and 256 for the ImageNet models. For all the models on CIFAR-10, both the full-rank SVD training and the low-rank finetuning run for 164 epochs. The learning rate is set to 0.001 initially and decayed by 0.1 at epochs 81 and 122. For models on ImageNet, the full-rank SVD training runs for 90 epochs, with an initial learning rate of 0.1 decayed by 0.1 every 30 epochs. The low-rank finetuning runs for 60 epochs, starting at learning rate 0.01 and decayed by 0.1 at epoch 30. We use a pretrained full-rank decomposed model (trained with the orthogonality regularizer but without the sparsity-inducing regularizer) to initialize the SVD training. The SGD optimizer with momentum 0.9 is used for optimizing all the models, with weight decay 5e-4 for CIFAR-10 models and 1e-4 for ImageNet models. The accuracy reported in the experiments is the best validation accuracy achieved during the finetuning process.
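For reference, the hyperparameters above can be collected into a single configuration sketch; the dictionary layout and key names are illustrative, with the values restating the settings described in this appendix:

```python
# Illustrative restatement of the Appendix A hyperparameters as a config sketch.
TRAIN_CONFIG = {
    "cifar10": {
        "batch_size": 100,
        "epochs": 164,                   # both full-rank SVD training and finetuning
        "lr": 0.001,
        "lr_decay": {"factor": 0.1, "milestones": [81, 122]},
        "weight_decay": 5e-4,
        "momentum": 0.9,
        "lambda_o": 1.0,                 # both channel-wise and spatial-wise
    },
    "imagenet": {
        "batch_size": 256,
        "svd_training": {"epochs": 90, "lr": 0.1, "lr_step": 30, "lr_factor": 0.1},
        "finetuning": {"epochs": 60, "lr": 0.01, "lr_decay_epoch": 30},
        "weight_decay": 1e-4,
        "momentum": 0.9,
        "lambda_o": {"resnet18": 5.0, "resnet50_channel": 10.0, "resnet50_spatial": 5.0},
    },
}
```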

During the SVD training, the decay parameter λ_o of the orthogonality regularizer is set to 1.0 for both channel-wise and spatial-wise decomposition on CIFAR-10. On ImageNet, λ_o is set to 5.0 for both decomposition methods when training the ResNet-18 model. For the ResNet-50 model, λ_o is set to 10.0 for channel-wise decomposition and 5.0 for spatial-wise decomposition. The decay parameter λ_s of the sparsity-inducing regularizer and the energy threshold e used for singular value pruning are altered across the different sets of experiments to fully explore the accuracy-#FLOPs tradeoff. In most cases, the energy threshold is selected through a line search, where we find the highest percentage of energy that can be pruned without leading to a sudden accuracy drop. The λ_s and the energy thresholds used in each set of experiments are reported alongside the experiment results in Appendix B.

Appendix B Detailed experiment results

In this section we list the exact data used to plot the experiment result figures in Section 4. The results of our proposed method with various choices of decomposition methods and sparsity-inducing regularizers, tested on the CIFAR-10 dataset, are listed in Table 3. All of these data points are visualized in Figure 2 and Figure 3 to compare the tradeoff tendencies under different conditions. As discussed in Section 4.5, the results of spatial-wise decomposition with the Hoyer regularizer for ResNet-20 and ResNet-32 are shown in Figure 4 to compare with previous methods. The results of both channel-wise and spatial-wise decomposition with the Hoyer regularizer are compared with previous methods in Figure 4 for ResNet-56 and ResNet-110. For experiments on the ImageNet dataset, the results of our method for the ResNet-18 model are listed in Table 4, and the results of our method for the ResNet-50 model are listed in Table 5.

The baseline results of previous works on compressing CIFAR-10 and ImageNet models, used for comparison in Figure 4, are listed in Tables 6-8. As there are a large number of previous works in this field, we only list the results of the most recent works here to show the state-of-the-art Pareto frontier. Therefore we choose state-of-the-art low-rank compression methods like Jaderberg et al. [15], Zhang et al. [39], TRP [35] and C-SGD [5], as well as recent filter pruning methods like NISP [37], SFP [10] and CNN-FCF [20], as the baselines to compare our results against.

Model Reg Type Decay Energy Pruned Accuracy Gain(%) Speed Up
ResNet-20 Hoyer 0.03 1.5e-5 0.04 2.20
Channel 0.07 6.0e-6 -0.27 2.66
0.1 3.0e-6 -0.54 2.94
Base Acc: L1 0.01 7.0e-2 1.13 1.43
90.93% 0.001 2.7e-2 0.63 1.59
0.1 1.0e-1 0.32 2.10
0.3 1.0e-1 -0.48 2.84
None 0.0 1.9e-1 -0.37 2.03
0.0 2.8e-1 -0.52 2.54
0.0 3.3e-1 -0.61 2.88
ResNet-20 Hoyer 0.01 1.0e-3 0.40 3.26
Spatial 0.03 2.0e-5 -0.10 3.87
0.1 4.0e-6 -0.86 4.77
Base Acc: 0.01 7.0e-3 -1.03 5.16
90.99% L1 0.01 6.0e-2 0.58 2.26
0.1 1.0e-1 -0.52 3.55
0.3 1.0e-1 -0.83 4.79
None 0.0 2.9e-1 0.59 2.44
0.0 3.9e-1 -0.78 3.15
0.0 4.8e-1 -1.30 4.05
ResNet-32 Hoyer 0.003 3.0e-3 0.04 2.22
Channel 0.01 1.0e-4 -0.10 2.44
0.03 2.0e-6 -0.86 2.56
Base Acc: L1 0.03 2.0e-2 0.22 1.58
92.12% 0.1 1.0e-1 -0.21 2.84
0.3 5.0e-2 -0.96 3.08
None 0.0 1.8e-1 -0.30 2.23
0.0 2.1e-1 -0.11 2.41
0.0 2.3e-1 -0.27 2.51
ResNet-32 Hoyer 0.001 5.0e-2 0.52 2.56
Spatial 0.005 5.0e-3 -0.38 3.93
0.01 8.0e-4 -0.62 4.57
Base Acc: 0.03 8.0e-6 -1.12 5.30
92.14% L1 0.03 7.0e-2 0.13 2.60
0.1 2.5e-2 -0.34 4.20
0.1 1.5e-1 -0.96 5.32
None 0.0 3.8e-1 -0.60 3.62
0.0 4.8e-1 -1.76 4.71
0.0 5.3e-1 -2.14 5.34
ResNet-56 Hoyer 0.001 2.0e-2 0.39 2.70
Channel 0.003 1.0e-3 -0.29 3.49
0.01 7.0e-6 -0.41 4.35
Base Acc: 0.01 2.0e-5 -0.68 4.94
93.28% 0.03 3.0e-7 -1.20 5.16
L1 0.1 3.0e-2 -0.30 4.25
0.1 1.5e-1 -0.59 4.86
None 0.0 2.8e-1 -0.16 2.91
0.0 3.8e-1 -0.98 3.71
0.0 4.7e-1 -1.78 4.70
ResNet-56 Hoyer 0.001 3.0e-2 0.17 3.07
Spatial 0.003 1.0e-3 -0.09 3.75
0.01 1.0e-4 -0.70 5.43
Base Acc: 0.03 1.0e-6 -1.37 6.90
93.36% L1 0.03 5.0e-3 -0.24 3.19
0.03 5.0e-2 -0.90 5.61
0.03 2.5e-1 -1.38 6.76
None 0.0 2.8e-1 -0.18 2.96
0.0 4.7e-1 -0.47 4.76
0.0 5.2e-1 -2.22 5.43
ResNet-110 Hoyer 0.001 5.0e-3 0.38 3.85
Channel 0.003 3.0e-4 -0.34 5.00
0.01 3.0e-7 -0.60 6.66
Base Acc: 0.03 1.0e-6 -1.27 8.76
93.58% L1 0.03 1.0e-1 -0.28 5.02
0.03 3.0e-1 -1.27 7.44
None 0.0 3.7e-1 -0.32 4.26
0.0 4.6e-1 -1.86 5.44
0.0 5.5e-1 -2.59 7.03
ResNet-110 Hoyer 0.001 1.3e-2 0.10 4.75
Spatial 0.003 7.0e-4 -0.46 6.42
0.01 2.0e-5 -1.28 8.76
Base Acc: 0.03 2.0e-8 -2.03 10.06
93.93% L1 0.03 3.0e-2 -0.42 5.02
0.03 1.0e-1 -0.67 6.45
0.03 1.5e-1 -1.01 7.21
0.03 2.5e-1 -1.36 8.66
None 0.0 4.7e-1 -1.56 5.69
0.0 5.6e-1 -2.27 7.55
0.0 6.1e-1 -3.44 8.87
Table 3: Full results of applying the proposed method on ResNet models on the CIFAR-10 dataset with various hyperparameters. [Decay] marks the decay parameter of the sparsity-inducing regularization, i.e., λ_s. [Energy Pruned] marks the energy threshold used for singular value pruning, i.e., e. [Accuracy Gain] denotes the gain in top-1 accuracy over the baseline full-rank model. [Speed Up] is computed as the ratio of #FLOPs of the original model to that of the achieved low-rank model.
Decompose Base Acc Decay Energy Pruned Accuracy Gain Speed Up
Channel 88.54% 0.002 5.0e-4 0.94% 1.45
0.003 1.0e-4 -1.28% 2.03
0.005 1.0e-4 -2.47% 2.98
0.01 1.0e-5 -4.20% 4.21
Spatial 88.54% 0.002 1.0e-4 0.67% 1.61
0.005 1.0e-4 -0.84% 2.98
0.01 1.0e-4 -3.13% 6.36
Table 4: Results of applying the proposed method on ResNet-18 model on the ImageNet dataset. Hoyer regularizer is used as the sparsity-inducing regularizer for the singular values. Top-5 validation accuracy is reported in the [Base Acc] and the [Accuracy Gain] columns. [Speed Up] is computed as the ratio of #FLOPs of the original model and the achieved low-rank model.
Decompose Base Acc Decay Energy Pruned Accuracy Gain Speed Up
Channel 91.72% 0.001 1.0e-4 0.02% 1.37
0.002 1.0e-4 -0.12% 1.92
0.003 5.0e-5 -0.54% 2.51
0.005 5.0e-5 -1.56% 4.17
Spatial 91.91% 0.0005 1.0e-3 0.06% 1.44
0.001 1.0e-4 -0.10% 1.79
0.002 2.0e-4 -1.09% 3.05
Table 5: Results of applying the proposed method on ResNet-50 model on the ImageNet dataset. Hoyer regularizer is used as the sparsity-inducing regularizer for the singular values. Top-5 validation accuracy is reported in the [Base Acc] and the [Accuracy Gain] columns. [Speed Up] is computed as the ratio of #FLOPs of the original model and the achieved low-rank model.
Method ResNet-20 ResNet-32 ResNet-56 ResNet-110
Accu. Sp. Up Accu. Sp. Up Accu. Sp. Up Accu. Sp. Up
Zhang et al. -3.61% 1.41 -2.76% 1.41 - - - -
Jaderberg et al. -2.25% 1.66 -2.29% 1.68 - - - -
TRP-Ch -0.43% 2.17 -0.72% 2.20 - - - -
TRP-Sp -0.37% 2.84 -0.75% 3.40 - - - -
SFP -1.37% 1.79 -0.55% 1.71 0.19% 1.70 0.18% 1.69
CNN-FCF -1.07% 1.71 -0.25% 1.73 0.24% 1.75 0.09% 1.76
-2.67% 3.17 -1.69% 3.36 -1.22% 3.44 -0.62% 2.55
C-SGD-5/8 - - - - 0.23% 2.55 0.03% 2.56
Nisp - - - - -0.03% 1.77 -0.18% 1.78
Table 6: Baselines on the CIFAR-10 dataset. [Accu.] means the Top-1 accuracy gain comparing to that of the full model. [Sp. Up] denotes speed up computed as the ratio of #FLOPs before and after the model compression. [-] is marked when no result is available in the paper.
Channel-wise Spatial-wise
Method Accu. Sp. Up Method Accu. Sp. Up
Zhang et al. -4.85% Jaderberg et al. -4.82%
Zhang et al. -4.10% TRP-Sp -1.80%
TRP-Ch -2.06% TRP-Sp -2.71%
TRP-Ch -2.91% TRP-Sp -3.24%
TRP-Ch -3.02%
Table 7: Baselines of compressing ResNet-18 model on the ImageNet dataset. [Accu.] means the Top-5 accuracy gain comparing to that of the full model. [Sp. Up] denotes speed up computed as the ratio of #FLOPs before and after the model compression.
Method Accu. Sp. Up Method Accu. Sp. Up
SFP -0.81% NISP-50-A -0.21%
CNN-FCF-A +0.26% NISP-50-B -0.89%
CNN-FCF-B -0.19% C-SGD-70 -0.10%
CNN-FCF-C -0.69% C-SGD-50 -0.29%
CNN-FCF-D -1.37% C-SGD-30 -0.47%
Table 8: Baselines of compressing ResNet-50 model on the ImageNet dataset. [Accu.] means the Top-5 accuracy gain comparing to that of the full model. [Sp. Up] denotes speed up computed as the ratio of #FLOPs before and after the model compression.