1 Introduction
Deep Neural Networks (DNNs) have shown remarkable success in many computer vision tasks such as image classification He2016DeepRL , object detection Ren2015FasterRT , and semantic segmentation Chen2018DeepLabSI . Despite the high performance of large DNNs powered by cutting-edge parallel computing hardware, most state-of-the-art network architectures are not suitable for resource-restricted usage, such as on always-on or battery-powered low-end devices, due to limitations on computational capacity, memory, and power. To address this problem, low-rank decomposition methods Denton2014Exploiting ; jaderberg2014speeding ; Guo2018Network ; Wen2017Coordinating ; Alvarez2017Compression have been proposed to minimize channel-wise and spatial redundancy by decomposing the original network into a compact one with low-rank layers. Different from precedent works, this paper proposes a novel approach to designing low-rank networks.
Low-rank networks can be trained directly from scratch. However, it is difficult to obtain satisfactory results, for several reasons. (1) Low capacity: compared with the original full-rank network, the capacity of a low-rank network is limited, which makes its performance hard to optimize. (2) Deep structure: low-rank decomposition typically doubles the number of layers in a network; the additional layers make numerical optimization much more challenging because of exploding and/or vanishing gradients. (3) Rank selection: the rank of the decomposed network is often chosen as a hyper-parameter based on pre-trained networks, which may not be optimal for a network trained from scratch.
Alternatively, several previous works zhang2016accelerating ; Guo2018Network ; jaderberg2014speeding attempted to decompose pre-trained models in order to obtain initial low-rank networks. However, the heuristically imposed low rank can incur a huge accuracy loss, and network retraining is required to recover the performance of the original network as much as possible. Some attempts were made to use sparsity regularization Wen2017Coordinating ; chen2015compressing to constrain the network into a low-rank space. Though sparsity regularization reduces the error incurred by decomposition to some extent, performance still degrades rapidly as the compression rate increases.

In this paper, we propose a new method, namely Trained Rank Pruning (TRP), for training low-rank networks. We embed the low-rank decomposition into the training process by gradually pushing the weight distribution of a well-functioning network into a low-rank form, where all parameters of the original network are kept and optimized to maintain its capacity. We also propose a nuclear regularization, optimized by stochastic sub-gradient descent, that further constrains the weights in a low-rank space to boost TRP. The proposed solution is illustrated in Fig. 1.
Overall, our contributions are summarized below.

A new training method called TRP is presented by explicitly embedding the low-rank decomposition into the network training;

A nuclear regularization is optimized by stochastic sub-gradient descent to boost the performance of TRP;

Improved inference acceleration and reduced approximation accuracy loss for both channel-wise and spatial-wise decomposition methods.
2 Methodology
2.1 Preliminaries
Formally, the convolution filters in a layer can be denoted by a tensor $W \in \mathbb{R}^{n \times c \times k_h \times k_w}$, where $n$ and $c$ are the number of filters and input channels, and $k_h$ and $k_w$ are the height and width of the filters. An input $x$ of the convolution layer generates an output $y = W * x$. Channel-wise correlation zhang2016accelerating and spatial-wise correlation jaderberg2014speeding are explored to approximate convolution filters in a low-rank space. In this paper, we focus on these two decomposition schemes. However, unlike previous works, we propose a new training scheme, TRP, to obtain a low-rank network without re-training after decomposition.

2.2 Trained Rank Pruning
We propose a simple yet effective training scheme called Trained Rank Pruning (TRP), applied in a periodic fashion:

$$W^{t+1} = \begin{cases} W^{t} - \alpha \nabla f\left(W^{t}\right), & t \bmod m \neq 0 \\ T_{z}\left(W^{t}\right) - \alpha \nabla f\left(T_{z}\left(W^{t}\right)\right), & t \bmod m = 0 \end{cases} \qquad (1)$$

where $T_{z}$ is a low-rank tensor approximation operator, $\alpha$ is the learning rate, $t$ indexes the iteration, $z$ is the iteration of the operator $T_{z}$, and $m$ is the period for the low-rank approximation.
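The periodic scheme in Eq. (1) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `low_rank_approx` is a stand-in for the operator $T_z$ (here plain rank-$k$ truncation via SVD), and the quadratic loss, shapes, and all names are illustrative assumptions:

```python
import numpy as np

def low_rank_approx(W, k):
    """Project W onto the set of rank-k matrices via truncated SVD
    (a stand-in for the operator T_z in Eq. (1))."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def trp_train(W, grad_fn, alpha=0.1, m=5, k=2, steps=50):
    """Periodic TRP update: every m-th iteration, replace W by its
    low-rank approximation before taking the SGD step."""
    for t in range(steps):
        if t % m == 0:
            W = low_rank_approx(W, k)   # T_z(W^t)
        W = W - alpha * grad_fn(W)      # SGD step on the current weights
    return W

# Toy example: drive W toward a rank-2 target under a squared loss,
# whose gradient is simply W - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
W0 = rng.standard_normal((8, 8))
W = trp_train(W0, grad_fn=lambda W: W - target)
```

In the real setting, $W$ would be the reshaped convolution filters of each layer and `grad_fn` the back-propagated gradient of the network loss.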
We apply the low-rank approximation every $m$ SGD iterations, which saves training time to a large extent. As illustrated in Fig. 1, every $m$ iterations we perform the low-rank approximation on the original filters, while gradients are updated on the resultant low-rank form; otherwise, the network is updated via normal SGD. Our training scheme can be combined with arbitrary low-rank operators. In this work, we choose the low-rank techniques proposed in jaderberg2014speeding and zhang2016accelerating , both of which transform the 4-dimensional filters into a 2D matrix $W$ and then apply the truncated singular value decomposition (TSVD). The SVD of the matrix $W$ can be written as:

$$W = \sum_{i=1}^{\mathrm{rank}(W)} \sigma_{i} U_{i} V_{i}^{\top} \qquad (2)$$

where $\sigma_{i}$ is the $i$-th singular value of $W$, with $\sigma_{1} \geq \sigma_{2} \geq \cdots \geq \sigma_{\mathrm{rank}(W)}$, and $U_{i}$ and $V_{i}$ are the corresponding singular vectors. The parameterized TSVD, $\mathrm{TSVD}(e)$, finds the smallest integer $k$ such that

$$\sum_{i=1}^{k} \sigma_{i}^{2} \;\geq\; e \sum_{i=1}^{\mathrm{rank}(W)} \sigma_{i}^{2} \qquad (3)$$

where $e \in (0, 1]$ is a predefined hyper-parameter, the energy-preserving ratio. After truncating the last $\mathrm{rank}(W) - k$ singular values, we transform the low-rank 2D matrix back into a 4D tensor.
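The rank selection of Eq. (3) amounts to keeping the smallest leading set of singular values whose squared sum reaches the energy threshold. A minimal NumPy sketch, assuming the 4D filter has already been reshaped into a 2D matrix (the function name `tsvd` and the reshape convention are illustrative):

```python
import numpy as np

def tsvd(W, e=0.95):
    """TSVD(e): keep the smallest k whose squared singular values
    retain at least a fraction e of the total energy, Eq. (3)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)     # ascending cumulative ratio
    k = int(np.searchsorted(energy, e) + 1)         # smallest k with ratio >= e
    W_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]          # rank-k reconstruction
    return W_hat, k
```

Channel-wise and spatial-wise schemes differ only in how the 4D filter tensor is flattened into this 2D matrix before the TSVD and folded back into a 4D tensor afterwards.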
2.3 Nuclear Norm Regularization
The nuclear norm is widely used in matrix completion problems. Recently, it has been introduced to constrain the network into a low-rank space during the training process Alvarez2017Compression :

$$\min_{W} \; f(W) + \lambda \sum_{l=1}^{L} \left\| W_{l} \right\|_{*} \qquad (4)$$

where $f(\cdot)$ is the objective loss function, $\lambda$ is a hyper-parameter setting the influence of the nuclear norm, and the nuclear norm $\|W_{l}\|_{*}$ is defined as $\sum_{i} \sigma_{i}^{l}$, with $\sigma_{i}^{l}$ the singular values of $W_{l}$. In this paper, we utilize stochastic sub-gradient descent Avron2012EfficientAP to optimize the nuclear norm regularization in the training process. Let $W = U \Sigma V^{\top}$ be the SVD of $W$, and let $U_{W}$, $V_{W}$ be $U$, $V$ truncated to the first $\mathrm{rank}(W)$ columns or rows; then $U_{W} V_{W}^{\top}$ is a sub-gradient of $\|W\|_{*}$ watson1992characterization . Thus, the sub-gradient of Eq. (4) in a layer is

$$\frac{\partial f}{\partial W_{l}} + \lambda \, U_{W_{l}} V_{W_{l}}^{\top} \qquad (5)$$
The nuclear norm and loss function are optimized simultaneously during the training of the networks and can further be combined with the proposed TRP.
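The sub-gradient in Eq. (5) can be sketched as below, following Watson's characterization; this is a minimal NumPy illustration (`nuclear_subgrad`, `regularized_grad`, and the tolerance-based numerical rank are illustrative choices, not from the paper):

```python
import numpy as np

def nuclear_norm(W):
    """||W||_*: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def nuclear_subgrad(W, tol=1e-10):
    """Sub-gradient U_W V_W^T of the nuclear norm, with U, V
    truncated to the columns of the nonzero singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = int((s > tol * s.max()).sum())  # numerical rank of W
    return U[:, :r] @ Vt[:r, :]

def regularized_grad(W, grad_loss, lam=1e-3):
    """Sub-gradient of f(W) + lam * ||W||_* for one layer, Eq. (5)."""
    return grad_loss + lam * nuclear_subgrad(W)
```

Since $U_{W} V_{W}^{\top}$ has unit spectral norm, the regularizer contributes a bounded correction of size $\lambda$ to each update.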
Method                  Top-1 (%)   Speed-up
Baseline                69.10       1.00x
TRP1                    65.46       1.81x
TRP1+Nu                 65.39       2.23x
zhang2016accelerating   63.1        1.41x
TRP2                    65.51       —
TRP2+Nu                 65.34       3.18x
jaderberg2014speeding   62.80       2.00x
3 Experiments
3.1 Implementation Details
We evaluate the performance of the TRP scheme on two common datasets, CIFAR-10 AlexCifar10 and ImageNet Deng2009ImageNetAL . We implement our TRP scheme on NVIDIA 1080 Ti GPUs. For training on CIFAR-10, we start with a base learning rate, train for 164 epochs, and decay the rate by a constant factor at two intermediate epochs. For ImageNet, we directly fine-tune the model with the TRP scheme from the pre-trained baseline for 10 epochs. For both datasets, we adopt the SGD solver to update the weights, with weight decay and momentum.

^{1}footnotetext: the implementation of Guo2018Network

3.2 Results on CIFAR-10
As shown in Table (b)b, for both spatial-wise (TRP1) and channel-wise (TRP2) decomposition, the proposed TRP outperforms the basic methods zhang2016accelerating ; jaderberg2014speeding on ResNet-20 and ResNet-56. Results become even better when nuclear regularization is used. For example, in the channel-wise decomposition (TRP2) of ResNet-56, TRP combined with nuclear regularization achieves a higher speed-up rate than zhang2016accelerating at the same accuracy drop. Our method also outperforms filter pruning li2016pruning and channel pruning He_2017_ICCV : the channel-wise decomposed TRP-trained ResNet-56 achieves higher accuracy at a greater acceleration than both He_2017_ICCV and li2016pruning , and with the help of nuclear regularization our method obtains a multiple of their acceleration rates with higher accuracy.
3.3 Results on ImageNet
The results on ImageNet are shown in Table (e)e and Table (e)e. For ResNet-18, our method outperforms the basic methods zhang2016accelerating ; jaderberg2014speeding . For example, in the channel-wise decomposition, TRP obtains a 1.81x speed-up with 86.48% Top-5 accuracy on ImageNet, which outperforms both the data-driven zhang2016accelerating ^{1} and data-independent zhang2016accelerating methods by a large margin. Nuclear regularization further increases the speed-up rate at the same accuracy.
For ResNet-50, to better validate the effectiveness of our method, we also compare the proposed TRP with He_2017_ICCV and Luo2017ThiNetAF . At the same speed-up, our decomposed ResNet-50 obtains much higher Top-1 and Top-5 accuracy than Luo2017ThiNetAF , and TRP achieves a higher acceleration than He_2017_ICCV at the same Top-5 degradation.
4 Conclusion
In this paper, we propose a new scheme, Trained Rank Pruning (TRP), for training low-rank networks. It leverages the capacity and structure of the original network by embedding the low-rank approximation in the training process. Furthermore, we propose a stochastic sub-gradient descent optimized nuclear norm regularization to boost TRP. The proposed TRP can be incorporated with any low-rank decomposition method. On the CIFAR-10 and ImageNet datasets, we have shown that our methods outperform basic methods in both channel-wise and spatial-wise decomposition.
References
 [1] J. M. Alvarez and M. Salzmann. Compressionaware training of deep networks. In NIPS, 2017.
 [2] H. Avron, S. Kale, S. P. Kasiviswanathan, and V. Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In ICML, 2012.
 [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 40:834–848, 2018.
 [4] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
 [5] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. Imagenet: A largescale hierarchical image database. CVPR, 2009.
 [6] E. Denton, W. Zaremba, J. Bruna, Y. Lecun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
 [7] J. Guo, Y. Li, W. Lin, Y. Chen, and J. Li. Network decoupling: From regular to depthwise separable convolutions. In BMVC, 2018.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [9] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
 [10] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 [11] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science, 2009.
 [12] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [13] J.H. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. ICCV, 2017.
 [14] J.H. Luo, H. Zhang, H.Y. Zhou, C.W. Xie, J. Wu, and W. Lin. Thinet: pruning cnn filters for a thinner net. TPAMI, 2018.
 [15] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 39:1137–1149, 2015.
 [16] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear algebra and its applications, 170:33–45, 1992.
 [17] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. In ICCV, 2017.
 [18] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. TPAMI, 38(10):1943–1955, 2016.