Low-rank convolutional neural networks
Large CNNs have delivered impressive performance in various computer vision applications, but their storage and computation requirements make them difficult to deploy on mobile devices. Recently, tensor decompositions have been used for speeding up CNNs. In this paper, we further develop the tensor decomposition technique. We propose a new algorithm for computing the low-rank tensor decomposition for removing the redundancy in the convolution kernels. The algorithm finds the exact global optimizer of the decomposition and is more effective than iterative methods. Based on the decomposition, we further propose a new method for training low-rank constrained CNNs from scratch. Interestingly, while achieving a significant speedup, the low-rank constrained CNNs sometimes deliver significantly better performance than their non-constrained counterparts. On the CIFAR-10 dataset, the proposed low-rank NIN model achieves 91.31% accuracy (without data augmentation), which also improves upon the state-of-the-art result. We evaluated the proposed method on the CIFAR-10 and ILSVRC12 datasets for a variety of modern CNNs, including AlexNet, NIN, VGG and GoogLeNet, with success. For example, the forward time of VGG-16 is reduced by half while the performance is still comparable. This empirical success suggests that low-rank tensor decompositions can be a very useful tool for speeding up large CNNs.
Over the course of three years, CNNs have revolutionized computer vision, setting new performance standards in many important applications; see, e.g., Krizhevsky et al. (2012); Farabet et al. (2013); Long et al. (2014). The breakthrough has been made possible by the abundance of training data, the deployment of new computational hardware (most notably, GPUs and CPU clusters) and large models. These models typically require a huge number of parameters to achieve state-of-the-art performance, and may take weeks to train even with high-end GPUs. On the other hand, there is a growing interest in deploying CNNs to low-end mobile devices. On such processors, the computational cost of applying the model becomes problematic, let alone training one, especially when real-time operation is needed. Storage of millions of parameters also complicates the deployment. Modern CNNs would find many more applications if both the computational cost and the storage requirement could be significantly reduced.
There are only a few recent works on speeding up CNNs. Denton et al. (2014) proposed some low-rank approximation and clustering schemes for the convolutional kernels. They achieved a 2× speedup for a single convolutional layer with a 1% drop in classification accuracy. Jaderberg et al. (2014) suggested using different tensor decomposition schemes, reporting a 4.5× speedup with a 1% drop in accuracy in a text recognition application. Lebedev et al. (2014) further explored the use of CP decomposition to approximate the convolutional kernels. Vanhoucke et al. (2011) showed that 8-bit quantization of the parameters can result in a significant speedup with minimal loss of accuracy. This method can be used in conjunction with low-rank approximations to achieve further speedup.
As convolution operations constitute the bulk of all computations in CNNs, simplifying the convolution layer would have a direct impact on the overall speedup. The convolution kernel in a typical CNN is a 4D tensor. The key observation is that there might be a significant amount of redundancy in the tensor. Ideas based on tensor decomposition seem to be a particularly promising way to remove the redundancy, as suggested by some previous works.
In this paper, we further develop the tensor decomposition idea. Our method is based on Jaderberg et al. (2014), but has several significant improvements. The contributions are summarized as follows:
A new algorithm for computing the low-rank tensor decomposition. Low-rank tensor decompositions are non-convex problems and difficult to compute in general; Jaderberg et al. (2014) use iterative schemes to get an approximate local solution. But we find that the particular form of low-rank decomposition in Jaderberg et al. (2014) has an exact closed-form solution which is the global optimum. Hence we obtain the best data-independent approximation. Furthermore, computing the exact solution is much more effective than iterative schemes. As the tensor decomposition is the most important step in approximating CNNs, being able to obtain an exact solution efficiently provides great advantages.
A new method for training low-rank constrained CNNs from scratch. Most previous works only focus on improving the test-time computation cost. This is achieved by approximating and fine-tuning a pre-trained network. Based on the low-rank tensor decomposition, we find that the convolutional kernels can be parameterized in a way that naturally enforces the low-rank constraint. Networks parameterized in this low-rank constrained manner have more layers than their non-constrained counterparts. While it is widely observed that deeper networks are harder to train, we are able to train very deep low-rank constrained CNNs with more than 30 layers with the help of a recent training technique called batch normalization (Ioffe & Szegedy, 2015).
Evaluation on large networks. Previous experiments in Jaderberg et al. (2014) and Denton et al. (2014) show some promise for the effectiveness of low-rank approximations. But these methods have not been tested extensively for large models and generic datasets. Moreover, as iterative methods are used to find the approximation, bad local minima may hurt performance. In this paper, we test the proposed method on various state-of-the-art CNN models, including NIN (Lin et al., 2013), AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2014) and GoogLeNet (Szegedy et al., 2014). The datasets used include CIFAR-10 and ILSVRC12. We achieved significant speedups for these models with comparable or even better performance. Success on a variety of CNN models gives strong evidence that low-rank tensor decomposition can be a very useful tool for simplifying and improving deep CNNs.
Our numerical experiments show that a significant speedup can be achieved with minimal loss of performance, which is consistent with previously reported results. Surprisingly, while all previous efforts report a slight decrease or no change in performance, we found a significant increase in classification accuracy in some cases. In particular, on the CIFAR-10 dataset, we achieve 91.31% classification accuracy (without data augmentation) with the low-rank NIN model, which improves upon not only the original NIN but also upon state-of-the-art results on this dataset. We are not aware of significant improvements with low-rank approximations being reported in the previous literature.
The rest of the paper is organized as follows. We discuss some related work in Section 2. We then introduce our decomposition scheme in Section 3. Results with typical networks including AlexNet, NIN, VGG and GoogLeNet on the CIFAR-10 and ILSVRC12 datasets are reported in Section 4. We conclude with a summary and discussion in Section 5.
Using low-rank filters to accelerate convolution has a long history. Classic examples include high-dimensional DCT and wavelet systems constructed from 1D wavelets using tensor products. In the context of dictionary learning, learning separable 1D filters was suggested by Rigamonti et al. (2013).
More specific to CNNs, two works are most related to ours: Jaderberg et al. (2014) and Lebedev et al. (2014). For Jaderberg et al. (2014), in addition to the improvements summarized in the previous section, there is another difference in the approximation stage. In Jaderberg et al. (2014), the network is approximated layer by layer. After one layer is approximated by the low-rank filters, the parameters of that layer are fixed, and the layers above are fine-tuned based on a reconstruction error criterion. Our scheme fine-tunes the entire network simultaneously using a discriminative criterion. While Jaderberg et al. (2014) reported that discriminative fine-tuning was inefficient for their scheme, we found that it works very well in our case.
In Lebedev et al. (2014), CP decomposition of the kernel tensors is proposed, computed with nonlinear least squares. It is also based on the tensor decomposition idea, but our decomposition is based on a different scheme and has some numerical advantages. For the CP decomposition, finding the best low-rank approximation is an ill-posed problem, and the best rank-$K$ approximation may not exist in the general case, regardless of the choice of norm (de Silva & Lim, 2008). But for the proposed scheme, the decomposition always exists, and we have an exact closed-form solution for the decomposition. In principle, both the CP decomposition scheme and the proposed scheme can be used to train CNNs from scratch. In the CP decomposition, one convolutional layer is replaced with four convolutional layers. Although the effective depth of the network remains the same, this makes optimization much harder as the gradients of the inserted layers are prone to explosion. Because of this, application of this scheme to larger and deeper models is still problematic due to numerical issues.
Lastly, different from both, we consider more and much larger models, which is more challenging. Thus our results provide strong evidence that low-rank approximations can be applicable to a variety of state-of-the-art models.
In line with the method in Jaderberg et al. (2014), the proposed tensor decomposition scheme is based on a conceptually simple idea: replace the 4D convolutional kernel with two consecutive kernels with a lower rank. In the following, we introduce the details of the decomposition and the algorithms for using the decomposition to approximate a pre-trained network and to train a new one.
Formally, a convolutional kernel in a CNN is a 4D tensor $W \in \mathbb{R}^{N \times C \times d \times d}$, where $N$ and $C$ are the numbers of the output and input feature maps respectively and $d$ is the spatial kernel size. We also view $W$ as a 3D filter array and use the notation $W_n \in \mathbb{R}^{C \times d \times d}$ to represent the $n$-th filter. Let $Z \in \mathbb{R}^{C \times X \times Y}$ be the input feature map. The output feature map $F = W * Z$ is defined as

$$F_n(x, y) = \sum_{i=1}^{C} \sum_{x'=1}^{X} \sum_{y'=1}^{Y} Z^i(x', y')\, W_n^i(x - x', y - y'),$$

where the superscript is the index of the channels.

The goal is to find an approximation $\tilde{W}$ of $W$ that facilitates more efficient computation while maintaining the classification accuracy of the CNN. We propose the following scheme:

$$\tilde{W}_n^c = \sum_{k=1}^{K} H_n^k (V_k^c)^T, \tag{1}$$

where $K$ is a hyperparameter controlling the rank, $H \in \mathbb{R}^{N \times 1 \times d \times K}$ is the horizontal filter, and $V \in \mathbb{R}^{K \times d \times 1 \times C}$ is the vertical filter (we have slightly abused the notation to keep it concise: $H_n^k$ and $V_k^c$ are both vectors in $\mathbb{R}^d$). Both $H$ and $V$ are learnable parameters.

With this form, the convolution becomes:

$$\tilde{W}_n * Z = \sum_{c=1}^{C} \sum_{k=1}^{K} H_n^k (V_k^c)^T * Z^c = \sum_{k=1}^{K} H_n^k * \left( \sum_{c=1}^{C} V_k^c * Z^c \right). \tag{2}$$

The intuition behind this approximation scheme is to exploit the redundancy that exists both in the spatial dimensions and across channels. Note that the convolutions in the above equation are all one-dimensional in space.

We can estimate the reduction in computation with this scheme. Direct convolution by definition requires $O(d^2 N C X Y)$ operations. In the above scheme, the computational cost associated with the vertical filters is $O(d K C X Y)$ and with the horizontal filters $O(d N K X Y)$, giving a total computational cost of $O(d K (N + C) X Y)$. Acceleration can be achieved if we choose $K < dNC / (N + C)$. In principle, if $C \ll N$, which is typical in the first layer of a CNN, the acceleration is about $d$ times.

We learn the approximating parameters $H$ and $V$ by a two-step strategy. In the first step, we approximate the convolution kernel in each layer by minimizing $\|\tilde{W} - W\|_F$ (the layer index is omitted for notational simplicity). Note that this step can be done in parallel as there is no inter-layer dependence. Then we fine-tune the whole CNN based on the discriminative criterion of restoring classification accuracy.
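The operation counts above can be turned into a small calculator. The following is a minimal sketch, not taken from the paper: the function names and the example layer sizes ($N = 96$, $C = 3$, $d = 11$) are assumptions chosen only for illustration, and constant factors hidden by the $O(\cdot)$ notation are ignored.

```python
def direct_conv_ops(N, C, d, X, Y):
    """Operation count of a direct d x d convolution: d^2 * N * C * X * Y."""
    return d * d * N * C * X * Y


def lowrank_conv_ops(N, C, d, K, X, Y):
    """Operation count of the two-stage low-rank scheme."""
    vertical = d * K * C * X * Y      # C input maps -> K intermediate maps
    horizontal = d * N * K * X * Y    # K intermediate maps -> N output maps
    return vertical + horizontal


def theoretical_speedup(N, C, d, K):
    """Ratio direct / low-rank; the spatial size X * Y cancels out."""
    return (d * N * C) / (K * (N + C))


def largest_accelerating_rank(N, C, d):
    """Largest integer K that still satisfies K < d*N*C / (N + C)."""
    return (d * N * C - 1) // (N + C)


# Hypothetical first-layer sizes, chosen only for illustration.
N, C, d = 96, 3, 11
print(theoretical_speedup(N, C, d, K=3))
print(largest_accelerating_rank(N, C, d))
```

One way to read the "about $d$ times" remark: with $K = C$ the ratio becomes $dN / (N + C)$, which approaches $d$ when $C \ll N$. As the experiments note, measured speedups are usually smaller than this theoretical ratio.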
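Based on the approximation criterion introduced in the previous section, the objective function to be minimized is:

$$E_1(H, V) := \sum_{n, c} \left\| W_n^c - \sum_{k=1}^{K} H_n^k (V_k^c)^T \right\|_F^2. \tag{3}$$

This minimization problem has a closed-form solution, which is summarized in the following theorem; the proof can be found in the appendix. The theorem gives us an efficient algorithm for computing the exact decomposition.

Theorem 1. Define the following bijection that maps a tensor to a matrix, $T: \mathbb{R}^{C \times d \times d \times N} \to \mathbb{R}^{Cd \times dN}$, where the tensor element $(i_1, i_2, i_3, i_4)$ maps to $(j_1, j_2)$ with

$$j_1 = (i_1 - 1)d + i_2, \qquad j_2 = (i_4 - 1)d + i_3.$$

Define $\mathcal{W} := T[W]$, with the axes of $W$ arranged as $C \times d \times d \times N$. Let $\mathcal{W} = U D Q^T$ be the Singular Value Decomposition (SVD) of $\mathcal{W}$. Let

$$\hat{V}_k^c(j) = U_{(c-1)d + j,\, k} \sqrt{D_{k,k}}, \qquad \hat{H}_n^k(j) = Q_{(n-1)d + j,\, k} \sqrt{D_{k,k}}, \tag{4}$$

then $(\hat{H}, \hat{V})$ is a solution to the minimization of (3).

Because of this theorem, we call the filters $(H, V)$ low-rank constrained filters. Note that the solution to the minimization of (3) is not unique. Indeed, if $(H, V)$ is a solution, then $(\alpha H, \frac{1}{\alpha} V)$ is also a solution for any $\alpha \neq 0$, but these solutions are equivalent in our application. An illustration of the closed-form approximation is shown in Figure 1.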
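The closed-form solution of the theorem can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the `(N, C, d, d)` array layout and the random test kernel are assumptions, and the unfolding is written with 0-based indices. The kernel is unfolded into a $Cd \times dN$ matrix, the top $K$ singular triplets are kept, and each singular value is split evenly between the two factors.

```python
import numpy as np


def lowrank_decompose(W, K):
    """Closed-form rank-K factors of a kernel W with shape (N, C, d, d).

    Returns H with shape (N, K, d), V with shape (K, C, d) and the full
    vector of singular values, such that
        W[n, c, a, b] ~= sum_k H[n, k, a] * V[k, c, b].
    """
    N, C, d, _ = W.shape
    # Unfold: rows are indexed by (c, b), columns by (n, a).
    M = W.transpose(1, 3, 0, 2).reshape(C * d, N * d)
    U, s, Qt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:K])
    # Split each singular value evenly between the two factors.
    V = (U[:, :K] * root).T.reshape(K, C, d)
    H = (Qt[:K] * root[:, None]).reshape(K, N, d).transpose(1, 0, 2)
    return H, V, s


def reconstruct(H, V):
    """Rebuild the (N, C, d, d) kernel from its factors."""
    return np.einsum('nka,kcb->ncab', H, V)


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3, 5, 5))   # hypothetical layer sizes
H, V, s = lowrank_decompose(W, K=4)
err = np.sum((W - reconstruct(H, V)) ** 2)
# By the Eckart-Young theorem the squared error equals the discarded energy.
print(np.isclose(err, np.sum(s[4:] ** 2)))
```

Rescaling the factors as `reconstruct(alpha * H, V / alpha)` leaves the reconstructed kernel unchanged, which is the non-uniqueness noted above.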
A different criterion which uses the data distribution is proposed in Denton et al. (2014), but minimization for this criterion is NP-hard. The proof is also included in the appendix.
The algorithm provided by the above theorem is extremely fast. In our experiments, it completes in less than 1 second for most modern CNNs (AlexNet, VGG, GoogLeNet), as they have small convolutional kernels. Iterative algorithms (Denton et al., 2014; Jaderberg et al., 2014) take much longer, especially with the data-dependent criterion. In addition, iterative algorithms often end in bad local minima, which leads to inferior performance even after fine-tuning. The proposed algorithm solves this issue, as it directly provides the global minimum, which is the best data-independent approximation. Numerical demonstrations are given in Section 4.
Using the above scheme to train a new CNN from scratch is conceptually straightforward. Simply parametrize the convolutional kernels to be of the form in (1), and the rest is not very different from training a non-constrained CNN. Here the horizontal and vertical filters $H$ and $V$ are the trainable parameters. As each convolutional layer is parametrized as the composition of two convolutional layers, the resulting CNN has more layers than the original one. Although the effective depth of the new CNN is not increased, the additional layers make numerical optimization much more challenging due to exploding and vanishing gradients, especially for large networks. To handle this problem, we use a recent technique called Batch Normalization (BN) (Ioffe & Szegedy, 2015). The BN transform normalizes the activations of the internal hidden units, hence it can be an effective way to deal with exploding or vanishing gradients. It is reported in Ioffe & Szegedy (2015) that deeper networks can be trained with BN successfully, and larger learning rates can be used. Empirically, we find BN effective in learning the low-rank constrained networks. An illustration of the transformation of an original convolutional layer into a low-rank constrained one is given in Figure 2. More details can be found in the numerical experiments section.
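The statement that each convolutional layer becomes a composition of two convolutional layers can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' training code: `correlate_valid` is a hypothetical naive helper, the sizes are arbitrary, cross-correlation is used in place of convolution (as CNN frameworks commonly do), and batch normalization and nonlinearities are omitted. It confirms that a $1 \times d$ layer followed by a $d \times 1$ layer gives the same output as a single layer whose kernel has the low-rank form.

```python
import numpy as np


def correlate_valid(Z, W):
    """Multi-channel 'valid' cross-correlation.

    Z has shape (C, X, Y), W has shape (N, C, dx, dy);
    the result has shape (N, X - dx + 1, Y - dy + 1).
    """
    N, C, dx, dy = W.shape
    _, X, Y = Z.shape
    out = np.zeros((N, X - dx + 1, Y - dy + 1))
    for i in range(dx):
        for j in range(dy):
            # Contribution of kernel tap (i, j), summed over input channels.
            patch = Z[:, i:i + X - dx + 1, j:j + Y - dy + 1]
            out += np.einsum('nc,cxy->nxy', W[:, :, i, j], patch)
    return out


rng = np.random.default_rng(0)
N, C, K, d = 4, 3, 2, 3                  # hypothetical sizes
H = rng.standard_normal((N, K, d))       # second-stage d x 1 filters
V = rng.standard_normal((K, C, d))       # first-stage 1 x d filters
Z = rng.standard_normal((C, 7, 6))       # input feature map

# One layer with the rank-K kernel W[n, c, a, b] = sum_k H[n, k, a] * V[k, c, b].
W_lowrank = np.einsum('nka,kcb->ncab', H, V)
one_layer = correlate_valid(Z, W_lowrank)

# Two layers: C -> K maps with 1 x d kernels, then K -> N maps with d x 1 kernels.
hidden = correlate_valid(Z, V[:, :, None, :])
two_layers = correlate_valid(hidden, H[:, :, :, None])

print(np.allclose(one_layer, two_layers))
```

When training from scratch, `H` and `V` would be the trainable arrays; which spatial axis is called horizontal or vertical is only a naming convention in this sketch.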
In this section, we evaluate the proposed scheme on the CIFAR-10 and the ILSVRC12 datasets with several CNN models.
The CIFAR-10 dataset is small by today's standards, but it is a good testbed for new ideas. We use two baseline models: one is a customized CNN and the other is the NIN model. We compare their performance with their corresponding low-rank constrained versions. All models on this dataset are learned from scratch.
Table 2: CIFAR-10 classification error and speedup.

| METHOD | ERROR WITHOUT AUG. | ERROR WITH AUG. | SPEEDUP |
|---|---|---|---|
| CNN (ours) | 15.12% | 12.62% | 1 |
| Low-rank CNN (ours) | 14.50% | 13.10% | 2.9 |
| CNN + Dropout (ours) | 13.90% | 12.29% | - |
| Low-rank CNN + Dropout (ours) | 13.81% | 11.41% | - |
| NIN (ours) | 10.12% | 8.19% | 1 |
| Low-rank NIN (ours) | 8.69% | 6.98% | 1.5 |
| CNN + Maxout (Goodfellow et al., 2013) | 11.68% | 9.38% | - |
| NIN (Lin et al., 2013) | 10.41% | 8.81% | - |
| CNN (Srivastava et al., 2014) | 12.61% | - | - |
| NIN + APL units (Agostinelli et al., 2014) | 9.59% | 7.51% | - |
The configurations of the baseline models and their low-rank counterparts are outlined in Table 1. We substitute every single convolutional layer in the baseline models with two convolutional layers with the parameter $K$ introduced in the previous section. All other specifications of the network pairs are the same. A Rectified Linear Unit (ReLU) is applied to every layer except for the last one. Our implementation of the NIN model is slightly different from the one introduced in Lin et al. (2013). We did not replace the $1 \times 1$ convolutional layers because these layers only constitute a small fraction of the total execution time. Hence the efficiency gain of factorizing them is small.

The networks are trained with back propagation to optimize the multinomial logistic regression objective. The learning rate is decreased by a constant factor every time the validation error stops decreasing. Some models have dropout units inserted after every ReLU. For exact specifications of the parameters, the reader may check https://github.com/chengtaipu/lowrankcnn. We evaluated the performance of the models both with and without data augmentation. With data augmentation, the images are randomly flipped horizontally and translated in both directions by at most 1 pixel. Otherwise, we only subtract the mean of the images and normalize each channel. The results are listed in Table 2.

The performance of the low-rank constrained versions of both networks is better than that of the baseline networks, with and without data augmentation. Notably, the low-rank NIN model outperforms the baseline NIN model by more than 1%. As far as we know, this is also better than previously published results.
We then study how the empirical performance and speedup change as we vary the rank $K$. We choose the CNN + Dropout model as the baseline, with the data augmentation described above. The results are listed in Table 3.
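Table 3: Accuracy change, speedup and weight reduction on CIFAR-10 when varying the rank of one layer at a time ($K_1$, $K_2$, $K_3$ are the ranks of the first, second and third convolutional layers).

| LAYER | $K_1$ | $K_2$ | $K_3$ | ACCURACY CHANGE | SPEEDUP (LAYER) | SPEEDUP (NET) | REDUCTIONS (WEIGHTS) |
|---|---|---|---|---|---|---|---|
| First | 4 | 64 | 256 | +0.69% | 1.20 | 2.91 | 3.5 |
| First | 8 | 64 | 256 | +0.85% | 1.13 | 2.87 | 1.8 |
| First | 12 | 64 | 256 | +0.94% | 1.05 | 2.85 | 1.2 |
| Second | 12 | 8 | 256 | -0.02% | 7.13 | 3.21 | 47.5 |
| Second | 12 | 16 | 256 | +0.50% | 6.76 | 3.21 | 23.8 |
| Second | 12 | 32 | 256 | +0.89% | 6.13 | 3.13 | 12.0 |
| Second | 12 | 64 | 256 | +0.94% | 3.72 | 2.86 | 6.0 |
| Second | 12 | 128 | 256 | +1.32% | 2.38 | 2.58 | 3.0 |
| Second | 12 | 256 | 256 | +1.40% | 1.25 | 1.92 | 1.5 |
| Third | 12 | 64 | 8 | -2.25% | 6.98 | 3.11 | 52.5 |
| Third | 12 | 64 | 16 | +0.21% | 6.89 | 3.11 | 26.4 |
| Third | 12 | 64 | 32 | +0.19% | 5.82 | 3.10 | 13.3 |
| Third | 12 | 64 | 64 | +0.19% | 3.74 | 2.96 | 6.7 |
| Third | 12 | 64 | 128 | +0.94% | 2.38 | 2.86 | 3.3 |
| Third | 12 | 64 | 256 | +1.75% | 1.31 | 2.30 | 1.7 |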
The number of parameters in the network can be reduced by a large factor, especially for the second and third layers. A speedup of up to about 7× for a specific layer and of 2-3× for the whole network can be achieved. In practice, it is difficult for the speedup to match the theoretical gains based on the number of operations, which is roughly proportional to the reduction of parameters. The actual gain also depends on the software and hardware optimization strategies for convolutions. Our results in Table 3 are based on Nvidia Titan GPUs and Torch 7 with the cuDNN backend.
Interestingly, even with significant reductions in the number of parameters, the performance does not decrease much. Most of the networks listed in Table 3 even outperform the baseline model. When the low-rank constraints are applied to all convolutional layers, the total number of parameters in the convolutional layers can be reduced by a large factor without degrading performance much. For example, in one such configuration the parameters in the convolutional kernels are reduced by 91% and the relative performance is +0.25%.
Nevertheless, the parameters in the fully connected layers still occupy a large fraction. This limits the overall compression ability of the low-rank constraint. Some very recent works focus on reducing the parameters in the fully connected layers (Novikov et al., 2015); combining these techniques with the proposed scheme will be explored in future research.
ILSVRC12 (Russakovsky et al., 2015) is a well-known large-scale benchmark dataset for image classification. We adopt three famous CNN models as our baselines: AlexNet (Krizhevsky et al., 2012), with CaffeNet (Jia et al., 2014) as a variant; VGG-16 (Simonyan & Zisserman, 2014); and GoogLeNet (Szegedy et al., 2014), with BN-Inception (Ioffe & Szegedy, 2015) as a variant. The CaffeNet and VGG-16 models are directly downloaded from Caffe's model zoo and then fine-tuned on the training set until convergence, while the BN-Inception model is trained from scratch by ourselves.
The introduced low-rank decomposition is applied to each convolutional layer that has a kernel size greater than $1 \times 1$. Input images are first warped to a fixed size and then cropped to the input size of each model. We use the single center crop during the testing stage, and evaluate the performance by the top-5 accuracy on the validation set. Detailed training parameters are available at https://github.com/chengtaipu/lowrankcnn.
As before, the hyperparameter $K$ controls the trade-off between the speedup factor and the classification performance of the low-rank models. Therefore, we first study its effect for each layer, and then use the information to configure the whole low-rank model for better overall performance. We decompose a specific layer with a different $K$ each time, while keeping the parameters of all the other layers fixed. The performance after fine-tuning with respect to the theoretical layer speedup is shown in Figure 4. In general, we choose for each layer the value of $K$ that most accelerates the forward computation without hurting the performance significantly. A more automatic way of choosing $K$ is based on the eigengap, such that the first $K$ eigenvectors account for 95% of the variations. This is similar to choosing the number of principal components in PCA. The detailed low-rank model structures are listed in Table 4.
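The 95% rule mentioned above can be sketched as follows. This is only one plausible reading, assumed here for illustration: "variations" is interpreted as the sum of squared singular values of the unfolded kernel matrix, as in PCA explained variance, and the function name `rank_for_energy` and the example values are hypothetical.

```python
import numpy as np


def rank_for_energy(singular_values, fraction=0.95):
    """Smallest K whose leading singular values hold `fraction` of the energy.

    Energy is taken to be the sum of squared singular values (an assumption,
    by analogy with explained variance in PCA).
    """
    s2 = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(s2) / np.sum(s2)
    # First index at which the cumulative share reaches the target.
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(s2))


# Hypothetical singular values of an unfolded kernel, in decreasing order.
print(rank_for_energy([3.0, 2.0, 1.0, 0.5]))
```

The singular values would typically come from the SVD of the unfolded kernel of the layer being decomposed, so the rank can be chosen per layer without any fine-tuning runs.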



Table 4: Low-rank models for ILSVRC12. For VGG-16, each convolution module contains two or three sub-convolutional layers. For GoogLeNet, each inception module contains one single and two consecutive convolutional layers. Their corresponding values of $K$ are shown in a cell for brevity.

The proposed closed-form solution provides the optimal data-independent initialization to the low-rank model. As indicated in Figure 4, there is a performance gap between the low-rank models and their baselines at the beginning, but the performance is restored after fine-tuning. It is claimed in Denton et al. (2014) that the data-dependent criterion leads to better performance. We found that this is true upon approximation, but after fine-tuning, the difference between the two criteria is negligible.
Finally, we compare the low-rank models with their baselines in terms of classification performance, as well as time and space consumption. The results are summarized in Table 5. We can see that all the low-rank models achieve comparable performance. Those initialized with the closed-form weight approximation (cf. approximation rows in Table 5) are slightly inferior to their baselines, while the low-rank AlexNet trained from scratch with BN achieves even better performance. This observation again suggests that the low-rank CNN structure could have better discriminative power and generalization ability. On the other hand, both the running time and the number of parameters are consistently reduced. Note that the large gaps between the theoretical and the actual speedup are mainly due to the CNN implementations, and the current BN operations significantly slow down the forward computation. This suggests room for accelerating the low-rank models by designing specific numerical algorithms.
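Table 5: Top-5 accuracy on the ILSVRC12 validation set.

| METHOD | TOP-5 VAL. ACCURACY |
|---|---|
| AlexNet (original) | 80.03% |
| Low-rank (cf. approximation) | 79.66% |
| Low-rank (from scratch with BN) | 80.56% |
| VGG-16 (original) | 90.60% |
| Low-rank (cf. approximation) | 90.31% |
| GoogLeNet (original) | 92.21% |
| Low-rank (cf. approximation) | 91.79% |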
In this paper, we explored using tensor decomposition techniques to speed up convolutional neural networks. We have introduced a new algorithm for computing the low-rank tensor decomposition and a new method for training low-rank constrained CNNs from scratch. The proposed method is evaluated on a variety of modern CNNs, including AlexNet, NIN, VGG and GoogLeNet, with success. This gives strong evidence that low-rank tensor decomposition can be a generic tool for speeding up large CNNs.
On the other hand, the interesting fact that the low-rank constrained CNNs sometimes outperform their non-constrained counterparts points to two things. One is the local minima issue. Although the expressive power of low-rank constrained CNNs is strictly smaller than that of the non-constrained ones, we have observed in some cases that the former have smaller training error. This seems to suggest that the low-rank form helps the CNNs begin with a better initialization and settle at a better local minimum. The other issue is overfitting. This is shown by the observation that in many cases the constrained model has higher training error but generalizes better. Overall, this suggests room for improvement in both the numerical algorithms and the regularization of CNN models.
This work is supported in part by the 973 project 2015CB856000 of the Chinese Ministry of Science and Technology and the DOE grant DE-SC0009248.