1 Introduction
Deep Neural Networks (DNNs) have achieved record-breaking accuracy in many image classification tasks [16][24][25][10]. With advances in algorithms, the availability of databases, and improvements in hardware performance, the depth of DNNs has grown dramatically from a few to hundreds or even thousands of layers, enabling human-level performance [9]. However, deploying these large models on resource-limited platforms, e.g., mobile devices and autonomous cars, is very challenging because of their high demand for computation resources and hence energy.
Recently, many techniques for accelerating the testing process of deployed DNNs have been studied, such as weight sparsifying or connection pruning [8][7][28][23][22][6][19]. These approaches require delicate hardware customization and/or software design to translate sparsity into practical speedup. Unlike sparsity-based methods, Low-Rank Approximation (LRA) methods [22][4][5][12][11][26][27][18][30][14] directly decompose an original large model into a compact model with more lightweight layers. Thanks to the redundancy (correlation) among filters in DNNs, the original weight tensors can be approximated by a very low-rank basis. From the viewpoint of matrix computation, LRA approximates a large weight matrix by the product of two or more small ones to reduce computation complexity.
Figure 1: The low-rank basis of filters in the first layer of the convolutional neural network [16] on CIFAR-10. The low-rank basis is formed by the most significant principal filters that are obtained by PCA. Top: the low-rank basis of the original network. Bottom: the low-rank basis of the same network after applying Force Regularization. The number of red boxes indicates the required rank to reconstruct the original filters with error.
Previous LRA methods mostly focus on how to decompose the pretrained weight tensors to maximize the reduction of computation complexity while retaining the classification accuracy. Instead, we propose to nudge the weights by additional gradients (attractive forces) to coordinate the filters to a more correlated state. Our approach aims to improve the correlation among filters and therefore obtain more lightweight DNNs through LRA. To the best of our knowledge, this is the first work to train DNNs toward lower-rank space such that LRA can achieve faster DNNs.
The motivation of this work is fundamental. It has been shown that trained filters are highly clustered and correlated [5][4][12]. Suppose each filter is reshaped as a vector. A cluster of highly correlated vectors will then have small included angles. If we are able to coordinate these vectors toward a state with smaller included angles, the correlation of the filters within that cluster improves. Consequently, LRA can produce a DNN with lower ranks and higher computation efficiency.
We propose Force Regularization to coordinate filters in DNNs. As demonstrated in Fig. 1, when using the same LRA method, say, cross-filter Principal Component Analysis (PCA) [30], applying Force Regularization can greatly reduce the required ranks from the original design (i.e., vs. ), while keeping the same approximation errors (). As we shall show in Section 5, applying Force Regularization in the training of state-of-the-art DNNs successfully obtains lower-rank DNNs and thus improves computation efficiency, e.g., speedup for AlexNet with small accuracy loss.
The contributions of our work include: (1) We propose an effective and easy-to-implement Force Regularization to train DNNs for lower-rank approximation. To the best of our knowledge, this is the first work to manipulate the correlation among filters during training such that LRA can achieve faster DNNs; (2) DNNs manipulated by Force Regularization provide a better initialization for the retraining of LRA-decomposed DNNs, resulting in faster convergence to better accuracy; (3) The lightweight DNNs that have been aggressively compressed by our method can be further sparsified. That is, our method can be integrated with state-of-the-art sparsity-based methods to potentially achieve faster computation; (4) Force Regularization can be easily generalized to Discrimination Regularization, which learns more discriminative filters to improve classification accuracy; (5) Our implementation is open-source on both CPUs and GPUs.
2 Related work
Low-rank approximation. LRA methods decompose a large model into a compact one with more lightweight layers by weight/tensor factorization. Denil et al. [4] studied different dictionaries to remove the redundancy between filters and channels in DNNs. Jaderberg et al. [12] explored filter and data reconstruction optimizations to attain an optimal separable basis. Denton et al. [5] clustered filters, extended LRA (e.g., Singular Value Decomposition, SVD) to larger-scale DNNs, and achieved speedup for the first two layers with accuracy loss. Many new decomposition methods were proposed [11][26][18][30] and the effectiveness of LRA in state-of-the-art DNNs was evaluated [24][25]. Similar evaluations on mobile devices were also reported [14][27]. Unlike them, we propose Force Regularization to coordinate DNN filters to more correlated states, in which lower-rank or more compact DNNs are achievable for faster computation.
Sparse deep neural networks. The studies on sparse DNNs can be categorized into two types: non-structured [20][23][22][8][6] and structured [28][21][19][1] sparsity methods. The first category prunes each connection independently. Consequently, sparse weights are randomly distributed. The level of non-structured sparsity is usually insufficient to achieve good practical speedup on modern hardware [28][19]. Software optimization [23][22] and hardware customization [7] have been proposed to overcome this issue. Conversely, the structured approaches prune connections group by group, such that the sparsified DNNs have a regular distribution of sparse weights. The regularity is friendly to modern hardware for acceleration. Our work is orthogonal to sparsity-based methods. More importantly, we find that DNNs accelerated by our method can be further sparsified by both non-structured and structured sparsity methods, potentially achieving faster computation.
3 Correlated Filters and Their Approximation
The prior knowledge is that correlation exists among trained filters in DNNs and that those filters lie in a low-rank space. For example, the color-agnostic filters [16] learned in the first layer of AlexNet lie in a hyperplane, where RGB channels at each pixel have the same value. Fig. 2 presents the results of Linear Discriminant Analysis (LDA) of the first convolutional filters in AlexNet and GoogLeNet. The filters are normalized to unit vectors and colored into four clusters by k-means clustering, and then projected to 2D space by LDA to maximize cluster separation. The figure indicates high correlation among filters within a cluster. A naïve approach to filter approximation is to use the centroid of a cluster to approximate the filters within that cluster; thus, the number of clusters is the rank of the space. Essentially, k-means clustering is an LRA method [2], although we will later show that other LRA methods can give better approximation. The motivation of this work is that if we are able to nudge filters during training such that the filters within a cluster are coordinated closer and some adjacent clusters are even merged into one cluster, then more accurate filter approximation using lower rank can be achieved. We propose Force Regularization to realize it.
Before introducing Force Regularization, we first mathematically formulate LRA of DNN filters. Theoretically, almost all LRA methods can gain lower-rank approximation upon our method because filters are coordinated to a more correlated state. Instead of onerously replicating all of these LRA methods, we choose cross-filter approximation [4][30] and a state-of-the-art work in [26] as our baselines.
Fig. 3 illustrates the cross-filter approximation of a convolutional layer. We assume all weights in a convolutional layer form a tensor , where and are the numbers of filters and input channels, and and are the spatial height and width of the filters, respectively. With input feature map , the th output feature map , where is the th filter. Because of the redundancy (or correlation) across the filters [4], tensor can be approximated by a linear combination of the basis of a low-rank space , such as
(1) 
where is a scalar, and is the feature map generated by basis filter . Therefore, the output feature map is a linear combination of which can be interpreted as the feature map basis. Since the linear combination is essentially a convolution, the convolutional layer can be decomposed into two sequential lightweight convolutional layers as shown in Fig. 3. The original computation complexity is , where and are the height and width of the output feature maps, respectively. After applying cross-filter LRA, the computation complexity is reduced to . The computation complexity decreases when the rank .
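The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it swaps in a truncated SVD of the reshaped filter matrix as the LRA method, and the names `cross_filter_lra`, `mac_counts`, and all tensor sizes are hypothetical. The symbols are assumptions as well, since the inline notation is missing here: N filters, C input channels, H×W kernels, rank M, and H'×W' output maps. Under those assumptions the multiply-accumulate count drops from N·C·H·W·H'·W' to M·C·H·W·H'·W' + N·M·H'·W', which is a saving whenever M < N·C·H·W / (C·H·W + N).

```python
import numpy as np

def cross_filter_lra(weights, rank):
    """Approximate an (N, C, H, W) filter tensor by `rank` basis filters.

    Returns (basis, coeffs): basis has shape (rank, C, H, W) and coeffs has
    shape (N, rank), so that filter n ~= sum_m coeffs[n, m] * basis[m].
    Truncated SVD is used here as a stand-in LRA method (an assumption).
    """
    n, c, h, w = weights.shape
    flat = weights.reshape(n, c * h * w)             # one row per filter
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]                  # mixes basis feature maps
    basis = vt[:rank].reshape(rank, c, h, w)         # lightweight first layer
    return basis, coeffs

def mac_counts(n, c, h, w, out_h, out_w, rank):
    """Multiply-accumulate counts before and after the decomposition."""
    original = n * c * h * w * out_h * out_w
    decomposed = rank * c * h * w * out_h * out_w + n * rank * out_h * out_w
    return original, decomposed

rng = np.random.default_rng(0)
# Hypothetical layer: 32 filters that truly live in an 8-dimensional subspace.
true_basis = rng.standard_normal((8, 3 * 5 * 5))
filters = (rng.standard_normal((32, 8)) @ true_basis).reshape(32, 3, 5, 5)

basis, coeffs = cross_filter_lra(filters, rank=8)
approx = np.tensordot(coeffs, basis, axes=(1, 0))    # back to (32, 3, 5, 5)
rel_err = np.linalg.norm(approx - filters) / np.linalg.norm(filters)
orig_macs, lra_macs = mac_counts(32, 3, 5, 5, 28, 28, rank=8)
```

In a network, `basis` would become the weights of the first (smaller) convolution and `coeffs` the weights of a following 1×1 convolution that recombines the basis feature maps.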
4 Force Regularization
4.1 Regularization by Attractive Forces
This section proposes Force Regularization from the perspective of physics. It is a gradient-based approach that adds extra gradients to the data loss gradients. The data loss gradients aim to minimize classification error as in traditional DNNs. The extra gradients introduced by Force Regularization gently adjust the lengths and directions of the data loss gradients so as to nudge filters to a more correlated state. With a well-chosen hyperparameter, our method can coordinate more useful information of the filters to a lower-rank space while maintaining accuracy. Inspired by Newton’s Laws, we propose an intuitive, computation-efficient and effective Force Regularization that uses attractive forces to coordinate filters.
Force Regularization: As illustrated in Fig. 4, suppose the filter is reshaped as a vector and normalized as , with their origin at . We introduce the pairwise attractive force on generated by . The gradient of Force Regularization to update filter is defined as
(2) 
where is the Euclidean norm. The regularization gradient in Eq. (2) is perpendicular to filter vector and can be efficiently computed by addition and multiplication. The final weight update by gradient descent is
(3) 
where is data loss, is learning rate and is the coefficient of Force Regularization to trade off the rank and accuracy. We select by cross-validation in this work. The gradient of common weight-wise regularization (e.g., norm) is omitted in Eq. (3) for simplicity.
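The displayed forms of Eqs. (2) and (3) are not legible in this copy. A form consistent with the properties stated in the text (the regularization gradient is perpendicular to the filter, its step size scales with the filter length, and it enters the update through a trade-off coefficient) would be the following. Every symbol is an assumed name rather than the authors' confirmed notation: $\mathbf{w}_i$ is the reshaped filter, $\hat{\mathbf{w}}_i$ its normalized version, $\mathbf{f}_{ji}$ the attractive force on filter $i$ generated by filter $j$, $E$ the data loss, $\eta$ the learning rate, and $\lambda_s$ the Force Regularization coefficient.

```latex
\Delta \mathbf{w}_i
  = \|\mathbf{w}_i\| \sum_{j=1}^{N}
    \left( \mathbf{f}_{ji} - \left( \mathbf{f}_{ji}^{\top} \hat{\mathbf{w}}_i \right) \hat{\mathbf{w}}_i \right),
  \qquad
  \hat{\mathbf{w}}_i = \frac{\mathbf{w}_i}{\|\mathbf{w}_i\|}

\mathbf{w}_i \leftarrow \mathbf{w}_i
  - \eta \left( \frac{\partial E}{\partial \mathbf{w}_i} - \lambda_s \, \Delta \mathbf{w}_i \right)
```

The term in parentheses in the first line removes the component of each force along $\hat{\mathbf{w}}_i$, which is what makes the result perpendicular to the filter.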
Fig. 4 intuitively explains our method. Suppose each vector is a rigid stick and there is a particle fixed at the endpoint. The particle has unit mass, and the stick is massless and can freely spin around the origin. Given the pairwise attractive forces (e.g., universal gravitation) , Eq. (2) is the acceleration of particle . As the forces are attractive, neighboring particles tend to spin around the origin to assemble together. Although our regularizer seems to collapse all particles to one point, which is the rank-one space for the most lightweight DNNs, the gradients of data loss prevent this. More specifically, pretrained filters orient to discriminative directions . In each direction , there are some correlated filters as observed in Fig. 2. During the subsequent retraining with our regularizer, regularization gradients coordinate a cluster of filters closer to a typical direction , but data loss gradients keep them from collapsing together so as to maintain the filters’ capability of extracting discriminative features. If all filters could be collapsed toward one point while maintaining classification accuracy, it would imply that the filters are over-redundant and that we can attain a very efficient DNN by decomposing it to a rank-one space.
We derive the Force Regularization gradient from the normalized filters based on the following facts: (1) A normalized filter is on the unit hypersphere, and its orientation is the only free parameter we need to optimize; (2) The gradient of can be easily scaled by the vector length without changing the angular velocity.
In Eq. (2), is the force function related to distance. We study ℓ2-norm Force
(4) 
and ℓ1-norm Force
(5) 
in this work. We define the force of Eq. (4) as ℓ2-norm Force because the strength linearly decreases with the distance , just as the gradient of the ℓ2-norm regularization does. We name the force of Eq. (5) as ℓ1-norm Force because the gradient is a constant unit vector regardless of the distance, just as the gradient of the sparsity-inducing ℓ1-norm regularization is.
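The two force functions and the resulting regularization gradient can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' Caffe implementation: the distance-proportional force is taken to be the difference of the normalized filters, the constant-strength force is that difference rescaled to unit length, and the function names, learning rate, and coefficient value are all hypothetical.

```python
import numpy as np

def l2_force(diff):
    """Force whose strength is proportional to distance (assumed Eq. (4) form)."""
    return diff

def l1_force(diff, eps=1e-12):
    """Unit-strength force along the difference vector (assumed Eq. (5) form)."""
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, eps)             # zero self-difference stays zero

def force_gradient(filters, force=l2_force):
    """Force Regularization gradient for every filter.

    filters: array of shape (N, D), one reshaped filter per row.
    Returns an (N, D) array; row i is perpendicular to filter i.
    """
    lengths = np.linalg.norm(filters, axis=1, keepdims=True)
    unit = filters / lengths
    # diff[i, j] = unit[j] - unit[i]: direction from filter i toward filter j.
    diff = unit[None, :, :] - unit[:, None, :]
    total = force(diff).sum(axis=1)                  # net force on each filter
    radial = (total * unit).sum(axis=1, keepdims=True) * unit
    return lengths * (total - radial)                # tangential part, scaled by length

def sgd_step(filters, data_grad, lr=0.01, lam=0.003, force=l2_force):
    """One update: data loss gradient minus the attractive force term.

    lr and lam are placeholder values, not the paper's settings.
    """
    return filters - lr * (data_grad - lam * force_gradient(filters, force))

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 27))                     # six hypothetical 3x3x3 filters
g2 = force_gradient(W, l2_force)
g1 = force_gradient(W, l1_force)
```

Because only the tangential component is kept, each row of `g2` and `g1` has a zero inner product with the corresponding filter, so the regularizer rotates filters toward their neighbors without directly pulling them along their own direction.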
4.2 Mathematical Implications
Table 1: Ranks vs. step-size scalers for the ConvNet.
Scaler | Error | conv1* | conv2 | conv3
0 (baseline) | 18.0% | 17/32 | 27/32 | 55/64
 | 17.9% | 15/32 | 22/32 | 30/64
 | 18.0% | 16/32 | 27/32 | 32/64
* The first convolutional layer.
Table 2: The rank in each layer of ConvNet and AlexNet after Force Regularization.
Net | Force | Top-1 error | conv1 | conv2 | conv3 | conv4 | conv5 | Average rank ratio‡
ConvNet | None (baseline)† | 18.0% | 17/32‡ | 27/32 | 55/64 | – | – | 74.48%
ConvNet | norm | 17.9% | 15/32 | 22/32 | 30/64 | – | – | 54.17%
ConvNet | norm | 18.0% | 17/32 | 25/32 | 20/64 | – | – | 54.17%
AlexNet | None (baseline) | 42.63% | 47/96 | 164/256 | 306/384 | 318/384 | 220/256 | 72.29%
AlexNet | norm | 42.70% | 49/96 | 143/256 | 128/384 | 122/384 | 161/256 | 46.98%
AlexNet | norm | 42.45% | 49/96 | 155/256 | 157/384 | 108/384 | 178/256 | 50.03%
† The baseline without Force Regularization. ‡ /: Low rank over full rank , which is defined as rank ratio.
This section explains the mathematical implications behind Force Regularization: it is related to, but different from, minimizing the sum of pairwise distances between normalized filters.
Theorem 1
Suppose filter is reshaped as a vector and normalized as . For each filter, Force Regularization under norm force has the same gradient direction as regularization , but differs by adapting the step size to the filter’s length, where
(6) 
Proof: Because ,
(7) 
where is a derivative matrix with element
(8) 
Superscripts index the elements in vectors and . is the unit impulse function:
(9) 
Therefore,
(10) 
Substituting Eq. (10) into Eq. (7), we have
(11) 
where . Therefore, Eq. (11) and Eq. (2) have the same direction.
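Since the displayed equations of the proof are missing in this copy, the chain of reasoning can be sketched as follows. The notation is assumed throughout ($\mathbf{w}_i$ for a filter, $\hat{\mathbf{w}}_i$ for its normalized version, superscripts for vector elements, $\delta$ for the unit impulse), the regularizer is assumed to be the sum of squared pairwise distances, and the constant factor depends on how that sum is normalized, so only the direction and the dependence on $\|\mathbf{w}_i\|$ should be read as meaningful.

```latex
R(\mathbf{W}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \left\| \hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j \right\|^2,
\qquad
\frac{\partial \hat{w}_i^{k}}{\partial w_i^{j}}
  = \frac{\delta(j-k) - \hat{w}_i^{j}\, \hat{w}_i^{k}}{\|\mathbf{w}_i\|}

\frac{\partial R}{\partial \hat{\mathbf{w}}_i}
  = 2 \sum_{j=1}^{N} \left( \hat{\mathbf{w}}_i - \hat{\mathbf{w}}_j \right)
  = -2 \sum_{j=1}^{N} \mathbf{f}_{ji},
\qquad
\mathbf{f}_{ji} = \hat{\mathbf{w}}_j - \hat{\mathbf{w}}_i

\frac{\partial R}{\partial \mathbf{w}_i}
  = -\frac{2}{\|\mathbf{w}_i\|} \sum_{j=1}^{N}
    \left( \mathbf{f}_{ji} - \left( \mathbf{f}_{ji}^{\top} \hat{\mathbf{w}}_i \right) \hat{\mathbf{w}}_i \right)
```

Under these assumptions the negative gradient of $R$ is the tangential part of the net force divided by the filter length, whereas the Force Regularization gradient multiplies the same tangential part by the filter length.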
Theorem 1 states that our proposed Force Regularization in Eq. (2) is related to Eq. (11). However, the step size of the gradient in Eq. (2) is scaled by the length of the filter instead of its reciprocal in Eq. (11). This ensures that the filter spins by the same angle regardless of its length and avoids division by zero. Table 1 summarizes the ranks vs. step sizes for the ConvNet [16], which is trained on the CIFAR-10 database without data augmentation. The original ConvNet has , , and filters in each convolutional layer, respectively. The rank is the smallest number of basis filters (in Fig. 3) obtained by PCA with reconstruction error. Therefore, works better than its reciprocal when coordinating filters to a lower-rank space.
Following the same proof procedure, we can easily find that Force Regularization under norm Force has the same conclusion when
(12) 
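The direction claim of Theorem 1 and its constant-strength counterpart can be checked numerically. The sketch below is an illustration under assumptions, not part of the original proof: the regularizers are taken to be the sum of squared pairwise distances (for the distance-proportional force) and the sum of pairwise distances (for the unit-strength force) between normalized filters, their gradients are approximated by central differences, and all function names are hypothetical.

```python
import numpy as np

def pairwise_distance_sum(filters, squared=True):
    """Sum over ordered pairs of (squared) distances between normalized filters."""
    unit = filters / np.linalg.norm(filters, axis=1, keepdims=True)
    dist = np.linalg.norm(unit[:, None, :] - unit[None, :, :], axis=-1)
    return 0.5 * (dist ** 2).sum() if squared else dist.sum()

def numerical_grad(func, x, eps=1e-6):
    """Central-difference gradient of a scalar function of an array."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

def force_gradient(filters, unit_strength=False):
    """Tangential net force on each filter, scaled by the filter length."""
    lengths = np.linalg.norm(filters, axis=1, keepdims=True)
    unit = filters / lengths
    diff = unit[None, :, :] - unit[:, None, :]       # diff[i, j] = unit[j] - unit[i]
    if unit_strength:                                # constant-strength variant
        norms = np.linalg.norm(diff, axis=-1, keepdims=True)
        diff = diff / np.maximum(norms, 1e-12)
    total = diff.sum(axis=1)
    radial = (total * unit).sum(axis=1, keepdims=True) * unit
    return lengths * (total - radial)

def row_cosines(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 12))
grad_sq = numerical_grad(lambda x: pairwise_distance_sum(x, True), W)
grad_abs = numerical_grad(lambda x: pairwise_distance_sum(x, False), W)
cos_l2 = row_cosines(force_gradient(W, False), -grad_sq)
cos_l1 = row_cosines(force_gradient(W, True), -grad_abs)
```

If the assumptions hold, every entry of `cos_l2` and `cos_l1` is numerically one: the force gradient and the negative regularizer gradient point the same way and differ only in step size.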
5 Experiments
5.1 Implementation
Our experiments are performed in Caffe [13] using CIFAR-10 [15] and ILSVRC-2012 ImageNet [3]. Published models are adopted as the baselines: on CIFAR-10, we choose ConvNet without data augmentation [16] and ResNets-20 with data augmentation [10]. We adopt the same shortcut connections as in [28] for ResNets-20. For ImageNet, we use AlexNet and GoogLeNet models trained by Caffe, and report accuracy using only the center crop of images.
Our experiments with Force Regularization show that, with the same maximum iterations, training from the baseline achieves a better tradeoff between accuracy and speedup than training from scratch, because the baseline offers a good initial point for both accuracy and filter correlation. During the training with Force Regularization on CIFAR-10, we use the same base learning rate as the baseline; while in ImageNet, base learning rate of the baseline is adopted.
5.2 Rank Analysis of Coordinated DNNs
In light of the various low-rank approximation methods, and without loss of generality, we first adopt Principal Component Analysis (PCA) [30][22] to evaluate the effectiveness of Force Regularization. Specifically, the filter tensor can be reshaped to a matrix , the rows of which are the reshaped filters . PCA minimizes the least-squares reconstruction error when projecting a column of to a low-rank space . The reconstruction error is , where is the th largest eigenvalue of the covariance matrix . Under the constraint of error percentage (e.g., ), lower-rank approximation can be obtained if the minimal rank can be smaller. In this section, unless stated otherwise, we define the rank of a convolutional layer as the minimal which has reconstruction error by PCA.
Table 2 summarizes the rank in each layer of ConvNet and AlexNet without accuracy loss after Force Regularization. In the baselines, the learned filters in the front layers are intrinsically in a very low-rank space but the rank in deeper layers is high. This could explain why only speedups of the first two convolutional layers were reported in [5]. Fortunately, by using either norm or norm force, our method can efficiently maintain the low rank in the first two layers (e.g., conv1–conv2 in AlexNet) while significantly reducing the rank of deeper layers (e.g., conv3–conv5 in AlexNet). On average, our method can reduce the layer-wise rank ratio by . The effectiveness of our method on deep layers is very important as the depth of modern DNNs grows dramatically [25][10]. Fig. 5 shows the rank of ResNets-20 [10] and GoogLeNet [25] after Force Regularization, demonstrating the scalability of our method on deeper DNNs. With an acceptable accuracy loss, layers in ResNets-20 and layers in GoogLeNet are even coordinated to rank , which indicates that those Inception blocks in GoogLeNet or Residual blocks in ResNets have been over-parameterized and can be greatly simplified.
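The rank definition used in this section can be sketched as a small NumPy routine. It is an illustration under assumptions rather than the authors' code: the error percentage is interpreted as the share of the discarded eigenvalues in the total eigenvalue sum, the threshold `max_error=0.05` is only a placeholder because the value is not legible in this copy, and the function name `pca_rank` is hypothetical.

```python
import numpy as np

def pca_rank(weights, max_error=0.05):
    """Smallest rank whose PCA reconstruction error ratio is <= max_error.

    weights: filter tensor of shape (N, ...). It is reshaped to (N, D) so that
    each of the D columns is one observation of an N-dimensional variable.
    """
    flat = weights.reshape(weights.shape[0], -1)
    cov = np.atleast_2d(np.cov(flat))                # (N, N) covariance matrix
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)  # descending order
    total = eigvals.sum()
    if total == 0.0:
        return 0
    # tail[m] = share of variance lost when only the top m components are kept.
    tail = np.concatenate((np.cumsum(eigvals[::-1])[::-1], [0.0])) / total
    return int(np.argmax(tail <= max_error))         # first m meeting the bound

rng = np.random.default_rng(0)
# Hypothetical layer: 16 filters of shape 8x5x5 spanning only 3 directions.
low_rank = (rng.standard_normal((16, 3)) @ rng.standard_normal((3, 200))).reshape(16, 8, 5, 5)
full_rank = rng.standard_normal((16, 8, 5, 5))

rank_low = pca_rank(low_rank, max_error=1e-9)
rank_full = pca_rank(full_rank, max_error=0.05)
rank_ratio = rank_full / full_rank.shape[0]          # "low rank over full rank"
```

Dividing the returned rank by the number of filters gives the rank ratio reported per layer, and averaging those ratios over layers gives the average rank ratio.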
To study the tradeoff between rank and accuracy, as well as the pros and cons of norm and norm force, we conducted comprehensive experiments on AlexNet. As shown in Fig. 6, with a mere 1.71% (1.80%) accuracy loss, the average rank ratio can be reduced to 28.59% (28.72%) using norm (norm) force. Very impressively, the rank of each group in conv4 can be reduced to one by norm force. The results also show that norm force is more effective than norm force when the rank ratio is high (e.g., conv2 and conv5), while norm force works better for layers whose potential rank ratios are low (e.g., conv3 and conv4). In general, norm force can better balance the ranks across all the layers.
Because Force Regularization coordinates more useful weight information in a low-rank space, it essentially can provide a better training initialization for the DNNs that are decomposed by LRA. Fig. 7 plots the training data loss and top-1 validation error of AlexNet, which is decomposed to the same ranks by PCA. The baseline is the original AlexNet and the other AlexNet is coordinated by Force Regularization. The figure shows that the error sharply converges to a low level after a few iterations, indicating LRA provides a very good initialization for the low-rank DNNs. Training it from scratch has significant accuracy loss. More importantly, DNNs coordinated by Force Regularization can converge faster to a lower error.
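Table 3: Accuracies of AlexNet decomposed by different LRA methods.
Force | LRA | Top-1 error
None | PCA | 43.21%
None | SVD† | 43.27%
None | k-means† | 44.34%
norm | PCA | 43.25%
norm | SVD† | 43.20%
norm | k-means† | 44.80%
† SVD and k-means preserve the same ranks as PCA.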
Besides PCA [22][30], we also evaluated the effectiveness of Force Regularization when integrating it with SVD [5][26] or k-means clustering [5][2]. Table 3 compares the accuracies of AlexNet decomposed by different LRA methods. All LRAs preserve the same ranks in all layers, which means the decomposed AlexNets have the same network structure. In summary, PCA and SVD obtain similar accuracy and surpass k-means clustering. Because of space limits, we adopt PCA as the representative in our study.
5.3 Acceleration of DNN Testing
In our experiments, we first train DNNs with Force Regularization, then decompose the DNNs using LRA methods and fine-tune them to recover accuracy. In the evaluation of speed, we omit the small CIFAR-10 database and focus on large-scale DNNs on ImageNet, whose speed is a real concern. To demonstrate the effective acceleration of Force Regularization, we adopt the speedup of state-of-the-art LRAs [30][4][26] as our baseline. Our speedup is achieved in the case that the DNN filters are first coordinated by Force Regularization and then decomposed using the same LRAs. The practical GPU speed is profiled on advanced hardware (NVIDIA GTX 1080) and software (cuDNN 5.0). The CPU speed is measured on an Intel Xeon E5-2630 with the ATLAS library. The batch size is 256.
Cross-filter LRA: We first evaluate the speedup of cross-filter LRA shown in Fig. 3. In previous works [5][26], the optimal rank in each layer can be selected layer-by-layer using cross-validation. However, the number of hyperparameters increases linearly with the depth of DNNs. To save development time, we utilize an identical error percentage across all layers as the single hyperparameter although layer-wise rank selection may give better tradeoff. The rank in a layer is the minimal which has error .
As mentioned in Section 5.2 and Table 2, the learned conv1 and conv2 of AlexNet are already in a very low-rank space and achieve good speedups using LRAs [5]. Thus we mainly focus on conv3–conv5 here. Table 4 summarizes the speedups of PCA approximation of AlexNet with and without norm Force Regularization. With negligible accuracy difference, Force Regularization successfully coordinates filters to a lower-rank space and accelerates the testing by a higher factor, compared with the state-of-the-art LRA. Similar results are observed when applying norm force.
Results in Table 4 also show that practical speedup is different from theoretical speedup. Generally, the difference is smaller in lower-performance processors. In CPU mode of Table 4, Force Regularization achieves speedup of total convolutional time.
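Table 4: Ranks and speedups of conv3–conv5 in AlexNet under PCA approximation, with and without Force Regularization.
Force | Top-1 error | Metric | conv3 | conv4 | conv5
None | 43.21% | rank | 184 | 201 | 146
norm | 43.25% | rank | 124 | 106 | 129
None | 43.21% | GPU | 1.58 | 1.21 | 1.15
norm | 43.25% | GPU | 2.16 | 2.03 | 1.33
None | 43.21% | CPU | 1.78 | 1.60 | 1.47
norm | 43.25% | CPU | 2.45 | 2.76 | 1.64
None | 43.21% | theoretical | 1.79 | 1.72 | 1.63
norm | 43.25% | theoretical | 2.65 | 3.26 | 1.85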
Speeding up state-of-the-art LRA: We also duplicate the state-of-the-art work [26] as the baseline (). The code is provided by its authors at https://github.com/chengtaipu/lowrankcnn/. After LRA, AlexNet is fine-tuned with a learning rate starting from 0.001 and divided by 10 at iterations 70,000 and 140,000. Fine-tuning terminates after 150,000 iterations.
The first row in Table 5 contains the results of the baseline [26], which do not scale well to the advanced “GTX 1080 + cuDNN 5.0” in conv3–5. This is because convolution is highly optimized in cuDNN 5.0, e.g., using Winograd’s minimal filtering algorithms [17]. However, the baseline decomposes the convolution to a pair of and convolution so that the optimized cuDNN is not fully exploited. This will be a common issue in the baseline, considering Winograd’s algorithm is universally used and convolution is one of the most common structures. We find that LRA in Fig. 3 can be utilized for conv3–5 to solve this issue, because it can maintain the shape. We name this LRA as , which decomposes conv1–conv2 using LRA in [26] and conv3–5 using LRA of Fig. 3. The second row in Table 5 shows that our can scale well to the hardware and software advances of “GTX 1080 + cuDNN 5.0”. More importantly, Force Regularization on conv3–5 can enforce them to more lightweight layers and attain higher speedup factors than without using it. The result is shown in the third row, which in total achieves speedup for the whole convolution on GPU. With small accuracy loss in row 4 of Table 5, Force Regularization achieves speedup of total convolution on GPU and on CPU.
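Table 5: Speedups of conv3–conv5 in AlexNet for different LRA methods, with and without Force Regularization.
LRA | Force | Top-5 err. | Device | conv3 | conv4 | conv5
[26] | None | 20.65% | GPU | 0.86 | 0.57 | 0.40
 | None | 19.93% | GPU | 1.89 | 1.57 | 1.57
 | norm | 20.14% | GPU | 2.25 | 2.03 | 1.60
 | norm | 21.68% | GPU | 3.56 | 3.01 | 2.40
 | norm | 21.68% | CPU | 4.81 | 4.00 | 2.92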
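Table 6: Comparison with state-of-the-art DNN acceleration methods in CPU mode (layer-wise and total speedups).
Method | Top-5 err. | conv1 | conv2 | conv3 | conv4 | conv5 | total
AlexNet in Caffe | 19.97% | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
cp-decomposition [18] | 20.97% (+1.00%) | – | 4.00 | – | – | – | 1.27
one-shot [14] | 21.67% (+1.70%) | 1.48 | 2.30 | 3.84 | 3.53 | 3.13 | 2.52
SSL [28] | 19.58% (−0.39%) | 1.00 | 1.27 | 1.64 | 1.68 | 1.32 | 1.35
SSL [28] | 21.63% (+1.66%) | 1.05 | 3.37 | 6.27 | 9.73 | 4.93 | 3.13
our | 20.14% (+0.17%) | 2.61 | 6.06 | 2.48 | 2.20 | 1.58 | 2.69
our | 21.68% (+1.71%) | 2.65 | 6.22 | 4.81 | 4.00 | 2.92 | 4.05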
Table 6 compares our method with state-of-the-art DNN acceleration methods in CPU mode. When the speedup of total time was not reported by the authors, we estimate it by the weighted average of the speedups over all layers, where the weighting coefficients are derived from the percentage of running time of each layer. On our hardware platform, conv1–conv5 respectively consume , , , and testing time. The estimation is accurate; for example, we estimate of total time in one-shot [14], which is very close to reported by the authors. Compared with both the cp-decomposition and one-shot methods, our method achieves higher accuracy and higher speedup. Compared with SSL, with almost the same top-5 error ( vs. ), we attain a higher speedup of vs. .
Deep-compression [7] reported to speedups in fully-connected layers when batch size was 1. However, convolution is the bottleneck of DNNs, e.g., the convolution time in AlexNet is of the time in fully-connected layers when profiled on our CPU platform. Moreover, no speedup was observed in the batching scenario as reported by the authors [7]. More importantly, as we will show in Section 5.4, our work can work together with sparsity-based methods (e.g., SSL or deep-compression) to obtain lower-rank and sparse DNNs and potentially further accelerate the testing of DNNs.
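The total-speedup estimate used for Table 6 can be sketched as below. The text describes a weighted average of per-layer speedups with weights taken from each layer's share of running time; the sketch implements that literal reading and, for comparison, a time-based estimate (old total time over new total time), since the exact formula is not shown. The function names are hypothetical, and the time shares are made-up placeholders because the measured percentages are not legible in this copy.

```python
def weighted_average_speedup(time_fractions, layer_speedups):
    """Weighted arithmetic mean of per-layer speedups (literal reading of the text)."""
    total = sum(time_fractions)
    return sum(f * s for f, s in zip(time_fractions, layer_speedups)) / total

def time_based_speedup(time_fractions, layer_speedups):
    """Old total time divided by new total time; each layer shrinks by its own factor."""
    total = sum(time_fractions)
    return total / sum(f / s for f, s in zip(time_fractions, layer_speedups))

# Placeholder running-time shares for conv1..conv5; NOT the paper's measurements.
fractions = [0.2, 0.3, 0.2, 0.2, 0.1]
# Per-layer speedups of the one-shot [14] row in Table 6.
speedups = [1.48, 2.30, 3.84, 3.53, 3.13]

estimate_mean = weighted_average_speedup(fractions, speedups)
estimate_time = time_based_speedup(fractions, speedups)
```

The two estimates coincide when every layer is accelerated by the same factor; otherwise the arithmetic mean is the larger of the two, so which reading is used matters when per-layer speedups differ widely.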
5.4 Lowerrank and Sparse DNNs
We sparsify the lightweight deep neural network (i.e., the first one of in Table 6), using Structured Sparsity Learning (SSL) [28] or non-structured connection pruning [23]. Note that Guided Sparsity Learning (GSL) is not adopted in our connection pruning, though better sparsity is achievable when applying it. Figure 8 summarizes the results.
Experiments show that our method can work together with both structured and non-structured sparsity methods to further compress and accelerate models. Compared with deep-compression in Figure 8(a), our model has comparable compression rates but faster testing time. Typically, our model has higher compression rates in convolutional layers, which provides more room for computation reduction and generalizes better to modern DNNs (ResNets-152 [10], for example, whose parameters in fc layers are only ). In Figure 8(b), our accelerated model can be further accelerated using SSL. The shape-wise sparsity in conv3–5 of our model is slightly lower because our model is already aggressively compressed by LRA. The higher filter-wise sparsity, however, implies the orthogonality of our approach to SSL.
5.5 Generalization of Force Regularization
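Table 7: Accuracy of DNNs trained with Discrimination Regularization.
Net | Regularization | Top-1 error
AlexNet | None (baseline) | 42.63%
AlexNet | norm force | 41.71%
AlexNet | norm force | 41.53%
ResNets-20 | None (baseline) | 8.82%
ResNets-20 | norm force | 7.97%
ResNets-20 | norm force | 8.02%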
In convolutional layers, each filter basically extracts a discriminative feature, e.g., an orientation-selective pattern or a color blob in the first layer [16] or a high-level feature (e.g., textures, faces, etc.) in deeper layers [29]. The discrimination among filters is important for classification performance. Our method can coordinate filters for more lightweight DNNs while maintaining the discrimination. It can also be generalized to learn more discriminative filters to improve the accuracy. The extension to Discrimination Regularization is straightforward but effective: the opposite gradient of Force Regularization (i.e., ) is utilized to update the filter. In this scenario, it works as a repulsive force that repels surrounding filters and enhances the discrimination. Table 7 summarizes the improved accuracy of state-of-the-art DNNs.
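The sign flip that turns Force Regularization into Discrimination Regularization can be illustrated on two filters. The sketch below is a toy example under assumptions, not the authors' implementation: it uses the distance-proportional force on normalized filters, leaves out the data loss gradient entirely, and the function names and step size are hypothetical.

```python
import numpy as np

def force_gradient(filters):
    """Attractive, distance-proportional force gradient, tangential to each filter."""
    lengths = np.linalg.norm(filters, axis=1, keepdims=True)
    unit = filters / lengths
    total = (unit[None, :, :] - unit[:, None, :]).sum(axis=1)   # net pull per filter
    radial = (total * unit).sum(axis=1, keepdims=True) * unit
    return lengths * (total - radial)

def regularizer_step(filters, step=0.1, repulsive=False):
    """Move filters by the force term only (data loss gradient left out)."""
    sign = -1.0 if repulsive else 1.0
    return filters + sign * step * force_gradient(filters)

def cosine(a, b):
    """Cosine of the included angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

theta = np.deg2rad(30.0)
pair = np.array([[1.0, 0.0], [np.cos(theta), np.sin(theta)]])   # two filters, 30 degrees apart
before = cosine(pair[0], pair[1])
attracted = regularizer_step(pair, repulsive=False)             # Force Regularization
repelled = regularizer_step(pair, repulsive=True)               # Discrimination Regularization
```

After one step the attracted pair has a smaller included angle and the repelled pair a larger one, which matches the intended roles of the two regularizers: more correlated filters for lower rank versus more distinct filters for discrimination.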
Acknowledgments
This work was supported in part by NSF CCF-1744082. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or their contractors.
References

 [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 2262–2270, 2016.
 [2] C. Bauckhage. k-means clustering is matrix factorization. arXiv:1512.07548, 2015.
 [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 [4] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS). 2013.
 [5] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS). 2014.
 [6] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems (NIPS). 2016.
 [7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. arXiv:1510.00149, 2015.
 [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS). 2015.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International Conference on Computer Vision (ICCV), 2015.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [11] Y. Ioannou, D. P. Robertson, J. Shotton, R. Cipolla, and A. Criminisi. Training cnns with low-rank filters for efficient image classification. arXiv:1511.06744, 2015.
 [12] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014.
 [13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
 [14] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv:1511.06530, 2015.
 [15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS). 2012.
 [17] A. Lavin. Fast algorithms for convolutional neural networks. arXiv:1509.09308, 2015.
 [18] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv:1412.6553, 2014.
 [19] V. Lebedev and V. Lempitsky. Fast convnets using group-wise brain damage. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [20] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), volume 2, pages 598–605, 1989.
 [21] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.
 [22] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [23] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey. Faster cnns with direct sparse convolutions and guided pruning. In International Conference on Learning Representations (ICLR), 2017.
 [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [26] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. In International Conference on Learning Representations (ICLR), 2016.
 [27] P. Wang and J. Cheng. Accelerating convolutional neural networks for mobile applications. In Proceedings of the 2016 ACM on Multimedia Conference, 2016.
 [28] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS). 2016.
 [29] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.
 [30] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, Oct 2016.