1 Introduction
Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have achieved remarkable success in visual tasks (Krizhevsky et al. [2012], Girshick et al. [2014], Simonyan and Zisserman [2014], Szegedy et al. [2015], He et al. [2015]) by leveraging large-scale networks that learn from a huge volume of data. Deploying such big models, however, is computation-intensive and memory-intensive. To reduce computation cost, many studies compress the scale of DNNs, including sparsity regularization (Liu et al. [2015]), connection pruning (Han et al. [2015a], Han et al. [2015b]) and low rank approximation (Denil et al. [2013], Denton et al. [2014], Jaderberg et al. [2014], Ioannou et al. [2015], Tai et al. [2015]). Sparsity regularization and connection pruning, however, often produce non-structured random connectivity in the DNN and thus irregular memory access that adversely impacts practical acceleration on hardware platforms. Figure 1 depicts the practical speedup of each layer of an AlexNet that is non-structurally sparsified by the $\ell_1$-norm. Compared to the original model, the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality associated with the scattered weight distribution, the achieved speedups are either very limited or negative (i.e., a slowdown) even when the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this paper.

In recently proposed low rank approximation approaches, the DNN is trained first, and then each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally, fine-tuning is performed to restore the model accuracy. Low rank approximation achieves practical speedups because it coordinates model parameters in dense matrices and avoids the locality problem of non-structured sparsity regularization. However, low rank approximation can only obtain a compact structure within each layer, and the structures of the layers are fixed during fine-tuning, so that costly reiterations of decomposing and fine-tuning are required to find an optimal weight approximation that balances speedup and accuracy.
Inspired by the facts that (1) there is redundancy across filters and channels (Jaderberg et al. [2014]); (2) filter shapes are usually fixed as cuboids, but enabling arbitrary shapes can potentially eliminate unnecessary computation imposed by this fixation; and (3) network depth is critical for classification, but deeper layers cannot always guarantee a lower error because of the exploding gradients and degradation problem (He et al. [2015]), we propose the Structured Sparsity Learning (SSL) method to directly learn a compressed structure of deep CNNs by group Lasso regularization during training. SSL is a generic regularization that adaptively adjusts multiple structures in a DNN, including the structures of filters, channels, and filter shapes within each layer, and the structure of depth beyond the layers. SSL combines structure regularization (on the DNN, for classification accuracy) with locality optimization (on memory access, for computation efficiency), offering not only well-regularized big models with improved accuracy but also greatly accelerated computation (e.g., 5.1× on CPU and 3.1× on GPU for AlexNet).
2 Related works
Connection pruning and weight sparsifying. Han et al. [2015a, 2015b] reduced the number of parameters of AlexNet by 9× and of VGG-16 by 13× using connection pruning. Since most of the reduction is achieved on fully-connected layers, the authors obtained 3× to 4× layer-wise speedup for fully-connected layers. However, no practical speedups of convolutional layers were observed because of the issue shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer fully-connected layers (e.g., only 3.99% of the parameters of ResNet-152 in He et al. [2015] are from fully-connected layers), compression and acceleration of convolutional layers become essential. Liu et al. [2015] achieved >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue shown in Figure 1 by hardcoding the sparse weights into the program, achieving a layer-wise 4.59× speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve higher speedups with the same accuracy. Note that hardware and program optimizations can further boost system performance on top of SSL but are not covered in this work.
Low rank approximation. Denil et al. [2013] predicted 95% of the parameters in a DNN by exploiting the redundancy across filters and channels. Inspired by this, Jaderberg et al. [2014] achieved 4.5× speedup on CPUs for scene text character recognition, and Denton et al. [2014] achieved 2× speedups on both CPUs and GPUs for the first two layers. Both works used Low Rank Approximation (LRA) with 1% accuracy drop. Tai et al. [2015] and Ioannou et al. [2015] improved and extended LRA to larger DNNs. However, the network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning, and cross-validating are still needed to find an optimal structure for the trade-off between accuracy and speed. As the number of hyper-parameters in LRA increases linearly with layer depth (Denton et al. [2014], Tai et al. [2015]), the search space increases linearly or even polynomially for very deep DNNs. Compared to LRA, our contributions are: (1) SSL can dynamically optimize the compactness of the DNN structure with only one hyper-parameter and no reiterations; (2) besides the redundancy within layers, SSL also exploits the necessity of deep layers and reduces them; (3) DNN filters regularized by SSL have a lower-rank approximation, so SSL can work together with LRA for more efficient model compression.
Model structure learning. Group Lasso (Yuan and Lin [2006]) is an efficient regularization for learning sparse structures. Kim and Xing [2010] used group Lasso to regularize the structure of a correlation tree for multi-task regression and reduced prediction errors. Liu et al. [2015] utilized group Lasso to constrain the scale of the structure of LRA. To adapt the DNN structure to different databases, Feng and Darrell [2015] learned the appropriate number of filters in a DNN. Different from these prior works, we apply group Lasso to regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code can be found at https://github.com/wenwei202/caffe/tree/scnn.
3 Structured Sparsity Learning Method for DNNs
We focus mainly on Structured Sparsity Learning (SSL) for convolutional layers to regularize the structure of DNNs. We first propose a generic method to regularize structures of a DNN in Section 3.1, and then specialize the method to the structures of filters, channels, filter shapes and depth in Section 3.2. Variants of the formulations are also discussed from the viewpoint of computational efficiency in Section 3.3.
3.1 Proposed structured sparsity learning for generic structures
Suppose the weights of the convolutional layers in a DNN form a sequence of 4-D tensors $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times M_l \times K_l}$, where $N_l$, $C_l$, $M_l$ and $K_l$ are the dimensions of the $l$-th ($1 \le l \le L$) weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. $L$ denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with structured sparsity regularization can be formulated as:

$$E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g\left(W^{(l)}\right). \tag{1}$$

Here $W$ represents the collection of all weights in the DNN; $E_D(W)$ is the loss on data; $R(\cdot)$ is non-structured regularization applied to every weight, e.g., the $\ell_2$-norm; and $R_g(\cdot)$ is the structured sparsity regularization on each layer. Because group Lasso can effectively zero out all weights in some groups (Yuan and Lin [2006], Kim and Xing [2010]), we adopt it in our SSL. The group Lasso regularization on a set of weights $w$ can be represented as $R_g(w) = \sum_{g=1}^{G} \lVert w^{(g)} \rVert_g$, where $w^{(g)}$ is a group of partial weights in $w$ and $G$ is the total number of groups. Different groups may overlap. Here $\lVert \cdot \rVert_g$ is the group Lasso, or $\lVert w^{(g)} \rVert_g = \sqrt{\sum_{i=1}^{|w^{(g)}|} \left(w_i^{(g)}\right)^2}$, where $|w^{(g)}|$ is the number of weights in $w^{(g)}$.
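As a minimal sketch of the group Lasso term described above, the following computes the sum of per-group Euclidean norms for a set of weights that has already been split into groups. The function name `group_lasso` and the toy weights are illustrative assumptions, not part of the released Caffe code.

```python
import math

def group_lasso(groups):
    """Group Lasso penalty: the sum over groups of each group's Euclidean norm."""
    return sum(math.sqrt(sum(w * w for w in g)) for g in groups)

# Hypothetical toy weights split into three groups; the second group is all zero
# and therefore contributes nothing to the penalty.
groups = [[3.0, 4.0], [0.0, 0.0], [1.0, 2.0, 2.0]]
penalty = group_lasso(groups)  # 5.0 + 0.0 + 3.0
print(penalty)
```

Because the norm is taken per group rather than per weight, the penalty only vanishes for a group when every weight in that group is zero, which is what makes whole groups removable.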
3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth
In SSL, the learned “structure” is decided by the way the groups $w^{(g)}$ are split. We investigate and formulate the filter-wise, channel-wise, shape-wise, and depth-wise structured sparsity in Figure 2. For simplicity, the $R(\cdot)$ term of Eq. (1) is omitted in the following formulations.

Penalizing unimportant filters and channels. Suppose $W^{(l)}_{n_l,:,:,:}$ is the $n_l$-th filter and $W^{(l)}_{:,c_l,:,:}$ is the $c_l$-th channel of all filters in the $l$-th layer. The optimization target of learning the filter-wise and channel-wise structured sparsity can be defined as

$$E(W) = E_D(W) + \lambda_n \cdot \sum_{l=1}^{L}\left(\sum_{n_l=1}^{N_l} \lVert W^{(l)}_{n_l,:,:,:} \rVert_g\right) + \lambda_c \cdot \sum_{l=1}^{L}\left(\sum_{c_l=1}^{C_l} \lVert W^{(l)}_{:,c_l,:,:} \rVert_g\right). \tag{2}$$

As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note that zeroing out a filter in the $l$-th layer results in a dummy zero output feature map, which in turn makes the corresponding channel in the $(l+1)$-th layer useless. Hence, we combine the filter-wise and channel-wise structured sparsity in the learning simultaneously.
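To make the grouping in Eq. (2) concrete, here is a small sketch that evaluates the filter-wise and channel-wise norms for a single layer stored as nested lists `W[n][c][m][k]` (filter, channel, height, width). The helper names, the toy tensor, and the values of `lambda_n` and `lambda_c` are hypothetical and chosen only for illustration.

```python
import math

def filter_norms(W):
    """Euclidean norm of each 3D filter W[n] (all channels and positions)."""
    return [math.sqrt(sum(w * w for ch in f for row in ch for w in row)) for f in W]

def channel_norms(W):
    """Euclidean norm of each input channel c, taken across all filters."""
    n_channels = len(W[0])
    return [math.sqrt(sum(w * w for f in W for row in f[c] for w in row))
            for c in range(n_channels)]

# Hypothetical layer: 2 filters, 2 channels, 1x2 kernels.
W = [
    [[[3.0, 4.0]], [[0.0, 0.0]]],  # filter 0 only uses channel 0
    [[[0.0, 0.0]], [[0.0, 0.0]]],  # filter 1 is entirely zero
]
lambda_n, lambda_c = 0.1, 0.1      # assumed regularization strengths

f_norms = filter_norms(W)          # one norm per filter
c_norms = channel_norms(W)         # one norm per channel
penalty = lambda_n * sum(f_norms) + lambda_c * sum(c_norms)

# Filters and channels whose group norm is exactly zero can be removed.
zero_filters = [n for n, v in enumerate(f_norms) if v == 0.0]
zero_channels = [c for c, v in enumerate(c_norms) if v == 0.0]
print(f_norms, c_norms, penalty, zero_filters, zero_channels)
```

In this toy layer, filter 1 and channel 1 are both all-zero groups, so the layer could shrink to one filter over one channel without changing its output.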
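Learning arbitrary shapes of filters. As illustrated in Figure 2, $W^{(l)}_{:,c_l,m_l,k_l}$ denotes the vector of all corresponding weights located at spatial position $(m_l, k_l)$ in the 2D filters across the $c_l$-th channel. Thus, we define $W^{(l)}_{:,c_l,m_l,k_l}$ as the shape fiber related to learning arbitrary filter shapes, because a homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The optimization target of learning filter shapes becomes:

$$E(W) = E_D(W) + \lambda_s \cdot \sum_{l=1}^{L}\left(\sum_{c_l=1}^{C_l} \sum_{m_l=1}^{M_l} \sum_{k_l=1}^{K_l} \lVert W^{(l)}_{:,c_l,m_l,k_l} \rVert_g\right). \tag{3}$$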
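The shape fibers of Eq. (3) can be sketched in the same nested-list layout `W[n][c][m][k]`: each fiber collects the weight at one (channel, row, column) position across all filters. The function name and the toy layer below are assumptions made for illustration.

```python
import math

def shape_fiber_norms(W):
    """Map each position (c, m, k) to the norm of its fiber across all filters."""
    C, M, K = len(W[0]), len(W[0][0]), len(W[0][0][0])
    return {
        (c, m, k): math.sqrt(sum(f[c][m][k] ** 2 for f in W))
        for c in range(C) for m in range(M) for k in range(K)
    }

# Hypothetical layer: 2 filters, 1 channel, 2x2 kernels whose right column is zero
# in every filter.
W = [
    [[[1.0, 0.0], [2.0, 0.0]]],
    [[[2.0, 0.0], [4.0, 0.0]]],
]
norms = shape_fiber_norms(W)
kept = sorted(pos for pos, v in norms.items() if v > 0.0)
print(kept)
```

Only the positions in `kept` survive, so the 2x2 kernel in this example degenerates to a 2x1 column shared by every filter; this is the sense in which zeroing shape fibers yields a homogeneous non-cuboid filter shape.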
Regularizing layer depth. We also explore depth-wise sparsity to regularize the depth of DNNs in order to improve accuracy and reduce computation cost. The corresponding optimization target is $E(W) = E_D(W) + \lambda_d \cdot \sum_{l=1}^{L} \lVert W^{(l)} \rVert_g$. Different from the other sparsification techniques discussed, zeroing out all the filters in a layer cuts off message propagation in the DNN so that the output neurons cannot perform any classification. Inspired by the structure of highway networks (Srivastava et al. [2015]) and deep residual networks (He et al. [2015]), we propose to leverage the shortcuts across layers to solve this issue. As illustrated in Figure 2, even when SSL removes an entire unimportant layer, feature maps are still forwarded through the shortcut.

3.3 Structured sparsity learning for computationally efficient structures
All proposed schemes in Section 3.2 can learn a compact DNN to reduce computation cost. Moreover, some variants of the formulations of these schemes can directly learn structures that can be computed efficiently.
2D-filter-wise sparsity for convolution. 3D convolution in DNNs is essentially a composition of 2D convolutions. To perform efficient convolution, we explored a fine-grained variant of filter-wise sparsity, namely 2D-filter-wise sparsity, which spatially enforces group Lasso on each 2D filter $W^{(l)}_{n_l,c_l,:,:}$. The saved convolution is proportional to the percentage of removed 2D filters. The fine-grained version of filter-wise sparsity can reduce the computation associated with convolution more efficiently: because the group sizes are much smaller and thus the weight-updating gradients are sharper, it helps group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN.
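As a rough illustration of the proportionality stated above, the sketch below counts the all-zero 2D filters of one layer stored as `W[n][c][m][k]`; under the assumption that every 2D convolution costs the same, this fraction is also the fraction of 2D convolutions saved. The function name and toy tensor are hypothetical.

```python
def filter2d_sparsity(W):
    """Fraction of 2D filters W[n][c] whose weights are all zero."""
    slices = [ch for f in W for ch in f]
    zero = sum(1 for ch in slices if all(w == 0.0 for row in ch for w in row))
    return zero / len(slices)

# Hypothetical layer: 2 filters x 2 channels = four 2D filters, three of them zero.
W = [
    [[[1.0, -1.0]], [[0.0, 0.0]]],
    [[[0.0, 0.0]], [[0.0, 0.0]]],
]
sparsity = filter2d_sparsity(W)
print(sparsity)
```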
Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in DNNs is commonly converted to GEneral Matrix Multiplication (GEMM) by lowering weight tensors and feature tensors to matrices (Chetlur et al. [2014]). For example, in Caffe (Jia et al. [2014]), a 3D filter $W^{(l)}_{n_l,:,:,:}$ is reshaped to a row in the weight matrix, where each column is the collection of weights $W^{(l)}_{:,c_l,m_l,k_l}$ related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can directly reduce the dimension of the weight matrix in GEMM by removing zero rows and columns. In this context, we use row-wise and column-wise sparsity as interchangeable terms for filter-wise and shape-wise sparsity, respectively.
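The effect on GEMM can be sketched with plain lists: lower a weight tensor `W[n][c][m][k]` to a matrix with one row per 3D filter, drop the all-zero rows and columns, and multiply the smaller matrix against the matching rows of a lowered feature matrix. All names and values here are illustrative assumptions; a real implementation would call an optimized GEMM routine instead of the naive `matmul` below.

```python
def lower_to_matrix(W):
    """Reshape W[n][c][m][k] into an N x (C*M*K) matrix, one row per 3D filter."""
    return [[w for ch in f for row in ch for w in row] for f in W]

def compact(A):
    """Drop all-zero rows and columns; return (small matrix, kept rows, kept cols)."""
    rows = [i for i, r in enumerate(A) if any(v != 0.0 for v in r)]
    cols = [j for j in range(len(A[0])) if any(r[j] != 0.0 for r in A)]
    return [[A[i][j] for j in cols] for i in rows], rows, cols

def matmul(A, B):
    """Naive dense matrix product, standing in for a GEMM call."""
    return [[sum(a * b for a, b in zip(r, col)) for col in zip(*B)] for r in A]

# Hypothetical layer: 3 filters, 1 channel, 1x3 kernels. Filter 1 (a row) and
# kernel position 1 (a column) are entirely zero.
W = [[[[1.0, 0.0, 2.0]]], [[[0.0, 0.0, 0.0]]], [[[3.0, 0.0, 4.0]]]]
A = lower_to_matrix(W)
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # lowered feature matrix (assumed)

full = matmul(A, X)                         # 3x3 times 3x2
A_small, rows, cols = compact(A)            # 2x2 weight matrix
X_small = [X[j] for j in cols]              # matching feature rows
small = matmul(A_small, X_small)            # 2x2 times 2x2
print(full, small, rows, cols)
```

The rows of `small` equal the non-zero rows of `full`, so the smaller dense product loses no information while the operands stay contiguous in memory.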
4 Experiments
We evaluated the effectiveness of SSL using published models on three databases: MNIST, CIFAR-10, and ImageNet. Unless otherwise stated, SSL starts with the network whose weights are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe on a single-thread Intel Xeon E5-2630 CPU.
4.1 LeNet and multilayer perceptron on MNIST
In the MNIST experiment, we examined the effectiveness of SSL in two types of networks: LeNet (LeCun et al. [1998]) as implemented by Caffe and a multilayer perceptron (MLP) network. Both networks were trained without data augmentation.
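Table 1: Results after penalizing unimportant filters and channels in LeNet.

| LeNet # | Error | Filter # § | Channel # § | FLOP § | Speedup § |
|---|---|---|---|---|---|
| 1 (baseline) | 0.9% | 20–50 | 1–20 | 100%–100% | 1.00×–1.00× |
| 2 | 0.8% | 5–19 | 1–4 | 25%–7.6% | 1.64×–5.23× |
| 3 | 1.0% | 3–12 | 1–3 | 15%–3.6% | 1.99×–7.44× |

§ In the order of conv1–conv2.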
LeNet: When applying SSL to LeNet, we constrain the network with filter-wise and channel-wise sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes the remaining filters and channels, floating-point operations (FLOP), and practical speedups. In the table, LeNet 1 is the baseline and the others are the results after applying SSL with different strengths of structured sparsity regularization. The results show that our method achieves similar error ($\pm 0.1\%$) with far fewer filters and channels, and saves significant FLOP and computation time.
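The FLOP column of Table 1 can be checked with a short calculation, assuming that the kernel size and the output feature-map size of each layer are unchanged, so that the FLOP of a convolutional layer scale with the product of its remaining filters and input channels. The dictionary layout and function name are illustrative; the counts are copied from Table 1.

```python
# (filters, channels) per layer, copied from Table 1.
baseline = {"conv1": (20, 1), "conv2": (50, 20)}
models = {
    "LeNet 2": {"conv1": (5, 1), "conv2": (19, 4)},
    "LeNet 3": {"conv1": (3, 1), "conv2": (12, 3)},
}

def flop_percent(remaining, original):
    """Remaining FLOP as a percentage, assuming FLOP ~ filters x channels."""
    (f, c), (bf, bc) = remaining, original
    return round(100.0 * f * c / (bf * bc), 1)

ratios = {
    name: {layer: flop_percent(cfg[layer], baseline[layer]) for layer in cfg}
    for name, cfg in models.items()
}
print(ratios)
```

Under this assumption the computed percentages match the table: 25% and 7.6% for LeNet 2, and 15% and 3.6% for LeNet 3.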
To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters in Figure 3. It can be seen that most filters in LeNet 2 are entirely zeroed out, except for the five most important detectors of stroke patterns, which are sufficient for feature extraction. The accuracy of LeNet 3 (which further removes the weakest and redundant stroke detectors) drops only 0.2% from that of LeNet 2. Compared to the random and blurry filter patterns in LeNet 1 that result from the high freedom of the parameter space, the filters in LeNet 2 and 3 are regularized and converge to smoother and more natural patterns. This explains why our proposed SSL obtains the same level of accuracy with far fewer filters. The smoothness of the filters is also observed in the deeper layers.

Table 2: Results after learning filter shapes in LeNet.

| LeNet # | Error | Filter size § | Channel # | FLOP | Speedup |
|---|---|---|---|---|---|
| 1 (baseline) | 0.9% | 25–500 | 1–20 | 100%–100% | 1.00×–1.00× |
| 4 | 0.8% | 21–41 | 1–2 | 8.4%–8.2% | 2.33×–6.93× |
| 5 | 1.0% | 7–14 | 1–1 | 1.4%–2.8% | 5.19×–10.82× |

§ The sizes of filters after removing zero shape fibers, in the order of conv1–conv2.
The effectiveness of shape-wise sparsity on LeNet is summarized in Table 2. The baseline LeNet 1 has conv1 filters with a regular square shape (size = 25), while LeNet 5 reduces them to a shape that can be bounded by a rectangle (size = 7). The 3D shape of the conv2 filters in the baseline is also regularized to a 2D shape within only one channel in LeNet 5, indicating that only one filter in conv1 is needed. This significantly saves FLOP and computation time.
MLP: Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e., the number of neurons) of fully-connected layers. We enforce group Lasso regularization on all the input (or output) connections of each neuron. A neuron whose input connections are all zeroed out degenerates to a bias neuron in the next layer; similarly, a neuron degenerates to a removable dummy neuron if all of its output connections are zeroed out.
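The two degenerate cases described above can be sketched for one fully-connected layer stored as a matrix `W[i][j]`, the weight from input neuron `j` to output neuron `i`. The function name and the toy matrix are hypothetical.

```python
def degenerate_neurons(W):
    """Return (removable input neurons, bias-only output neurons) of one layer."""
    n_out, n_in = len(W), len(W[0])
    # An input neuron with no non-zero outgoing connection is a removable dummy.
    dead_inputs = [j for j in range(n_in)
                   if all(W[i][j] == 0.0 for i in range(n_out))]
    # An output neuron with no non-zero incoming connection only emits its bias.
    bias_only = [i for i in range(n_out) if all(w == 0.0 for w in W[i])]
    return dead_inputs, bias_only

# Hypothetical 3x3 layer after group Lasso has zeroed one column and one row.
W = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, -1.0, 2.0],
]
dead_inputs, bias_only = degenerate_neurons(W)
input_sparsity = len(dead_inputs) / len(W[0])
print(dead_inputs, bias_only, input_sparsity)
```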
Figure 4 summarizes the learned structure and FLOP of different MLP networks. The results show that SSL can not only remove hidden neurons but also discover the sparsity of images. For example, Figure 4 depicts the number of connections of each input neuron in MLP 2, where 40.18% of the input neurons have zero connections and are concentrated at the boundary of the image. Such a distribution is consistent with our intuition: handwritten digits are usually written in the center, and pixels close to the boundary contain little discriminative information for classification.
4.2 ConvNet and ResNet on CIFAR-10
We implemented the ConvNet of Krizhevsky et al. [2012] and deep residual networks (ResNet) (He et al. [2015]) on CIFAR-10. When regularizing filters, channels, and filter shapes, the results and observations for both networks are similar to those of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and shape-wise sparsity to reduce the dimension of the weight matrix in the GEMM of ConvNet. We also learn the depth-wise sparsity of ResNet to regularize the depth of the DNNs.
ConvNet: We use the network from Krizhevsky et al. [2012] as the baseline and implement it using Caffe. All configurations remain the same as in the original implementation, except that we added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting. ConvNet is trained without data augmentation. Table 3 summarizes the results of three ConvNet networks. Here, the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns. Figure 5 shows their learned conv1 filters. In Table 3, SSL reduces the size of the weight matrix in ConvNet 2 by 50%, 70.7% and 36.1% for the three convolutional layers, respectively, and achieves good speedups without accuracy drop. Surprisingly, without SSL, four conv1 filters of the baseline are actually all zeros, as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is applied, half of the conv1 filters in ConvNet 2 can be zeroed out without accuracy drop.
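Table 3: Learning row-wise and column-wise sparsity of ConvNet on CIFAR-10.

| ConvNet # | Error | Row sparsity § | Column sparsity § | Speedup § |
|---|---|---|---|---|
| 1 (baseline) | 17.9% | 12.5%–0%–0% | 0%–0%–0% | 1.00×–1.00×–1.00× |
| 2 | 17.9% | 50.0%–28.1%–1.6% | 0%–59.3%–35.1% | 1.43×–3.05×–1.57× |
| 3 | 16.9% | 31.3%–0%–1.6% | 0%–42.8%–9.8% | 1.25×–2.01×–1.18× |

§ In the order of conv1–conv2–conv3.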
On the other hand, in ConvNet 3, SSL achieves 1.0% ($\pm$0.16%) lower error with a model even smaller than the baseline. In this scenario, SSL acts as a structure regularization that dynamically learns a better network structure (including the number of filters and the filter shapes) to reduce the error.
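The weight-matrix size reductions quoted for ConvNet 2 follow from the row and column sparsities in Table 3: if a fraction r of the rows and a fraction c of the columns are all zero, the remaining matrix holds (1 − r)(1 − c) of the original entries. A minimal check, with the sparsity values copied from the table and an illustrative function name:

```python
def size_reduction(row_sparsity, col_sparsity):
    """Fraction of the weight matrix removed by dropping zero rows and columns."""
    return 1.0 - (1.0 - row_sparsity) * (1.0 - col_sparsity)

# (row sparsity, column sparsity) of ConvNet 2, copied from Table 3.
convnet2 = {"conv1": (0.500, 0.000), "conv2": (0.281, 0.593), "conv3": (0.016, 0.351)}

reductions = {
    layer: round(100.0 * size_reduction(r, c), 1) for layer, (r, c) in convnet2.items()
}
print(reductions)
```

The results, 50.0%, 70.7% and 36.1%, agree with the reductions stated in the text.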
ResNet: To investigate the necessary depth of DNNs required by SSL, we use the 20-layer deep residual network (ResNet-20) proposed in He et al. [2015] as the baseline. The network has 19 convolutional layers and 1 fully-connected layer. Identity shortcuts are used to connect feature maps with the same dimension, while 1×1 convolutional layers are chosen as shortcuts between feature maps with different dimensions. Batch normalization (Ioffe and Szegedy [2015]) is adopted after convolution and before activation. We use the same data augmentation and training hyper-parameters as in He et al. [2015]. The final error of the baseline is 8.82%. In SSL, the depth of ResNet-20 is regularized by depth-wise sparsity. Group Lasso regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers with all-zero weights are removed and the net is finally fine-tuned with a base learning rate of 0.01, which is lower than that of the baseline (0.1).

Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth regularization. Compared with the original ResNet in He et al. [2015], SSL learns a ResNet with 14 layers (SSL-ResNet-14) that reaches a lower error than the baseline with 20 layers (ResNet-20); SSL-ResNet-18 and ResNet-32 achieve errors of 7.40% and 7.51%, respectively. This result implies that SSL can work as a depth regularization to improve classification accuracy. Note that SSL can efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, this does not mean that the depth of the network is unimportant. The trend in Figure 6 shows that the test error generally declines as more layers are preserved. A slight rise in error from SSL-ResNet-18 to SSL-ResNet-20 shows the suboptimal selection of the depth in the group of “32×32”.
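Why all-zero layers can be removed safely when shortcuts are present can be seen in a deliberately simplified sketch: each residual block below computes `x + F(x)`, with a single linear map standing in for the convolutional layers between two shortcut endpoints (batch normalization, biases and activations are omitted as a simplifying assumption). All names and values are hypothetical.

```python
def linear(W, x):
    """Matrix-vector product, standing in for the convolutions inside a block."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def residual_block(W, x):
    """Identity shortcut plus the block's transformation: y = x + F(x)."""
    return [a + b for a, b in zip(x, linear(W, x))]

def forward(blocks, x):
    for W in blocks:
        x = residual_block(W, x)
    return x

def prune_zero_blocks(blocks):
    """Drop blocks whose weights were all driven to zero by depth-wise sparsity."""
    return [W for W in blocks if any(w != 0.0 for row in W for w in row)]

# Hypothetical network of three blocks; the middle one has been zeroed out.
blocks = [
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
x = [1.0, 2.0]
pruned = prune_zero_blocks(blocks)
out_full = forward(blocks, x)
out_pruned = forward(pruned, x)
print(len(pruned), out_full, out_pruned)
```

The zeroed block reduces to the identity through its shortcut, so removing it shortens the network without changing the output in this simplified setting.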
4.3 AlexNet on ImageNet
To show the generalization of our method to large-scale DNNs, we evaluate SSL using AlexNet with ILSVRC 2012. CaffeNet (Jia et al. [2014]), the replication of AlexNet (Krizhevsky et al. [2012]) with minor changes, is used in our experiment. All training images are rescaled to 256×256. A 227×227 image is randomly cropped from each scaled image and mirrored for data augmentation, and only the center crop is used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with structure regularization; when it converges, zero groups are removed to obtain a DNN with the new structure; finally, the network is fine-tuned without SSL to regain the accuracy.
[Figure 7, caption partially recovered: … Principal component analysis (PCA) is used to perform dimensionality reduction to exploit filter redundancy. The eigenvectors corresponding to the largest eigenvalues are selected as the basis of the lower-dimensional space. Dashed lines denote the results of the baselines and solid lines indicate those of AlexNet 5 in Table 4; (c) speedups of $\ell_1$-norm and SSL on various CPU and GPU platforms (in the x-axis labels, T# is the maximum number of physical threads on the Xeon CPU). AlexNet 1 and AlexNet 2 in Table 4 are used as testbenches.]

We first studied 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between computation complexity and classification accuracy. Figure 7 shows the 2D-filter sparsity (the ratio between the removed 2D filters and the total 2D filters) and the saved FLOP of 2D convolutions vs. the validation error. In Figure 7, deeper layers generally have higher sparsity as the group size shrinks and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOP by 30%–40% without accuracy loss, or reduce the error of AlexNet by about 1%, down to 41.69%, while retaining the original number of parameters. Shape-wise sparsity obtains similar results: in Table 4, for example, AlexNet 5 achieves on average 1.4× layer-wise speedup on both CPU and GPU without accuracy loss after shape regularization; the top-1 error can also be reduced to 41.83% if the parameters are retained. In Figure 7, the obtained DNN with the lowest error has a very low sparsity, indicating that the number of parameters in a DNN is still important for maintaining learning capacity. In this case, SSL works as a regularization that adds a smoothness restriction to the model in order to avoid over-fitting. Figure 7 also compares the results of dimensionality reduction of weight tensors in the baseline and in our SSL-regularized AlexNet. The results show that the smoothness restriction enforces parameter searching in a lower-dimensional space and enables lower-rank approximation of the DNNs. Therefore, SSL can work together with low rank approximation to achieve even higher model compression.
Besides the above analyses, the computation efficiencies of structured sparsity and non-structured sparsity are compared in Caffe using standard off-the-shelf libraries, i.e., Intel Math Kernel Library on CPU and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn an AlexNet with high column-wise and row-wise sparsity as the representative of the structured sparsity method. The $\ell_1$-norm is selected as the representative of the non-structured sparsity method instead of the connection pruning in Han et al. [2015a], because the $\ell_1$-norm obtains a higher sparsity on convolutional layers, as shown by the results of AlexNet 3 and AlexNet 4 in Table 4. Speedups achieved by SSL are measured by GEMM subroutines, where the non-zero rows and columns in each weight matrix are concatenated in consecutive memory space. Note that, compared to GEMM, the overhead of concatenation can be ignored. To measure the speedups of the $\ell_1$-norm, sparse weight matrices are stored in the Compressed Sparse Row (CSR) format and computed by sparse-dense matrix multiplication subroutines.
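For readers unfamiliar with CSR, the following pure-Python sketch shows the three arrays of the format and a sparse-dense product over them. It is only an illustration of the data layout and its indirect indexing, written under the assumption of row-major dense operands; it is not the MKL or cuSPARSE routine used in the measurements, and the function names are hypothetical.

```python
def to_csr(A):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row's non-zeros
    return values, col_idx, row_ptr

def csr_matmul(csr, B):
    """Multiply a CSR matrix by a dense matrix B."""
    values, col_idx, row_ptr = csr
    n_cols = len(B[0])
    out = []
    for i in range(len(row_ptr) - 1):
        acc = [0.0] * n_cols
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: which row of B is read depends on col_idx[p].
            for j in range(n_cols):
                acc[j] += values[p] * B[col_idx[p]][j]
        out.append(acc)
    return out

# Hypothetical non-structured sparse weight matrix and dense feature matrix.
A = [[1.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
csr = to_csr(A)
product = csr_matmul(csr, B)
print(csr, product)
```

The data-dependent row lookup in the inner loop is the kind of scattered access that the text associates with poor locality, in contrast to the contiguous operands left after removing zero rows and columns.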
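Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012.

| # | Method | Top-1 err. | Statistics | conv1 | conv2 | conv3 | conv4 | conv5 |
|---|---|---|---|---|---|---|---|---|
| 1 | $\ell_1$ | 44.67% | sparsity | 67.6% | 92.4% | 97.2% | 96.6% | 94.3% |
| | | | CPU speedup | 0.80× | 2.91× | 4.84× | 3.83× | 2.76× |
| | | | GPU speedup | 0.25× | 0.52× | 1.38× | 1.04× | 1.36× |
| 2 | SSL | 44.66% | column sparsity | 0.0% | 63.2% | 76.9% | 84.7% | 80.7% |
| | | | row sparsity | 9.4% | 12.9% | 40.6% | 46.9% | 0.0% |
| | | | CPU speedup | 1.05× | 3.37× | 6.27× | 9.73× | 4.93× |
| | | | GPU speedup | 1.00× | 2.37× | 4.94× | 4.03× | 3.05× |
| 3 | pruning (Han et al. [2015a]) | 42.80% | sparsity | 16.0% | 62.0% | 65.0% | 63.0% | 63.0% |
| 4 | $\ell_1$ | 42.51% | sparsity | 14.7% | 76.2% | 85.3% | 81.5% | 76.3% |
| | | | CPU speedup | 0.34× | 0.99× | 1.30× | 1.10× | 0.93× |
| | | | GPU speedup | 0.08× | 0.17× | 0.42× | 0.30× | 0.32× |
| 5 | SSL | 42.53% | column sparsity | 0.00% | 20.9% | 39.7% | 39.7% | 24.6% |
| | | | CPU speedup | 1.00× | 1.27× | 1.64× | 1.68× | 1.32× |
| | | | GPU speedup | 1.00× | 1.25× | 1.63× | 1.72× | 1.36× |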
Table 4 compares the obtained sparsity and speedups of the $\ell_1$-norm and SSL on CPU (Intel Xeon) and GPU (GeForce GTX TITAN Black) under approximately the same errors, i.e., with acceptable or no accuracy loss. For a fair comparison, after $\ell_1$-norm regularization, the DNN is also fine-tuned by disconnecting all zero-weighted connections, so that 1.39% accuracy is recovered for AlexNet 1. Our experiments show that the DNNs require a very high non-structured sparsity to achieve a reasonable speedup (when the sparsity is low, the result is even a slowdown, i.e., a speedup below 1×). SSL, however, always achieves positive speedups. With an acceptable accuracy loss, SSL achieves on average 5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively. In contrast, the $\ell_1$-norm achieves on average only 3.0× and 0.9× layer-wise acceleration on CPU and GPU, respectively. We note that at the same accuracy, our average speedup is indeed higher than that of Liu et al. [2015], which adopts heavy hardware customization to overcome the negative impact of non-structured sparsity. Figure 7 shows the speedups of the $\ell_1$-norm and SSL on various platforms, including both GPU (Quadro, Tesla and Titan) and CPU (Intel Xeon E5-2630). SSL achieves a consistent average speedup on GPU, while non-structured sparsity obtains no speedup on GPU platforms. On CPU platforms, both methods achieve good speedups, and the benefit grows as the processors become weaker. Nonetheless, SSL on average always outperforms non-structured sparsity.
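The average layer-wise speedups quoted above can be reproduced from the per-layer values of AlexNet 1 and AlexNet 2 in Table 4. The dictionary keys are illustrative labels; the numbers are copied from the table.

```python
from statistics import mean

# Per-layer speedups (conv1..conv5) for AlexNet 1 (l1-norm) and AlexNet 2 (SSL).
speedups = {
    ("l1", "CPU"): [0.80, 2.91, 4.84, 3.83, 2.76],
    ("l1", "GPU"): [0.25, 0.52, 1.38, 1.04, 1.36],
    ("SSL", "CPU"): [1.05, 3.37, 6.27, 9.73, 4.93],
    ("SSL", "GPU"): [1.00, 2.37, 4.94, 4.03, 3.05],
}

averages = {key: round(mean(vals), 1) for key, vals in speedups.items()}
print(averages)
```

The arithmetic means round to 3.0× and 0.9× for the $\ell_1$-norm and to 5.1× and 3.1× for SSL, matching the figures in the text.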
5 Conclusion
In this work, we have proposed a Structured Sparsity Learning (SSL) method to regularize the filter, channel, filter shape, and depth structures in deep neural networks (DNNs). Our method drives the DNN to dynamically learn more compact structures without accuracy loss. The structured compactness of the DNN achieves significant speedups for DNN evaluation on both CPU and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be used as structure regularization to improve the classification accuracy of state-of-the-art DNNs.
Acknowledgments
This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank Drs. Sheng Li and Jongsoo Park for valuable feedback on this work.
References
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105. 2012.
Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2015.
He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
Liu et al. [2015] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Han et al. [2015a] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143. 2015a.
Han et al. [2015b] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015b.
Denil et al. [2013] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156. 2013.
Denton et al. [2014] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277. 2014.
Jaderberg et al. [2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
Ioannou et al. [2015] Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification. arXiv preprint arXiv:1511.06744, 2015.
Tai et al. [2015] Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
Yuan and Lin [2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 68(1):49–67, 2006.
Kim and Xing [2010] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Feng and Darrell [2015] Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), 2015.
Srivastava et al. [2015] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Chetlur et al. [2014] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.