CondenseNet
CondenseNet: Lightweight CNN for mobile devices
Deep neural networks are increasingly used on mobile devices, where computational resources are limited. In this paper we develop CondenseNet, a novel network architecture with unprecedented efficiency. It combines dense connectivity between layers with a mechanism to remove unused connections. The dense connectivity facilitates feature reuse in the network, whereas learned group convolutions remove connections between layers for which this feature reuse is superfluous. At test time, our model can be implemented using standard group convolutions, allowing for efficient computation in practice. Our experiments demonstrate that CondenseNets are much more efficient than state-of-the-art compact convolutional networks such as MobileNets and ShuffleNets.
The high accuracy of convolutional networks (CNNs) in visual recognition tasks, such as image classification [38, 12, 19], has fueled the desire to deploy these networks on platforms with limited computational resources, e.g., in robotics, self-driving cars, and on mobile devices. Unfortunately, the most accurate deep CNNs, such as the winners of the ImageNet [6] and COCO [31] challenges, were designed for scenarios in which computational resources are abundant. As a result, these models cannot be used to perform real-time inference on low-compute devices. This problem has fueled the development of computationally efficient CNNs that, e.g., prune redundant connections [27, 11, 29, 32, 9], use low-precision or quantized weights [21, 36, 4], or use more efficient network architectures [22, 12, 19, 5, 16, 47]. These efforts have led to substantial improvements: to achieve accuracy comparable to that of VGG [38] on ImageNet, ResNets [12], DenseNets [19], MobileNets [16], and ShuffleNets [47] reduce the amount of computation by successively larger factors. A typical setup for deep learning on mobile devices is one where CNNs are trained on multi-GPU machines but deployed on devices with limited compute. Therefore, a good network architecture allows for fast parallelization during training, but is compact at test time.
Recent work [4, 20] shows that there is a lot of redundancy in CNNs. The layer-by-layer connectivity pattern forces networks to replicate features from earlier layers throughout the network. The DenseNet architecture [19] alleviates the need for feature replication by directly connecting each layer with all layers before it, which induces feature reuse. Although more efficient, we hypothesize that dense connectivity introduces redundancies when early features are not needed in later layers. We propose a novel method to prune such redundant connections between layers and then introduce a more efficient architecture. In contrast to prior pruning methods, our approach learns a sparsified network automatically during the training process, and produces a regular connectivity pattern that can be implemented efficiently using group convolutions. Specifically, we split the filters of a layer into multiple groups, and gradually remove the connections to less important features per group during training. Importantly, the groups of incoming features are not predefined, but learned. The resulting model, named CondenseNet, can be trained efficiently on GPUs, and has high inference speed on mobile devices.
Our image-classification experiments show that CondenseNets consistently outperform alternative network architectures. Compared to DenseNets, CondenseNets use only a fraction of the computation at comparable accuracy levels. On the ImageNet dataset [6], a CondenseNet with 275 million FLOPs (throughout the paper, FLOPs refers to the number of multiplication-addition operations) achieved a 29% top-1 error, which is comparable to the error of a MobileNet that requires twice as much compute.
We first review related work on model compression and efficient network architectures, which inspire our work. Next, we review the DenseNets and group convolutions that form the basis for CondenseNet.
Weight pruning and quantization. CondenseNets are closely related to approaches that improve the inference efficiency of (convolutional) networks via weight pruning [27, 11, 29, 32, 14] and/or weight quantization [21, 36]. These approaches are effective because deep networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing (and sometimes even improving) accuracy. For convolutional networks, different pruning techniques may lead to different levels of granularity [34]. Fine-grained pruning, e.g., independent weight pruning [27, 10], generally achieves a high degree of sparsity. However, it requires storing a large number of indices, and relies on special hardware/software accelerators. In contrast, coarse-grained pruning methods such as filter-level pruning [29, 1, 32, 14] achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations.
CondenseNets also rely on a pruning technique, but differ from prior approaches in two main ways: First, the weight pruning is initiated in the early stages of training, which is substantially more effective and efficient than applying sparsity-inducing regularization throughout training. Second, CondenseNets have a higher degree of sparsity than filter-level pruning, yet generate highly efficient group convolutions, reaching a sweet spot between sparsity and regularity.
Efficient network architectures. A range of recent studies has explored efficient convolutional networks that can be trained end-to-end [19, 46, 16, 47, 49, 22, 48]. Three prominent examples of networks that are sufficiently efficient to be deployed on mobile devices are MobileNet [16], ShuffleNet [47], and Neural Architecture Search (NAS) networks [49]. All these networks use depthwise separable convolutions, which greatly reduce computational requirements without significantly reducing accuracy. A practical downside of these networks is that depthwise separable convolutions are not (yet) efficiently implemented in most deep-learning platforms. By contrast, CondenseNet uses the well-supported group convolution operation [25], leading to better computational efficiency in practice.
Architecture-agnostic efficient inference has also been explored by several prior studies. For example, knowledge distillation [3, 15] trains small “student” networks to reproduce the output of large “teacher” networks in order to reduce test-time costs. Dynamic inference methods [2, 8, 7, 17] adapt the inference to each specific test example, skipping units or even entire layers to reduce computation. We do not explore such approaches here, but believe they can be used in conjunction with CondenseNets.
Densely connected networks (DenseNets; [19]) consist of multiple dense blocks, each of which consists of multiple layers. Each layer produces k features, where k is referred to as the growth rate of the network. The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block. Each layer performs a sequence of consecutive transformations, as shown in the left part of Figure 1. The first transformation (BN-ReLU, blue) is a composition of batch normalization [23] and a rectified linear unit (ReLU) [35]. The first convolutional layer in the sequence reduces the number of channels to save computational cost by using 1×1 filters. The output is followed by another BN-ReLU transformation and is then reduced to the final k output features through a 3×3 convolution.

Group convolution is a special case of a sparsely connected convolution, as illustrated in Figure 2. It was first used in the AlexNet architecture [25], and has more recently been popularized by its successful application in ResNeXt [43]. Standard convolutional layers (left illustration in Figure 2) generate O output features by applying a convolutional filter (one per output) over all R input features, leading to a computational cost of R·O. In comparison, group convolution (right illustration) reduces this computational cost by partitioning the input features into G mutually exclusive groups, each producing its own outputs, which reduces the computational cost by a factor of G to R·O/G.
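Since a 1×1 convolution is a matrix product over channels, a group convolution can be sketched as G independent smaller products. Below is a minimal NumPy illustration of the R·O versus R·O/G cost (shapes are illustrative, not from the paper):

```python
import numpy as np

def conv1x1(x, W):
    """Standard 1x1 convolution: W has shape (O, R), x has shape (R, H, W)."""
    R, H, Wd = x.shape
    return (W @ x.reshape(R, -1)).reshape(-1, H, Wd)

def group_conv1x1(x, Ws):
    """Group 1x1 convolution: split the R input channels into G groups,
    apply one small (O/G, R/G) filter matrix per group, and concatenate."""
    G = len(Ws)
    xs = np.split(x, G, axis=0)  # G chunks of R/G input channels each
    return np.concatenate([conv1x1(xg, Wg) for xg, Wg in zip(xs, Ws)], axis=0)

# Cost: the standard conv does R*O multiply-adds per pixel; the grouped
# version does G * (R/G) * (O/G) = R*O/G, i.e. a factor-G reduction.
R, O, G = 8, 8, 4
x = np.random.randn(R, 5, 5)
Ws = [np.random.randn(O // G, R // G) for _ in range(G)]
y = group_conv1x1(x, Ws)
assert y.shape == (O, 5, 5)
```

Note that the grouped version is exactly equivalent to a standard convolution whose weight matrix is block-diagonal, which is what makes the sparsity pattern hardware-friendly.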
Group convolution works well with many deep neural network architectures [43, 47, 46] that are connected in a layer-by-layer fashion. For dense architectures, group convolution can be used in the 3×3 convolutional layer (see Figure 1, left). However, preliminary experiments show that a naïve adaptation of group convolutions in the 1×1 convolutional layer leads to drastic reductions in accuracy. We surmise that this is caused by the fact that the inputs to the 1×1 convolutional layer are concatenations of feature maps generated by preceding layers. Therefore, they differ in two ways from typical inputs to convolutional layers: 1. they have an intrinsic order; and 2. they are far more diverse. The hard assignment of these features to disjoint groups hinders effective reuse of features in the network. Experiments in which we randomly permute input feature maps in each layer before performing the group convolution show that this permutation reduces the negative impact on accuracy, but even with the random permutation, group convolution in the 1×1 convolutional layer makes DenseNets less accurate than, for example, smaller DenseNets with equivalent computational cost.
It is shown in [19] that making early features available as inputs to later layers is important for efficient feature reuse. Although not all prior features are needed at every subsequent layer, it is hard to predict which features should be utilized at what point. To address this problem, we develop an approach that learns the input feature groupings automatically during training. Learning the group structure allows each filter group to select its own set of most relevant inputs. Further, we allow multiple groups to share input features, and we allow features to be ignored by all groups. Note that in a DenseNet, even if an input feature is ignored by all groups in a specific layer, it can still be utilized by some groups at different layers. To differentiate it from regular group convolutions, we refer to our approach as learned group convolution.
We learn group convolutions through a multi-stage process, illustrated in Figures 3 and 4. The first half of the training iterations comprises the condensing stages. Here, we repeatedly train the network with sparsity-inducing regularization for a fixed number of iterations, and subsequently prune away unimportant filters with low-magnitude weights. The second half of the training consists of the optimization stage, in which we learn the filters after the groupings are fixed. When performing the pruning, we ensure that filters from the same group share the same sparsity pattern. As a result, the sparsified layer can be implemented using a standard group convolution once training is completed (testing stage). Because group convolutions are efficiently implemented by many deep-learning libraries, this leads to high computational savings both in theory and in practice. We present details on our approach below.
We start with a standard convolution whose filter weights form a 4D tensor of size O×R×W×H, where O, R, W, and H denote the number of output channels, the number of input channels, and the width and the height of the filter kernels, respectively. As we are focusing on the 1×1 convolutional layer in DenseNets, the 4D tensor reduces to an O×R matrix F. We consider this simplified case in this paper, but our procedure can readily be used with larger convolutional kernels. Before training, we first split the filters (or, equivalently, the output features) into G groups of equal size. We denote the filter weights for these groups by F^1, ..., F^G; each F^g has size (O/G)×R, and F^g_{ij} corresponds to the weight of the jth input for the ith output within group g. Because the output features do not have an implicit ordering, this random grouping does not negatively affect the quality of the layer. During the training process we gradually screen out subsets of less important input features for each group. The importance of the jth incoming feature map for the filter group g is evaluated by the averaged absolute value of the weights between them across all outputs within the group, i.e., by \sum_{i=1}^{O/G} |F^g_{i,j}|. In other words, we remove columns in F^g (by zeroing them out) if their L1-norm is small compared to the L1-norm of other columns. This results in a convolutional layer that is structurally sparse: filters from the same group always receive the same set of features as input.
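This column-wise importance criterion is simple to implement. A NumPy sketch for a single filter group (the group shape and keep fraction below are illustrative):

```python
import numpy as np

def prune_columns(Fg, keep_frac):
    """Zero out the least important input columns of one filter group.

    Fg: weight matrix of one group, shape (O/G, R). The importance of
    input j is the summed absolute weight of its column; the columns with
    the smallest L1 norm are zeroed, so all filters in the group end up
    using the same subset of inputs (structured sparsity)."""
    importance = np.abs(Fg).sum(axis=0)          # one score per input feature
    n_keep = int(round(keep_frac * Fg.shape[1]))
    keep = np.argsort(importance)[-n_keep:]      # indices of kept inputs
    mask = np.zeros(Fg.shape[1], dtype=bool)
    mask[keep] = True
    return Fg * mask, mask

Fg = np.array([[0.9, 0.01, -0.5, 0.02],
               [0.8, 0.02,  0.6, 0.01]])
pruned, mask = prune_columns(Fg, keep_frac=0.5)
assert mask.tolist() == [True, False, True, False]  # low-L1 columns removed
```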
To reduce the negative effects on accuracy introduced by weight pruning, regularization is commonly used to induce sparsity [29, 32]. In CondenseNets, we encourage convolutional filters from the same group to use the same subset of incoming features, i.e., we induce group-level sparsity instead. To this end, we use the following group-lasso regularizer [44] during training:

\sum_{g=1}^{G} \sum_{j=1}^{R} \sqrt{\sum_{i=1}^{O/G} \left(F^{g}_{i,j}\right)^2}.
The group-lasso regularizer simultaneously pushes all the elements of a column of F^g to zero, because the term in the square root is dominated by the largest elements in that column. This induces the group-level sparsity we aim for.
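A direct transcription of this regularizer, which sums the Euclidean norms of the columns of each group's weight matrix:

```python
import numpy as np

def group_lasso(groups):
    """Group-lasso penalty: for each group matrix F^g of shape (O/G, R),
    sum the Euclidean norms of its columns. Shrinking a column's norm
    drives the whole column toward zero, giving group-level sparsity."""
    return sum(np.sqrt((Fg ** 2).sum(axis=0)).sum() for Fg in groups)

# A column of all zeros contributes nothing; a nonzero column contributes
# its Euclidean norm, so the penalty prefers entirely empty columns.
F1 = np.array([[3.0, 0.0],
               [4.0, 0.0]])
assert group_lasso([F1]) == 5.0  # sqrt(3^2 + 4^2) + 0
```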
In addition to the fact that learned group convolutions are able to automatically discover good connectivity patterns, they are also more flexible than standard group convolutions. In particular, the proportion of feature maps used by a group does not necessarily need to be 1/G. We define a condensation factor C, which may differ from G, and allow each group to select R/C of the inputs.
In contrast to approaches that prune weights in pre-trained networks, our weight pruning process is integrated into the training procedure. As illustrated in Figure 3, at the end of each of the C−1 condensing stages we prune 1/C of the filter weights. By the end of training, only 1/C of the weights remain in each filter group. In all our experiments we set the number of training epochs of the condensing stages to M/(2(C−1)), where M denotes the total number of training epochs, such that the first half of the training epochs is used for condensing. In the second half of the training process, the optimization stage, we train the sparsified model. (In our implementation of the training procedure we do not actually remove the pruned weights, but instead mask the filter by a binary tensor of the same size using an element-wise product. The mask is initialized with only ones, and elements corresponding to pruned weights are set to zero. This implementation via masking is more efficient on GPUs, as it does not require sparse matrix operations. In practice, the pruning hardly increases the wall time needed to perform a forward-backward pass during training.) We adopt the cosine-shaped learning rate schedule of Loshchilov et al. [33], which smoothly anneals the learning rate, and usually leads to improved accuracy [18, 49]. Figure 4 visualizes the learning rate as a function of the training epoch (in magenta), and the corresponding training loss (blue curve) of a CondenseNet trained on the CIFAR-10 dataset [24]. The abrupt increase in the loss at epoch 150 is caused by the final condensation operation, which removes half of the remaining weights. However, the plot shows that the model gradually recovers from this pruning step in the optimization stage.
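The schedule mechanics can be sketched compactly; the snippet below assumes C−1 equal-length condensing stages that together fill the first half of training, and uses the initial learning rate and epoch counts from our experiments for illustration:

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.1):
    """Cosine-shaped annealing: starts at lr0 and decays smoothly to 0."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))

def condensing_epochs(total_epochs, C):
    """Epochs per condensing stage, assuming the C-1 condensing stages
    together occupy the first half of training."""
    return total_epochs / (2 * (C - 1))

assert cosine_lr(0, 300) == 0.1            # full rate at the start
assert abs(cosine_lr(300, 300)) < 1e-12    # annealed to ~0 at the end
assert condensing_epochs(300, C=4) == 50   # 3 stages x 50 epochs = half of training
```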
After training we remove the pruned weights and convert the sparsified model into a network with a regular connectivity pattern that can be efficiently deployed on devices with limited computational power. For this reason we introduce an index layer that implements the feature selection and rearrangement operation (see Figure 3, right). The convolutional filters in the output of the index layer are rearranged to be amenable to existing (and highly optimized) implementations of regular group convolution. Figure 1 shows the transformations of the layers during training (middle) and during testing (right). During training the 1×1 convolution is a learned group convolution (L-Conv), but during testing, with the help of the index layer, it becomes a standard group convolution (G-Conv).

In addition to the use of the learned group convolutions introduced above, we make two changes to the regular DenseNet architecture. These changes are designed to further simplify the architecture and improve its computational efficiency. Figure 5 illustrates the two changes that we made to the DenseNet architecture.
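At test time, the index layer amounts to a gather over input channels followed by a regular group convolution. A minimal NumPy sketch for the 1×1 case (the shapes and index sets below are illustrative):

```python
import numpy as np

def index_and_group_conv(x, indices, Ws):
    """Test-time evaluation of a learned group convolution.

    indices: for each group g, the input channels that group selected
    during training (groups may overlap, and some channels may be selected
    by no group). The index layer gathers and re-orders the input so that
    each group's chosen channels are contiguous; the rest is a regular
    group convolution (here 1x1, i.e. a per-group matrix product)."""
    outputs = []
    for idx, Wg in zip(indices, Ws):
        xg = x[idx]                                   # gather selected channels
        yg = (Wg @ xg.reshape(len(idx), -1)).reshape(-1, *x.shape[1:])
        outputs.append(yg)
    return np.concatenate(outputs, axis=0)

x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)  # 4 input channels
indices = [[0, 2], [2, 3]]    # channel 2 is shared; channel 1 is unused
Ws = [np.eye(2), np.eye(2)]   # identity weights, to make the gather visible
y = index_and_group_conv(x, indices, Ws)
assert y.shape == (4, 2, 2)
assert np.allclose(y[1], y[2])  # both groups received a copy of channel 2
```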
The original DenseNet design adds k new feature maps at each layer, where k is a constant referred to as the growth rate. As shown in [19], deeper layers in a DenseNet tend to rely on high-level features more than on low-level features. This motivates us to improve the network by strengthening short-range connections. We found that this can be achieved by gradually increasing the growth rate as the depth grows. This increases the proportion of features coming from later layers relative to those from earlier layers. For simplicity, we set the growth rate to k = 2^{m−1} k_0, where m is the index of the dense block and k_0 is a constant. This way of setting the growth rate does not introduce any additional hyper-parameters. The “increasing growth rate” (IGR) strategy places a larger proportion of parameters in the later layers of the model. This increases the computational efficiency substantially, but may decrease the parameter efficiency in some cases. Depending on the specific hardware limitations, it may be advantageous to trade off one for the other [22].
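As an illustration, assuming the exponential doubling rule k = 2^(m−1)·k_0 for dense-block index m (k_0 a base growth rate), the per-block growth rates are:

```python
def growth_rates(k0, num_blocks):
    """Exponentially increasing growth rate: dense block m (1-indexed)
    adds k = 2**(m-1) * k0 feature maps per layer, so later blocks
    contribute a larger share of the features without introducing any
    extra hyper-parameters beyond the base rate k0."""
    return [2 ** (m - 1) * k0 for m in range(1, num_blocks + 1)]

assert growth_rates(8, 3) == [8, 16, 32]
```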
To encourage feature reuse even more than the original DenseNet architecture already does, we connect input layers to all subsequent layers in the network, even if these layers are located in different dense blocks (see Figure 5). As dense blocks have different feature resolutions, we downsample feature maps with higher resolutions using average pooling when we use them as inputs into lower-resolution layers.
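The resolution matching for these cross-block connections only requires average pooling. A NumPy sketch of 2×2 average pooling with stride 2 (the kernel size here is an assumption for illustration):

```python
import numpy as np

def avg_pool2x2(x):
    """2x2 average pooling with stride 2 on a (C, H, W) feature map,
    used to downsample early feature maps before feeding them into
    lower-resolution dense blocks (H and W are assumed even here)."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

x = np.ones((3, 8, 8))
y = avg_pool2x2(x)
assert y.shape == (3, 4, 4)
assert np.allclose(y, 1.0)  # averaging a constant map leaves values unchanged
```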
We evaluate CondenseNets on the CIFAR-10, CIFAR-100 [24], and ImageNet (ILSVRC 2012; [6]) image-classification datasets. The models and code reproducing our experiments are publicly available at https://github.com/ShichenLiu/CondenseNet.
The CIFAR-10 and CIFAR-100 datasets consist of RGB images of size 32×32 pixels, corresponding to 10 and 100 classes, respectively. Both datasets contain 50,000 training images and 10,000 test images. We use a standard data-augmentation scheme [30, 37, 28, 39, 41, 20, 26], in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce 32×32 images, and horizontally mirrored with probability 0.5.

The ImageNet dataset comprises 1000 visual classes, and contains a total of 1.2 million training images and 50,000 validation images. We adopt the data-augmentation scheme of [12] at training time, and at test time perform a rescaling followed by a center crop before feeding the input image into the networks.
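The CIFAR augmentation (pad 4, random 32×32 crop, random horizontal flip) is easy to express directly. A NumPy sketch, assuming the standard mirroring probability of 0.5:

```python
import numpy as np

def augment(img, rng):
    """Standard CIFAR augmentation: zero-pad 4 pixels per side, take a
    random 32x32 crop, and mirror horizontally with probability 0.5.
    img has shape (3, 32, 32)."""
    padded = np.pad(img, ((0, 0), (4, 4), (4, 4)))  # zero padding
    top, left = rng.integers(0, 9, size=2)          # crop offset in [0, 8]
    crop = padded[:, top:top + 32, left:left + 32]
    if rng.random() < 0.5:
        crop = crop[:, :, ::-1]                     # horizontal mirror
    return crop

rng = np.random.default_rng(0)
out = augment(np.ones((3, 32, 32)), rng)
assert out.shape == (3, 32, 32)
```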
We first perform a set of experiments on CIFAR-10 and CIFAR-100 to validate the effectiveness of learned group convolutions and the proposed CondenseNet architecture.
Unless otherwise specified, we use the following network configurations in all experiments on the CIFAR datasets. The standard DenseNet has a constant growth rate following [19]; our proposed architecture uses increasing growth rates, chosen to ensure that the growth rate is divisible by the number of groups. The learned group convolution is only applied to the first convolutional layer (with filter size 1×1, see Figure 1) of each basic layer, with a condensation factor of C = 4, i.e., 75% of the filter weights are gradually pruned during training with a step of 25%. The 3×3 convolutional layers are replaced by standard group convolutions (without applying learned group convolution) with four groups. Following [47, 46], we permute the output channels of the first learned group convolutional layer, such that the features generated by each of its groups are evenly used by all the groups of the subsequent group convolutional layer.
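One common way to implement such a permutation is the reshape-and-transpose "channel shuffle" popularized by [47]; a NumPy sketch:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Permute channels so that the features produced by each group of the
    previous layer are spread evenly over the groups of the next layer.
    x has shape (C, H, W) with C divisible by `groups`."""
    C, H, W = x.shape
    return (x.reshape(groups, C // groups, H, W)
             .transpose(1, 0, 2, 3)
             .reshape(C, H, W))

# With 2 groups of 2 channels, channel order [0, 1, 2, 3] becomes [0, 2, 1, 3]:
x = np.arange(4, dtype=float).reshape(4, 1, 1)
y = channel_shuffle(x, groups=2)
assert y.ravel().tolist() == [0.0, 2.0, 1.0, 3.0]
```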
We train all models with stochastic gradient descent (SGD) using optimization hyper-parameters similar to those in [12, 19]. Specifically, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and use a weight decay of 10^{-4}. All models are trained with mini-batch size 64 for 300 epochs, unless otherwise specified. We use a cosine-shaped learning rate which starts from 0.1 and gradually reduces to 0. Dropout [40] with a small drop rate was applied when training the larger CondenseNets (shown in Table 1).

Figure 6 compares the computational efficiency gains obtained by each component of CondenseNet: learned group convolution (LGC), exponentially increasing growth rate (IGR), and full dense connectivity (FDC). Specifically, the figure plots the test error as a function of the number of FLOPs (i.e., multiply-addition operations). The large gap between the two red curves with dot markers shows that learned group convolution significantly improves the efficiency of our models. Compared to DenseNets, CondenseNet requires only half the number of FLOPs to achieve comparable accuracy. Further, we observe that the exponentially increasing growth rate yields even further efficiency gains. Full dense connectivity does not boost the efficiency significantly on CIFAR-10, but there does appear to be a trend that as models get larger, full connectivity starts to help. We opt to include this architecture change in the CondenseNet model, as it does lead to substantial improvements on ImageNet (see below).
In Table 1, we show the results of experiments comparing a 160-layer and a 182-layer CondenseNet with alternative state-of-the-art CNN architectures. Following [49], our models were trained for 600 epochs. From the results, we observe that CondenseNet requires substantially fewer parameters and FLOPs to achieve accuracy comparable to that of DenseNet-190. The 182-layer model seems to be less parameter-efficient than the 160-layer model, but is more compute-efficient. Somewhat surprisingly, our models perform on par with NASNet-A, an architecture that was obtained using an automated search procedure over candidate architectures composed of a rich set of components, and that is thus carefully tuned on the CIFAR-10 dataset [49]. Moreover, CondenseNet does not use depthwise separable convolutions, and only uses simple convolutional filters with sizes 1×1 and 3×3. It may be possible to include CondenseNet as a meta-architecture in the procedure of [49] to obtain even more efficient networks.
Model  Params  FLOPs  C10  C100
ResNet-1001 [13]  16.1M  2,357M  4.62  22.71
StochasticDepth-1202 [20]  19.4M  2,840M  4.91  –
WideResNet-28 [45]  36.5M  5,248M  4.00  19.25
ResNeXt-29 [43]  68.1M  10,704M  3.58  17.31
DenseNet-190 [19]  25.6M  9,388M  3.46  17.18
NASNet-A [49]  3.3M  –  3.41  –
CondenseNet-160  3.1M  1,084M  3.46  17.55
CondenseNet-182  4.2M  513M  3.76  18.47
In Table 2, we compare our CondenseNets with models that are obtained by state-of-the-art filter-level weight pruning techniques [29, 32, 14]. The results show that, in general, CondenseNet is substantially more efficient in terms of FLOPs than ResNets or DenseNets pruned by the method introduced in [32]. The advantage over the other pruning techniques is even more pronounced. We also report the results for CondenseNet-94 in the second-to-last row of Table 2. It uses only half the number of parameters to achieve performance comparable to that of the most competitive baseline, the 40-layer DenseNet described by [32].
Model  FLOPs  Params  C10  C100
VGG-16-pruned [29]  206M  5.40M  6.60  25.28
VGG-19-pruned [32]  195M  2.30M  6.20  –
VGG-19-pruned [32]  250M  5.00M  –  26.52
ResNet-56-pruned [14]  62M  –  8.20  –
ResNet-56-pruned [29]  90M  0.73M  6.94  –
ResNet-110-pruned [29]  213M  1.68M  6.45  –
ResNet-164-B-pruned [32]  124M  1.21M  5.27  23.91
DenseNet-40-pruned [32]  190M  0.66M  5.19  25.28
CondenseNet-94  122M  0.33M  5.00  24.08
CondenseNet-86  65M  0.52M  5.00  23.64
Table 3: CondenseNet architecture for ImageNet. An initial convolution with stride 2 is followed by a sequence of dense blocks, separated by average pooling layers with stride 2; the network ends with a global average pooling layer and a 1000-dimensional fully-connected layer with softmax.
In a second set of experiments, we test CondenseNets on the ImageNet dataset.
Detailed network configurations are shown in Table 3. To reduce the number of parameters, we prune 50% of the weights from the fully-connected (FC) layer at epoch 60, in a way similar to the learned group convolution, but with G = 1 (as the FC layer cannot be split into multiple groups) and C = 2. Similar to prior studies on MobileNets and ShuffleNets, we focus on training relatively small models that require less than 600 million FLOPs to perform inference on a single image.
We train all models using stochastic gradient descent (SGD) with a batch size of 256. As before, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and a weight decay of 10^{-4}. All models are trained for 120 epochs, with a cosine-shaped learning rate which starts from 0.1 and gradually reduces to 0. We use the group-lasso regularizer in all experiments on ImageNet, with a fixed regularization weight.
Table 4 shows the results of CondenseNets and several state-of-the-art, efficient models on the ImageNet dataset. We observe that a CondenseNet with 274 million FLOPs obtains a 29.0% top-1 error, which is comparable to the accuracy achieved by MobileNets and ShuffleNets that require twice as much compute. A CondenseNet with 529 million FLOPs yields a 3% absolute reduction in top-1 error compared to a MobileNet and a ShuffleNet of comparable size. Our CondenseNet even achieves the same accuracy with slightly fewer FLOPs and parameters than the most competitive NASNet-A, despite the fact that we only trained a very small number of models (as opposed to the study that led to the NASNet-A model).
Table 5 shows the actual inference time on an ARM processor for different models. The wall-time needed to classify an image is highly correlated with the number of FLOPs of the model. Compared to the recently proposed MobileNet, our CondenseNet with 274 million FLOPs classifies an image roughly twice as fast, without sacrificing accuracy.
Model  FLOPs  Params  Top-1  Top-5
Inception V1 [42]  1,448M  6.6M  30.2  10.1
1.0 MobileNet-224 [16]  569M  4.2M  29.4  10.5
ShuffleNet 2x [47]  524M  5.3M  29.1  10.2
NASNet-A (N=4) [49]  564M  5.3M  26.0  8.4
NASNet-B (N=4) [49]  488M  5.3M  27.2  8.7
NASNet-C (N=3) [49]  558M  4.9M  27.5  9.0
CondenseNet  274M  2.9M  29.0  10.0
CondenseNet  529M  4.8M  26.2  8.3
Model  FLOPs  Top-1  Time (s)
VGG-16  15,300M  28.5  354
ResNet-18  1,818M  30.2  8.14
1.0 MobileNet-224 [16]  569M  29.4  1.96
CondenseNet  529M  26.2  1.89
CondenseNet  274M  29.0  0.99
We perform an ablation study on CIFAR-10 in which we investigate the effect of (1) the pruning strategy, (2) the number of groups, and (3) the condensation factor. We also investigate the stability of our weight pruning procedure.
The left panel of Figure 7 compares our on-the-fly pruning method with the more common approach of pruning the weights of fully converged models. We use a DenseNet as the basis for this experiment. We implement a “traditional” pruning method in which the weights are pruned in the same way as in CondenseNets, but the pruning is only done once after training has completed (for 300 epochs). Following [32], we fine-tune the resulting sparsely connected network for another 300 epochs with the same cosine-shaped learning rate that we use for training CondenseNets. We compare the traditional pruning approach with our approach, setting the number of groups G to 4. In both settings, we vary the condensation factor C.
The results in Figure 7 show that pruning weights gradually during training outperforms pruning weights in fully trained models. Moreover, gradual weight pruning reduces the training time: the “traditional pruning” models were trained for 600 epochs (including fine-tuning), whereas the CondenseNets were trained for 300 epochs. The results also show that removing 50% of the weights (by setting C = 2) from the 1×1 convolutional layers in a DenseNet incurs hardly any loss in accuracy.
In the middle panel of Figure 7, we compare four CondenseNets with exactly the same network architecture, but with a number of groups, G, that varies between 1 and 8. We fix the condensation factor, C, to 8 for all the models, which implies that all models have the same number of parameters after training has completed. In CondenseNets with a single group, we discard entire filters in the same way that is common in filter-pruning techniques [29, 32]. The results presented in the figure demonstrate that the test error tends to decrease as the number of groups increases. This result is in line with our analysis in Section 3; in particular, it suggests that grouping filters gives the training algorithm more flexibility to remove redundant weights.
In the right panel of Figure 7, we compare CondenseNets with varying condensation factors. Specifically, we set the condensation factor C to 1, 2, 4, or 8; this corresponds to removing 0%, 50%, 75%, or 87.5% of the weights from each of the 1×1 convolutional layers, respectively. A condensation factor of C = 1 corresponds to a baseline model without weight pruning. The number of groups, G, is set to 4 for all the networks. The results show that condensation factors larger than 1 consistently lead to improved efficiency, which underlines the effectiveness of our method. Interestingly, models with condensation factors 2, 4 and 8 perform comparably in terms of classification error as a function of FLOPs. This suggests that whilst pruning more weights yields smaller models, it also leads to a proportional loss in accuracy.
As our method removes redundant weights in early stages of the training process, a natural question is whether this will introduce extra variance into the training. Does early pruning remove some of the weights simply because they were initialized with small values?
To investigate this question, Figure 8 visualizes the learned weights and connections for three independently trained CondenseNets on CIFAR-10 (using different random seeds). The top row shows detailed weight strengths (the averaged absolute value of non-pruned weights) between a filter group of a certain layer (corresponding to a column in the figure) and an input feature map (corresponding to a row in the figure). For each layer there are four filter groups (consecutive columns). A white pixel indicates that a particular input feature was pruned by that layer and group. Following [19], the bottom row of Figure 8 shows the overall connection strength between two layers in the condensed network. The vertical bars correspond to the linear classification layer on top of the CondenseNet. The gray vertical dotted lines correspond to pooling layers that decrease the feature resolution.
The results in the figure suggest that while there are differences in the learned connectivity at the filter-group level (top row), the overall information flow between layers (bottom row) is similar for all three models. This suggests that the three training runs learn similar global connectivity patterns, despite starting from different random initializations. Later layers tend to prefer the most recently generated features, but do also utilize some features from very early layers.
In this paper, we introduced CondenseNet: an efficient convolutional network architecture that encourages feature reuse via dense connectivity and prunes filters associated with superfluous feature reuse via learned group convolutions. To make inference efficient, the pruned network can be converted into a network with regular group convolutions, which are implemented efficiently in most deep-learning libraries. Our pruning method is simple to implement, and adds only limited computational cost to the training process. In our experiments, CondenseNets outperform the recently proposed MobileNets and ShuffleNets in terms of computational efficiency at the same accuracy level. With a much simpler structure, CondenseNet even slightly outperforms a network architecture that was discovered by empirically trying tens of thousands of convolutional network architectures.
The authors are supported in part by grants from the National Science Foundation (III-1525919, IIS-1550179, IIS-1618134, S&AS-1724282, and CCF-1740822), the Office of Naval Research DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation. We are thankful for the generous support of SAP America Inc. We also thank Xu Zou, Weijia Chen, and Danlu Chen for helpful discussions.
Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.
ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
International Journal of Computer Vision, 115(3):211–252, 2015.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.